Wednesday, July 02, 2008

What is cloud computing?

I've been hearing about this thing called Cloud Computing, Hadoop etc. for some time now, and I wanted to know what all the fuss is about. As usual, my friend came to my help.

Me: Do you know what cloud computing is? If you know can you please explain it to me in simple terms?

Friend: Sure. In real-world terms, imagine we had one government for the whole world. It would be quite hard for it to manage everything, so the world is split into countries, states and districts, small chunks that are manageable. In the same way, thousands of cheap CPUs processing a massive amount of data, managed by an admin computer, is cloud computing. Hadoop, MapReduce and Amazon Cluster are all different implementations of cloud computing.

Me: Who maintains the cheap CPUs?

Friend: They won’t need much maintenance. They are cheap hardware available in the market; you put them in a rack and that’s it. If something fails, the central cloud managing process will probably alert the system admin and move the task elsewhere. Distributed computing is what it is all about.
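To make the "take the task elsewhere" idea concrete, here is a minimal Python sketch. It is only an illustration, not how any real cluster manager works: the function `run_with_failover`, the node names and the two fake workers are all made up for this example.

```python
def run_with_failover(task, workers):
    """Try the task on each worker in turn.

    Returns (worker_name, result, failures). All names here are
    hypothetical; a real cluster manager is far more involved.
    """
    failures = []
    for name, worker in workers:
        try:
            return name, worker(task), failures
        except Exception as exc:
            # This is the point where the system admin would be alerted.
            failures.append((name, str(exc)))
    raise RuntimeError(f"all workers failed: {failures}")


def broken(task):
    # Stand-in for a machine whose hardware has failed.
    raise OSError("disk died")


def healthy(task):
    # Stand-in for a machine that does the work.
    return task.upper()


name, result, failures = run_with_failover(
    "index page", [("node-1", broken), ("node-2", healthy)]
)
print(name, result, failures)
```

The first machine fails, the failure is recorded, and the same task simply runs on the next machine.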

Me: But then that’s what a normal server also does, right? Servers at Google, Verio and Media3 are also groups of many computers, right?

Friend: Yes they are. But they all do the same task.

OK, think of a task like this: your average Media3 server is good at receiving a request and serving it. That’s all it does, and that’s what it is supposed to do. If you ask the same server to fetch web pages from the internet, do some work on them and index them, it can still do it, maybe 10,000 pages a day. But when you are talking about trillions of web pages, it will be at it all day, all year, still doing that very basic task.

Let’s assume it has to index the title tags on all the web pages. It will take too long to complete because it is doing the work in a single thread. Now add another computer, that is, split the task between 2 computers. Suddenly the task finishes in half the original time plus a little extra. That extra time is for managing the tasks, i.e. the computers have to talk to each other to make sure they don’t repeat work, and so on.
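To put rough numbers on "half the time plus a little extra", here is a small Python sketch. The 10,000 pages a day figure is from the conversation; the one million page total, the one-day coordination cost and the function name `estimated_days` are assumptions picked purely for illustration, and real overhead rarely behaves this neatly.

```python
def estimated_days(total_pages, pages_per_day, machines, overhead_days=0.0):
    """Rough estimate: work divides evenly, plus a fixed coordination cost.

    The fixed-overhead model is an assumption made for this sketch.
    """
    if machines < 1:
        raise ValueError("need at least one machine")
    ideal = total_pages / (pages_per_day * machines)
    # A single machine has nobody to coordinate with, so no overhead.
    return ideal + (overhead_days if machines > 1 else 0.0)


one = estimated_days(1_000_000, 10_000, machines=1)
two = estimated_days(1_000_000, 10_000, machines=2, overhead_days=1.0)
print(one, two)  # prints: 100.0 51.0
```

So under these assumptions one machine needs 100 days, while two machines need 50 days plus the extra day spent coordinating.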

Me: cool

Friend: Now if you add one more, and another, and keep on adding until you have, say, 10,000 computers doing it all day for you, imagine the time you save. Each of these machines should cost you about $500 at most, and sometimes even less. There are many problems in this world that can be addressed with this kind of solution. This is something very close to a supercomputer. A supercomputer is capable of parallel processing, and that’s what you are achieving here by using lots of cheap hardware to do parallel processing.

Me: So what is Hadoop?

Friend: It’s a framework, i.e. a set of rules by which a group of computers work together. When you give it a task and tell it to run on a massive amount of data, it sets off to work, instructing all its workers to get on with it.
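For the curious, the map and reduce idea behind frameworks like Hadoop can be sketched in a few lines of Python. This is not Hadoop's API (Hadoop itself is written in Java) and everything runs in a single process here; the function names and the sample pages are invented for illustration, using the title-indexing task from earlier.

```python
from collections import defaultdict


def map_page(page):
    """Map step: emit (word, url) pairs for each word in a page title."""
    url, title = page
    return [(word.lower(), url) for word in title.split()]


def reduce_word(word, urls):
    """Reduce step: collapse all urls seen for one word into a sorted list."""
    return word, sorted(set(urls))


def run_job(pages):
    grouped = defaultdict(list)
    # In a real cluster each worker would map its own share of the pages.
    for page in pages:
        for word, url in map_page(page):
            grouped[word].append(url)  # the "shuffle": group values by key
    return dict(reduce_word(word, urls) for word, urls in grouped.items())


pages = [("a.example", "Cloud Computing"), ("b.example", "Cloud Storage")]
index = run_job(pages)
print(index["cloud"])  # prints: ['a.example', 'b.example']
```

The point of the split is that the map step needs nothing but its own page, so it can run on as many machines as you have.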

Me: Who owns Hadoop? I.e., can Google have a Hadoop-based parallel processing computer? Or is Hadoop the name of a parallel processing system owned by someone, which others can buy? Sorry if my questions are stupid.

Friend: No No

Friend: Hadoop is Apache's free, open source program. You can use it on 2 computers, 10 or a thousand. Using it on just one computer does not make much sense :-) It's open source, like the Apache web server.

Me: Oooh so it’s like those multi-threaded download managers...

Friend: Exactly

Me: We tell an admin machine/program to do something, and the admin distributes the work to multiple machines, gathers the output and hands it back.

Friend: Yes correct
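That admin-and-workers pattern can be tried out on a single computer. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`, with threads standing in for separate machines (much like the download manager analogy); the `admin` and `count_words` functions and the sample titles are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Worker: do the task on one chunk of the data."""
    return sum(len(title.split()) for title in chunk)


def admin(titles, workers=4):
    """Admin: split the data, hand out chunks, gather and combine results."""
    # Deal the titles out round-robin, one chunk per worker.
    chunks = [titles[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return sum(partials)


titles = ["What is cloud computing", "Hadoop in a nutshell", "Cheap hardware"]
total = admin(titles, workers=2)
print(total)  # prints: 10
```

Each worker only ever sees its own chunk, and the admin's job is just splitting, handing out and adding up.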

Me: Hiya thanks a lot man

Wow I already feel like a cloud computing expert now.. hehe ;-)