I recently bought Hadoop: The Definitive Guide, Second Edition, from O’Reilly.
The book is pretty good and gives a solid overview of Hadoop and its family of projects:
- MapReduce (a distributed data processing model and execution environment that runs on large clusters of commodity machines)
- HDFS (a distributed filesystem that runs on large clusters of commodity machines; Hadoop also provides a general filesystem abstraction, of which HDFS is one implementation)
- Avro (an efficient serialization system for cross-language RPC, and persistent data storage)
- Pig (a data flow language and execution environment for exploring very large datasets)
- Hive (a SQL-like query language whose queries are translated into MapReduce jobs)
- HBase (a distributed, column-family-oriented database that uses HDFS for its underlying storage)
- ZooKeeper (a distributed, highly available coordination service)
- Sqoop (a tool for efficiently moving data between relational databases and HDFS)
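To make the MapReduce model in the list above a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and not taken from the book; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are made up for illustration. On a real cluster the framework does the grouping and spreads the work across machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, purely for illustration.
lines = ["hadoop runs mapreduce", "mapreduce runs on hdfs"]
counts = run_job(lines)
print(counts)
```

Because each call to `map_phase` and `reduce_phase` only looks at its own record or key, the work can be split across many machines, which is the property Hadoop relies on.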
I can see Hadoop being useful for certain scenarios, but the datasets would have to be huge and growing quickly. These days a commodity server running an RDBMS can have 48 cores and 1 TB of RAM, which is pretty powerful. It also depends on how the data is to be used: if the data can be summarised or aggregated daily, then the summaries can be stored efficiently within an RDBMS.
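As a rough sketch of the daily-aggregation idea, the following uses Python's built-in `sqlite3` module as a stand-in for a full RDBMS. The `events` and `daily_summary` tables, their columns, and the sample rows are all hypothetical; the point is only that once raw rows are rolled up per day, the table you query stays small.

```python
import sqlite3

# In-memory SQLite database standing in for a real RDBMS server.
conn = sqlite3.connect(":memory:")

# Hypothetical raw event table: one row per event.
conn.execute("CREATE TABLE events (ts TEXT, user_id INTEGER, bytes_sent INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2011-03-01 09:15:00", 1, 500),
        ("2011-03-01 17:40:00", 2, 700),
        ("2011-03-02 08:05:00", 1, 300),
    ],
)

# Roll the raw rows up into one summary row per day.
conn.execute(
    """
    CREATE TABLE daily_summary AS
    SELECT date(ts) AS day,
           COUNT(*) AS event_count,
           SUM(bytes_sent) AS total_bytes
    FROM events
    GROUP BY date(ts)
    """
)

summary = conn.execute(
    "SELECT day, event_count, total_bytes FROM daily_summary ORDER BY day"
).fetchall()
print(summary)
```

In practice a roll-up like this would run as a scheduled daily job, and the raw rows could then be archived or dropped.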
The book also includes some good case studies. If you are looking to learn about Hadoop and its family of projects, this is a good book.