After my posts with links about Scalability and MapReduce, it’s time to share my links about Hadoop (thanks to @asehmi for his links):
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
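To make the "simple programming model" concrete, here is a minimal sketch of the map, shuffle and reduce steps for a word count. It runs in a single Python process and does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, chosen only for illustration. In a real cluster the framework does the grouping and distributes the work across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for what the framework does
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Collapse all counts emitted for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop scales out", "Hadoop stores data"]))
```

The point of the model is that only `map_phase` and `reduce_phase` are written by the programmer; everything else can be parallelized by the framework.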
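The project includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.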
Other Hadoop-related projects at Apache include:
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
Papers – Hadoop Wiki
Realtime Hadoop usage at Facebook — Part 1
HDFS: Realtime Hadoop usage at Facebook — Part 2 – Workload Types
The top five most powerful Hadoop projects – SD Times: Software Development News
How to Deploy a Hadoop Cluster on Windows Azure – Windows Azure
Hadoop in Azure – Distributed Development
Radoop – It’s Like Yahoo Pipes for Hadoop | SiliconANGLE
Introduction to MapReduce and Hadoop
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
Interning at Facebook: Bridging Marketing and Engineering (18)
High Performance Computing: Understanding What is Hadoop
Microsoft adds Hadoop support to SQL Server, data warehouse
Parallel Data Warehouse News and Hadoop Interoperability Plans – SQL Server Team Blog
Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.
Twitter Engineering: A Storm is coming: more details and plans for release
"A Storm cluster is superficially similar to a Hadoop cluster"
Preview of Storm: The Hadoop of Realtime Processing – BackType Technology
Mesos: Dynamic Resource Sharing for Clusters
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.
Big Analytics for Big Data on Hadoop
More about Big Data
Good white papers
Hadoop Summit 2010 – Yahoo! Developer Network
DBMS Musings: Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)
Hoop – Hadoop HDFS over HTTP | Apache Hadoop for the Enterprise | Cloudera
Bioinformatics and the Future of Hadoop
Seven Java projects that changed the world – O’Reilly Radar
InfoQ: Introduction to Oozie
Within the Hadoop ecosystem, there is a relatively new component, Oozie, which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task.
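The idea of combining several jobs into one unit of work can be sketched as a small dependency graph that is run in order. This is only an illustration in plain Python, not Oozie's workflow definition format or API; `run_workflow`, `jobs` and `depends_on` are hypothetical names, and each "job" is just a function standing in for a Map/Reduce job.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(jobs, depends_on):
    """Run every job once, after all the jobs it depends on.

    jobs: dict mapping a job name to a callable (a stand-in for a real job).
    depends_on: dict mapping a job name to the names of the jobs whose
        results it consumes, in argument order.
    Returns a dict of results keyed by job name.
    """
    graph = {name: depends_on.get(name, ()) for name in jobs}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = [results[dep] for dep in graph[name]]
        results[name] = jobs[name](*inputs)
    return results


if __name__ == "__main__":
    jobs = {
        "count": lambda: {"hadoop": 2, "hive": 1},
        "top": lambda counts: max(counts, key=counts.get),
    }
    print(run_workflow(jobs, {"top": ["count"]})["top"])
```

If any job raises an exception, the loop stops and later jobs do not run, which is the simplest possible failure policy for the whole unit of work.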
The Future of Hadoop in Bioinformatics | insideHPC.com
HDFS: Realtime Hadoop usage at Facebook: The Complete Story
SNA Projects Blog : Tech Talk: Anil Madan (eBay) — “Hadoop at eBay”
Ceph as a scalable alternative to the Hadoop Distributed File System
The elephant in the room … Hadoop and BigData!
Hadoop, Hive and Redis for Foursquare Analytics :: myNoSQL
The Hadoop Distributed File System
IBM Jeopardy: Building Watson: An Overview of the DeepQA Project
"To preprocess the corpus and create fast runtime indices we used Hadoop"
Jeopardy Goes to Hadoop :: myNoSQL
ElephantDB, a Distributed Database for Working with Hadoop
InfoQ: Hadoop Redesign for Upgrades and Other Programming Paradigms
Riding the Elephant | The Molecular Ecologist
Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”
Lessons learned putting Hadoop into production « Cloudera » Apache Hadoop for the Enterprise
Dimensional Reduction – Apache Mahout – Apache Software Foundation
Beyond Hadoop – Next-Generation Big Data Architectures – NYTimes.com
Large Scale Natural Language Processing
Hadoop and Realtime Cloud Computing | Cloud Computing Journal
Hadoop and NoSQL Downfall Parody on Vimeo
Hadoop: The Definitive Guide, Second Edition – O’Reilly Media
Hadoop Ecosystem World-Map « Sanjay Sharma’s Weblog
MapReduce, Hadoop: Young, But Worth A Look — Data Management — InformationWeek
Distributed data processing with Hadoop – Part-3: App Build
HDFS: Facebook has the world’s largest Hadoop cluster!
High Availability MySQL: Hadoop and MySQL
High Scalability – How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
Realtime Search for Hadoop – Scalable Log Data Management with Hadoop, Part 3 « mgm technology blog
Behind Caffeine May Be Software to Inspire Hadoop 2.0
Hadoop in a box
Scalability of the Hadoop Distributed File System (Hadoop and Distributed Computing at Yahoo!)
Introduction to Hadoop, HBase, and NoSQL
InfoQ: Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources
Neuroph on Hadoop: Massive Parallel Neural Network System? | NetBeans Zone
Pushing the Limits of Distributed Processing « Cloudera » Apache Hadoop for the Enterprise
More links are coming (distributed computing? NoSQL?).
Angel “Java” Lopez