Angel “Java” Lopez on Blog

August 15, 2011

Hadoop: Links, News and Resources (1)

Filed under: Distributed Computing, Open Source Projects, Scalability — ajlopez @ 9:54 am

After my posts with links about Scalability and MapReduce, it’s time to share my links about Hadoop (thanks to @asehmi for his links):

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these subprojects:

Other Hadoop-related projects at Apache include:

  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
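The "simple programming model" in the description above usually means MapReduce. As a rough illustration only, here is a minimal in-memory sketch of the map, shuffle and reduce steps for a word count. It is plain Python and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen just for this example. In a real cluster the framework runs these steps in parallel on many machines and takes care of failures.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

The point of the model is that map_phase and reduce_phase only look at one record or one key at a time, so the work can be split across machines without the functions knowing about each other.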

Papers – Hadoop Wiki


Realtime Hadoop usage at Facebook — Part 1

HDFS: Realtime Hadoop usage at Facebook — Part 2 – Workload Types

The top five most powerful Hadoop projects – SD Times: Software Development News

How to Deploy a Hadoop Cluster on Windows Azure – Windows Azure

Hadoop in Azure – Distributed Development

Radoop – It’s Like Yahoo Pipes for Hadoop | SiliconANGLE

Introduction to MapReduce and Hadoop

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Interning at Facebook: Bridging Marketing and Engineering (18)

High Performance Computing: Understanding What is Hadoop

Microsoft adds Hadoop support to SQL Server, data warehouse

Parallel Data Warehouse News and Hadoop Interoperability Plans – SQL Server Team Blog

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Twitter Engineering: A Storm is coming: more details and plans for release
"A Storm cluster is superficially similar to a Hadoop cluster"

Preview of Storm: The Hadoop of Realtime Processing – BackType Technology

Mesos: Dynamic Resource Sharing for Clusters
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.

Big Analytics for Big Data on Hadoop

More about Big Data
Good white papers

Hadoop Summit 2010 – Yahoo! Developer Network

DBMS Musings: Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)

Hoop – Hadoop HDFS over HTTP | Apache Hadoop for the Enterprise | Cloudera

Bioinformatics and the Future of Hadoop

Seven Java projects that changed the world – O’Reilly Radar

InfoQ: Introduction to Oozie
Within the Hadoop ecosystem, there is a relatively new component, Oozie, which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task.

The Future of Hadoop in Bioinformatics

HDFS: Realtime Hadoop usage at Facebook: The Complete Story

SNA Projects Blog : Tech Talk: Anil Madan (eBay) — “Hadoop at eBay”

Ceph as a scalable alternative to the Hadoop Distributed File System

The elephant in the room … Hadoop and BigData!

Hadoop, Hive and Redis for Foursquare Analytics :: myNoSQL

The Hadoop Distributed File System

IBM Jeopardy: Building Watson: An Overview of the DeepQA Project
"To preprocess the corpus and create fast runtime indices we used Hadoop"

Jeopardy Goes to Hadoop :: myNoSQL

ElephantDB, a Distributed Database for Working with Hadoop

InfoQ: Hadoop Redesign for Upgrades and Other Programming Paradigms

Riding the Elephant | The Molecular Ecologist

Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”

Lessons learned putting Hadoop into production « Cloudera » Apache Hadoop for the Enterprise

Dimensional Reduction – Apache Mahout – Apache Software Foundation

Beyond Hadoop – Next-Generation Big Data Architectures

Large Scale Natural Language Processing

Hadoop and Realtime Cloud Computing | Cloud Computing Journal

Hadoop and NoSQL Downfall Parody on Vimeo

Hadoop: The Definitive Guide, Second Edition – O’Reilly Media

Hadoop Ecosystem World-Map « Sanjay Sharma’s Weblog

MapReduce, Hadoop: Young, But Worth A Look — Data Management — InformationWeek

Distributed data processing with Hadoop – Part-3: App Build

HDFS: Facebook has the world’s largest Hadoop cluster!

High Availability MySQL: Hadoop and MySQL

High Scalability – How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

Realtime Search for Hadoop – Scalable Log Data Management with Hadoop, Part 3 « mgm technology blog

Behind Caffeine May Be Software to Inspire Hadoop 2.0

Hadoop in a box

Scalability of the Hadoop Distributed File System (Hadoop and Distributed Computing at Yahoo!)

Introduction to Hadoop, HBase, and NoSQL

InfoQ: Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources

Neuroph on Hadoop: Massive Parallel Neural Network System? | NetBeans Zone

Pushing the Limits of Distributed Processing « Cloudera » Apache Hadoop for the Enterprise
April Fools’ joke ;-)

My Links

More links are coming (distributed computing? NoSQL?).

Stay tuned!

Angel “Java” Lopez
