Angel “Java” Lopez on Blog

August 15, 2011

Hadoop: Links, News and Resources (1)

Filed under: Distributed Computing, Open Source Projects, Scalability — ajlopez @ 9:54 am

After my posts with links about Scalability and MapReduce, it’s time to share my links about Hadoop (thanks to @asehmi for his links):

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
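The "simple programming model" mentioned above is MapReduce. As a rough illustration only, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen just for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in-process; a real cluster would spread them over machines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster, the map and reduce functions would run on many machines, and the framework would handle the grouping step, data movement, and retries after failures.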

The project includes these subprojects:

Other Hadoop-related projects at Apache include:

  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

Papers – Hadoop Wiki


Realtime Hadoop usage at Facebook — Part 1

HDFS: Realtime Hadoop usage at Facebook — Part 2 – Workload Types

The top five most powerful Hadoop projects – SD Times: Software Development News

How to Deploy a Hadoop Cluster on Windows Azure – Windows Azure

Hadoop in Azure – Distributed Development

Radoop – It’s Like Yahoo Pipes for Hadoop | SiliconANGLE

Introduction to MapReduce and Hadoop

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Interning at Facebook: Bridging Marketing and Engineering

High Performance Computing: Understanding What is Hadoop

Microsoft adds Hadoop support to SQL Server, data warehouse

Parallel Data Warehouse News and Hadoop Interoperability Plans – SQL Server Team Blog

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Twitter Engineering: A Storm is coming: more details and plans for release
"A Storm cluster is superficially similar to a Hadoop cluster"

Preview of Storm: The Hadoop of Realtime Processing – BackType Technology

Mesos: Dynamic Resource Sharing for Clusters
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.

Big Analytics for Big Data on Hadoop

More about Big Data
Good white papers

Hadoop Summit 2010 – Yahoo! Developer Network

DBMS Musings: Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)

Hoop – Hadoop HDFS over HTTP | Apache Hadoop for the Enterprise | Cloudera

Bioinformatics and the Future of Hadoop

Seven Java projects that changed the world – O’Reilly Radar

InfoQ: Introduction to Oozie
Within the Hadoop ecosystem, there is a relatively new component, Oozie, which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task.

The Future of Hadoop in Bioinformatics

HDFS: Realtime Hadoop usage at Facebook: The Complete Story

SNA Projects Blog : Tech Talk: Anil Madan (eBay) — “Hadoop at eBay”

Ceph as a scalable alternative to the Hadoop Distributed File System

The elephant in the room … Hadoop and BigData!

Hadoop, Hive and Redis for Foursquare Analytics :: myNoSQL

The Hadoop Distributed File System

IBM Jeopardy: Building Watson: An Overview of the DeepQA Project
"To preprocess the corpus and create fast runtime indices we used Hadoop"

Jeopardy Goes to Hadoop :: myNoSQL

ElephantDB, a Distributed Database for Working with Hadoop

InfoQ: Hadoop Redesign for Upgrades and Other Programming Paradigms

Riding the Elephant | The Molecular Ecologist

Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”

Lessons learned putting Hadoop into production « Cloudera » Apache Hadoop for the Enterprise

Dimensional Reduction – Apache Mahout – Apache Software Foundation

Beyond Hadoop – Next-Generation Big Data Architectures

Large Scale Natural Language Processing

Hadoop and Realtime Cloud Computing | Cloud Computing Journal

Hadoop and NoSQL Downfall Parody on Vimeo

Hadoop: The Definitive Guide, Second Edition – O’Reilly Media

Hadoop Ecosystem World-Map « Sanjay Sharma’s Weblog

MapReduce, Hadoop: Young, But Worth A Look — Data Management — InformationWeek

Distributed data processing with Hadoop – Part-3: App Build

HDFS: Facebook has the world’s largest Hadoop cluster!

High Availability MySQL: Hadoop and MySQL

High Scalability – How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

Realtime Search for Hadoop – Scalable Log Data Management with Hadoop, Part 3 « mgm technology blog

Behind Caffeine May Be Software to Inspire Hadoop 2.0

Hadoop in a box

Scalability of the Hadoop Distributed File System (Hadoop and Distributed Computing at Yahoo!)

Introduction to Hadoop, HBase, and NoSQL

InfoQ: Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources

Neuroph on Hadoop: Massive Parallel Neural Network System? | NetBeans Zone

Pushing the Limits of Distributed Processing « Cloudera » Apache Hadoop for the Enterprise
April Fools’ joke 😉

My Links

More links are coming (distributed computing? NoSQL?).

Stay tuned!

Angel “Java” Lopez


