Hadoop: Links, News and Resources (1)

After my posts with links about Scalability and MapReduce, it’s time to share my links about Hadoop (thanks to @asehmi for his links):

http://hadoop.apache.org/

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

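The project includes these subprojects:

  • Hadoop Common: The common utilities that support the other Hadoop subprojects.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.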

Other Hadoop-related projects at Apache include:

  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
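
To make the "simple programming model" from the project description a bit more concrete, here is a minimal sketch of the MapReduce idea as a word count. It is plain Python running in a single process, not Hadoop's actual Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen only for illustration. On a real cluster the framework would run the map and reduce steps on many machines and do the grouping itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the counts emitted for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The point of the model is that map_phase and reduce_phase only ever see one line or one key at a time, which is what lets a framework spread them across machines.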

http://wiki.apache.org/hadoop/

Papers – Hadoop Wiki
http://wiki.apache.org/hadoop/Papers

HDFS
http://hadoopblog.blogspot.com/

Realtime Hadoop usage at Facebook — Part 1
http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html

HDFS: Realtime Hadoop usage at Facebook — Part 2 – Workload Types
http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part_28.html

The top five most powerful Hadoop projects – SD Times: Software Development News
http://www.sdtimes.com/l/35596

How to Deploy a Hadoop Cluster on Windows Azure – Windows Azure
http://blogs.msdn.com/b/windowsazure/archive/2011/05/17/how-to-deploy-a-hadoop-cluster-on-windows-azure.aspx

Hadoop in Azure – Distributed Development
http://blogs.msdn.com/b/mariok/archive/2011/05/11/hadoop-in-azure.aspx

Radoop – It’s Like Yahoo Pipes for Hadoop | SiliconANGLE
http://siliconangle.com/blog/2011/08/11/radoop-its-like-yahoo-pipes-for-hadoop/?

Introduction to MapReduce and Hadoop
http://www.theserverside.com/discussions/thread.tss?thread_id=62376

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/

Interning at Facebook: Bridging Marketing and Engineering
http://www.facebook.com/note.php?note_id=10150254305343920

High Performance Computing: Understanding What is Hadoop
http://patodirahul.blogspot.com/2011/03/understanding-what-is-hadoop.html

Microsoft adds Hadoop support to SQL Server, data warehouse
http://www.tmcnet.com/usubmit/2011/08/10/5696037.htm

Parallel Data Warehouse News and Hadoop Interoperability Plans – SQL Server Team Blog
http://blogs.technet.com/b/dataplatforminsider/archive/2011/08/08/parallel-data-warehouse-news-and-hadoop-interoperability-plans.aspx

Cascading
http://www.cascading.org/
Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Twitter Engineering: A Storm is coming: more details and plans for release
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
"A Storm cluster is superficially similar to a Hadoop cluster"

Preview of Storm: The Hadoop of Realtime Processing – BackType Technology
http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce

Mesos: Dynamic Resource Sharing for Clusters
http://www.mesosproject.org/
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.

Big Analytics for Big Data on Hadoop
http://karmasphere.com/

More about Big Data
http://www.bigdata.com/bigdata/blog
Good white papers

Hadoop Summit 2010 – Yahoo! Developer Network
http://developer.yahoo.com/events/hadoopsummit2010/

DBMS Musings: Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)
http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html

Hoop – Hadoop HDFS over HTTP | Apache Hadoop for the Enterprise | Cloudera
http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/

Bioinformatics and the Future of Hadoop
http://www.genomeweb.com/blog/bioinformatics-and-future-hadoop

Seven Java projects that changed the world – O’Reilly Radar
http://radar.oreilly.com/2011/07/7-java-projects.html

InfoQ: Introduction to Oozie
http://www.infoq.com/articles/introductionOozie
Within the Hadoop ecosystem, there is a relatively new component Oozie, which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task
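
The idea of combining several jobs into one logical unit of work can be sketched as a small dependency-ordered runner. This is not Oozie's interface (Oozie workflows are defined in XML and run on a server); it is a single-process Python illustration, and the names run_workflow, jobs, and depends_on are hypothetical. It assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, depends_on):
    """Run every job after the jobs it depends on.

    jobs maps a job name to a function that takes a list of upstream results.
    depends_on maps a job name to the names of the jobs it needs first.
    Returns the execution order and a dict of results by job name.
    """
    graph = {name: depends_on.get(name, ()) for name in jobs}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        inputs = [results[dep] for dep in graph[name]]
        results[name] = jobs[name](inputs)
    return order, results


def extract(_inputs):
    """Stand-in for a first job that produces some data."""
    return [3, 1, 2]


def sort_values(inputs):
    """Stand-in for a job that consumes the output of extract."""
    return sorted(inputs[0])


def total(inputs):
    """Stand-in for a second job that also consumes the output of extract."""
    return sum(inputs[0])


if __name__ == "__main__":
    order, results = run_workflow(
        {"extract": extract, "sort": sort_values, "total": total},
        {"sort": ["extract"], "total": ["extract"]},
    )
    print(order, results)
```

Each function here stands in for a whole Map/Reduce job; the runner only decides the order and hands results downstream.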

The Future of Hadoop in Bioinformatics | insideHPC.com
http://insidehpc.com/2011/07/03/the-future-of-hadoop-in-bioinformatics/

HDFS: Realtime Hadoop usage at Facebook: The Complete Story
http://hadoopblog.blogspot.com/2011/07/realtime-hadoop-usage-at-facebook.html

SNA Projects Blog : Tech Talk: Anil Madan (eBay) — “Hadoop at eBay”
http://sna-projects.com/blog/2011/06/hadoop-at-ebay/

Ceph as a scalable alternative to the Hadoop Distributed File System
http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf

The elephant in the room … Hadoop and BigData!
http://mikethetechie.com/post/6822576191/the-elephant-in-the-room-hadoop-and-bigdata

Hadoop, Hive and Redis for Foursquare Analytics :: myNoSQL
http://nosql.mypopescu.com/post/3872483038/hadoop-hive-and-redis-for-foursquare-analytics

The Hadoop Distributed File System
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf

IBM Jeopardy: Building Watson: An Overview of the DeepQA Project
https://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf
"To preprocess the corpus and create fast runtime indices we used Hadoop"

Jeopardy Goes to Hadoop :: myNoSQL
http://nosql.mypopescu.com/post/3406224331/jeopardy-goes-to-hadoop

ElephantDB, a Distributed Database for Working with Hadoop
http://www.readwriteweb.com/hack/2011/02/ravendb-a-distributed-database.php

InfoQ: Hadoop Redesign for Upgrades and Other Programming Paradigms
http://www.infoq.com/news/2011/02/hadoop_redesign

Riding the Elephant | The Molecular Ecologist
http://tomato.biol.trinity.edu/blog/2011/02/riding-the-elephant/

Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”
http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/

Lessons learned putting Hadoop into production « Cloudera » Apache Hadoop for the Enterprise
http://www.cloudera.com/blog/2010/12/lessons-learned-putting-hadoop-into-production/

Dimensional Reduction – Apache Mahout – Apache Software Foundation
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

Beyond Hadoop – Next-Generation Big Data Architectures – NYTimes.com
https://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html

Large Scale Natural Language Processing
http://us.pycon.org/media/2010/talkdata/PyCon2010/098/large-scale-nlp-pycon-2010.pdf

Hadoop and Realtime Cloud Computing | Cloud Computing Journal
http://cloudcomputing.sys-con.com/node/1572508

Hadoop and NoSQL Downfall Parody on Vimeo
http://vimeo.com/15782414

Hadoop: The Definitive Guide, Second Edition – O’Reilly Media
http://oreilly.com/catalog/9781449389734

Hadoop Ecosystem World-Map « Sanjay Sharma’s Weblog
http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/

MapReduce, Hadoop: Young, But Worth A Look — Data Management — InformationWeek
http://www.informationweek.com/news/business_intelligence/warehouses/showArticle.jhtml?articleID=226600088

Distributed data processing with Hadoop – Part-3: App Build
http://www.gnarc.com/tutorials/distributed-data-processing-with-hadoop-part-3-app-build

HDFS: Facebook has the world’s largest Hadoop cluster!
http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html

High Availability MySQL: Hadoop and MySQL
http://mysqlha.blogspot.com/2007/10/hadoop-and-mysql.html

High Scalability – How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Realtime Search for Hadoop – Scalable Log Data Management with Hadoop, Part 3 « mgm technology blog
http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/

Behind Caffeine May Be Software to Inspire Hadoop 2.0
http://gigaom.com/2010/06/11/behind-caffeine-may-be-software-to-inspire-hadoop-2-0

Hadoop in a box
http://www.slideshare.net/tim.lossen.de/hadoop-in-a-box

Scalability of the Hadoop Distributed File System (Hadoop and Distributed Computing at Yahoo!)
http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html

Introduction to Hadoop, HBase, and NoSQL
http://www.slideshare.net/xefyr/introduction-to-hadoop-hbase-and-nosql

InfoQ: Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources
http://www.infoq.com/presentations/Horizontal-Scalability

Neuroph on Hadoop: Massive Parallel Neural Network System? | NetBeans Zone
http://netbeans.dzone.com/neuroph-hadoop-nb

Pushing the Limits of Distributed Processing « Cloudera » Apache Hadoop for the Enterprise
http://www.cloudera.com/blog/2010/04/pushing-the-limits-of-distributed-processing/
An April Fools' joke 😉

My Links
http://www.delicious.com/ajlopez/hadoop
http://www.delicious.com/ajlopez/hadoop+tutorial
http://www.delicious.com/ajlopez/hadoop+video
http://www.delicious.com/ajlopez/hadoop+nosql
http://www.delicious.com/ajlopez/hadoop+distributedcomputing
http://www.delicious.com/ajlopez/hadoop+scalability
http://www.delicious.com/ajlopez/hadoop+machinelearning
http://www.delicious.com/ajlopez/hadoop+artificialintelligence

More links are coming (distributed computing? NoSQL?).

Stay tuned!

Angel “Java” Lopez
http://www.ajlopez.com
http://twitter.com/ajlopez
