After my posts with links about Scalability and MapReduce, it’s time to share my links about Hadoop (thanks to @asehmi for his links):
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
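To give a rough idea of the "simple programming model" mentioned above, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop's API and it does not distribute anything; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over an in-memory list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map and reduce functions would run on many machines and the grouping step would move data across the network; the sketch only shows the shape of the computation.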
Other Hadoop-related projects at Apache include:
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
http://wiki.apache.org/hadoop/
Papers – Hadoop Wiki
http://wiki.apache.org/hadoop/Papers
HDFS
http://hadoopblog.blogspot.com/
Realtime Hadoop usage at Facebook — Part 1
http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html
HDFS: Realtime Hadoop usage at Facebook — Part 2 – Workload Types
http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part_28.html
The top five most powerful Hadoop projects – SD Times: Software Development News
http://www.sdtimes.com/l/35596
How to Deploy a Hadoop Cluster on Windows Azure – Windows Azure
http://blogs.msdn.com/b/windowsazure/archive/2011/05/17/how-to-deploy-a-hadoop-cluster-on-windows-azure.aspx
Hadoop in Azure – Distributed Development
http://blogs.msdn.com/b/mariok/archive/2011/05/11/hadoop-in-azure.aspx
Radoop – It’s Like Yahoo Pipes for Hadoop | SiliconANGLE
http://siliconangle.com/blog/2011/08/11/radoop-its-like-yahoo-pipes-for-hadoop/?
Introduction to MapReduce and Hadoop
http://www.theserverside.com/discussions/thread.tss?thread_id=62376
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/
Interning at Facebook: Bridging Marketing and Engineering (18)
http://www.facebook.com/note.php?note_id=10150254305343920
High Performance Computing: Understanding What is Hadoop
http://patodirahul.blogspot.com/2011/03/understanding-what-is-hadoop.html
Microsoft adds Hadoop support to SQL Server, data warehouse
http://www.tmcnet.com/usubmit/2011/08/10/5696037.htm
Parallel Data Warehouse News and Hadoop Interoperability Plans – SQL Server Team Blog
http://blogs.technet.com/b/dataplatforminsider/archive/2011/08/08/parallel-data-warehouse-news-and-hadoop-interoperability-plans.aspx
Cascading
http://www.cascading.org/
Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.
Twitter Engineering: A Storm is coming: more details and plans for release
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
"A Storm cluster is superficially similar to a Hadoop cluster"
Preview of Storm: The Hadoop of Realtime Processing – BackType Technology
http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce
Mesos: Dynamic Resource Sharing for Clusters
http://www.mesosproject.org/
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.
Big Analytics for Big Data on Hadoop
http://karmasphere.com/
More about Big Data
http://www.bigdata.com/bigdata/blog
Good white papers
Hadoop Summit 2010 – Yahoo! Developer Network
http://developer.yahoo.com/events/hadoopsummit2010/
DBMS Musings: Hadoop’s tremendous inefficiency on graph data management (and how to avoid it)
http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html
Hoop – Hadoop HDFS over HTTP | Apache Hadoop for the Enterprise | Cloudera
http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/
Bioinformatics and the Future of Hadoop
http://www.genomeweb.com/blog/bioinformatics-and-future-hadoop
Seven Java projects that changed the world – O’Reilly Radar
http://radar.oreilly.com/2011/07/7-java-projects.html
InfoQ: Introduction to Oozie
http://www.infoq.com/articles/introductionOozie
Within the Hadoop ecosystem, there is a relatively new component Oozie, which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task
The Future of Hadoop in Bioinformatics | insideHPC.com
http://insidehpc.com/2011/07/03/the-future-of-hadoop-in-bioinformatics/
HDFS: Realtime Hadoop usage at Facebook: The Complete Story
http://hadoopblog.blogspot.com/2011/07/realtime-hadoop-usage-at-facebook.html
SNA Projects Blog : Tech Talk: Anil Madan (eBay) — “Hadoop at eBay”
http://sna-projects.com/blog/2011/06/hadoop-at-ebay/
Ceph as a scalable alternative to the Hadoop Distributed File System
http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf
The elephant in the room … Hadoop and BigData!
http://mikethetechie.com/post/6822576191/the-elephant-in-the-room-hadoop-and-bigdata
Hadoop, Hive and Redis for Foursquare Analytics :: myNoSQL
http://nosql.mypopescu.com/post/3872483038/hadoop-hive-and-redis-for-foursquare-analytics
The Hadoop Distributed File System
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
IBM Jeopardy: Building Watson: An Overview of the DeepQA Project
https://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf
"To preprocess the corpus and create fast runtime indices we used Hadoop"
Jeopardy Goes to Hadoop :: myNoSQL
http://nosql.mypopescu.com/post/3406224331/jeopardy-goes-to-hadoop
ElephantDB, a Distributed Database for Working with Hadoop
http://www.readwriteweb.com/hack/2011/02/ravendb-a-distributed-database.php
InfoQ: Hadoop Redesign for Upgrades and Other Programming Paradigms
http://www.infoq.com/news/2011/02/hadoop_redesign
Riding the Elephant | The Molecular Ecologist
http://tomato.biol.trinity.edu/blog/2011/02/riding-the-elephant/
Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”
http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/
Lessons learned putting Hadoop into production « Cloudera » Apache Hadoop for the Enterprise
http://www.cloudera.com/blog/2010/12/lessons-learned-putting-hadoop-into-production/
Dimensional Reduction – Apache Mahout – Apache Software Foundation
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
Beyond Hadoop – Next-Generation Big Data Architectures – NYTimes.com
https://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html
Large Scale Natural Language Processing
http://us.pycon.org/media/2010/talkdata/PyCon2010/098/large-scale-nlp-pycon-2010.pdf
Hadoop and Realtime Cloud Computing | Cloud Computing Journal
http://cloudcomputing.sys-con.com/node/1572508
Hadoop and NoSQL Downfall Parody on Vimeo
http://vimeo.com/15782414
Hadoop: The Definitive Guide, Second Edition – O’Reilly Media
http://oreilly.com/catalog/9781449389734
Hadoop Ecosystem World-Map « Sanjay Sharma’s Weblog
http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
MapReduce, Hadoop: Young, But Worth A Look — Data Management — InformationWeek
http://www.informationweek.com/news/business_intelligence/warehouses/showArticle.jhtml?articleID=226600088
Distributed data processing with Hadoop – Part-3: App Build
http://www.gnarc.com/tutorials/distributed-data-processing-with-hadoop-part-3-app-build
HDFS: Facebook has the world’s largest Hadoop cluster!
http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
High Availability MySQL: Hadoop and MySQL
http://mysqlha.blogspot.com/2007/10/hadoop-and-mysql.html
High Scalability – How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
Realtime Search for Hadoop – Scalable Log Data Management with Hadoop, Part 3 « mgm technology blog
http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/
Behind Caffeine May Be Software to Inspire Hadoop 2.0
http://gigaom.com/2010/06/11/behind-caffeine-may-be-software-to-inspire-hadoop-2-0
Hadoop in a box
http://www.slideshare.net/tim.lossen.de/hadoop-in-a-box
Scalability of the Hadoop Distributed File System (Hadoop and Distributed Computing at Yahoo!)
http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html
Introduction to Hadoop, HBase, and NoSQL
http://www.slideshare.net/xefyr/introduction-to-hadoop-hbase-and-nosql
InfoQ: Horizontal Scalability via Transient, Shardable, and Share-Nothing Resources
http://www.infoq.com/presentations/Horizontal-Scalability
Neuroph on Hadoop: Massive Parallel Neural Network System? | NetBeans Zone
http://netbeans.dzone.com/neuroph-hadoop-nb
Pushing the Limits of Distributed Processing « Cloudera » Apache Hadoop for the Enterprise
http://www.cloudera.com/blog/2010/04/pushing-the-limits-of-distributed-processing/
April Joke 😉
My Links
http://www.delicious.com/ajlopez/hadoop
http://www.delicious.com/ajlopez/hadoop+tutorial
http://www.delicious.com/ajlopez/hadoop+video
http://www.delicious.com/ajlopez/hadoop+nosql
http://www.delicious.com/ajlopez/hadoop+distributedcomputing
http://www.delicious.com/ajlopez/hadoop+scalability
http://www.delicious.com/ajlopez/hadoop+machinelearning
http://www.delicious.com/ajlopez/hadoop+artificialintelligence
More links are coming (distributed computing? NoSQL?).
Stay tuned!
Angel “Java” Lopez
http://www.ajlopez.com
http://twitter.com/ajlopez