Designing & Optimizing a Content Recommendation System using MapReduce

Last month I had the opportunity to design a content recommendation system, and I decided to go with BigData due to the scope of the problem. After successfully implementing it and running it on a training dataset of around 2.8 million ratings given by 73,000 people across 1,600 movies, I’m about to share…

Demystifying YARN – Parallelism in Hadoop2

With the introduction of YARN in Hadoop2, the resource scheduling engine got a massive overhaul. The older slot-based system of the MapReduce framework is gone, replaced with a container-based system. Intended to be more flexible than its predecessor, this new system is far more dynamic and allows…

Small files, big problem! – Using CombineTextInputFormat to optimize MapReduce and handle a large number of small input files

The problem I am going to address in this article is not limited to the Hadoop Distributed File System (the storage system of BigData); it's prevalent across other platforms as well. When you have data split up into a lot of small files, it puts unnecessary strain on the OS responsible for managing…
