Tuning Kafka for Consistency

Having worked with Kafka for almost a year and a half, I wanted to share my thoughts on Kafka’s consistency guarantees and how to achieve them. I will be skipping over the details of basic configurations and suggestions like having multiple partitions & setting the replication factor to be more than one because those are…

Choosing the Correct Flink Restart Strategy & Avoiding Production Gotchas

Having spent the last year and a half designing & developing real-time data pipelines, here are my 2 cents on the production gotchas I encountered related to Flink’s restart strategy, which I wish I had known sooner. Also, if you are new to stream system development, check out my previous article where I wrote about the change…

Data Enrichment – Designing & Optimizing a Real Time Stream Joining Pipeline

In my previous article, I wrote about the paradigm shift I underwent while adapting to the idiosyncrasies of real-time systems compared to batch processing. Picking up from where I left off, this article focuses on applying all those concepts to build a real-time data enrichment pipeline which performs join & lookup in…

Undergoing the Paradigm Shift – Batch Processing to RealTime Streaming

In this article, I share my experiences of undergoing a paradigm shift from batch-based to real-time stream-based processing over the course of building a real-time data enrichment pipeline which performs join & lookup in real time. When the volume of data increases to a point that a single DB can't handle it in…

Designing & Optimizing a Content Recommendation System using MapReduce

Last month I had the opportunity to design a content recommendation system, and I decided to go with BigData due to the scope of the problem. After successfully implementing it and running it on a training dataset consisting of around 2.8 million ratings given by 73,000 people spread across 1,600 movies, I’m about to share…

Demystifying YARN – Parallelism in Hadoop2

With the introduction of YARN in Hadoop2, the resource scheduling engine got a massive overhaul. Gone are the days of the older slot-based system for the MapReduce framework, which has been replaced with a container-based system. Intended to be more flexible than its predecessor, this new system is far more dynamic and allows…

Small files.. Big problem!! – Using CombineTextInputFormat to optimize MapReduce & handle a large number of small input files

The problem I am going to address in this article is not limited to the Hadoop Distributed File System (the storage system of BigData); it's prevalent across other platforms as well. When you have some data split up in a lot of smaller files, it puts unnecessary strain on the OS responsible for managing…

Hadoop on Windows: Install a BigData environment and run WordCounter on your PC

As a BigData developer, one of the biggest challenges I faced initially was installing and configuring a development system for trying out the Hadoop framework and for running my Hadoop code. The whole BigData architecture is by nature distributed across multiple commodity machines, with multiple services running and coordinating the cluster. If all this sounds…
