Tail Recursion in Scala

Ever since I started programming in Scala, I’ve tried hard to adopt the paradigms of functional programming as much as possible. Recently I had to work on a recursive piece of code, and my architect suggested that I use tail recursion instead of traditional recursion. This blog talks all about…

Tuning Kafka for Consistency

Having worked with Kafka for almost a year and a half, I wanted to share my thoughts on Kafka’s consistency guarantees and how to achieve them. I will skip over the details of basic configuration and suggestions such as having multiple partitions & setting the replication factor to more than one, because those are…

Synchronizing Multi Threaded code based on Object Value.

During my most recent coding escapade, I had to design & implement a web server capable of handling a staggering number of requests concurrently. Now, I could have gone with a high-level multi-threading framework like Akka or Vert.x, but the challenge was to use only the libraries provided by core Java. The entire source…

Choosing the correct Flink Restart Strategy & avoiding Production Gotchas

Having spent the last year and a half designing & developing real-time data pipelines, here are my 2 cents on the production gotchas I encountered with Flink’s restart strategy, which I wish I had known sooner. Also, if you are new to stream system development, check out my previous article where I wrote about the change…

Data Enrichment – Designing & Optimizing a Real Time Stream Joining Pipeline

In my previous article, I wrote about the paradigm shift I underwent while adapting to the idiosyncrasies of real-time systems compared to batch processing. Picking up from where I left off, this article focuses on applying all those concepts to build a real-time data enrichment pipeline which performs join & lookup in…

Undergoing the Paradigm Shift – Batch Processing to RealTime Streaming

In this article, I share my experiences of undergoing a paradigm shift from batch-based to real-time stream-based processing over the course of building a real-time data enrichment pipeline which performs join & lookup in real time. When the volume of data increases to a point that a single DB can’t handle it in…

Designing & Optimizing a Content Recommendation System using MapReduce

Last month I had the opportunity to design a content recommendation system, and I decided to go with BigData due to the scope of the problem. After successfully implementing it and running it on a training dataset consisting of around 2.8 million ratings given by 73,000 people spread across 1600 movies, I’m about to share…

MultiThreading in Applications .. featuring JDBC

Nearly all developers who’ve dealt with database operations have at least once faced the scenario where their code is essentially waiting for SQL queries to finish, which results in a bit of a slowdown. Java Database Connectivity (JDBC), as it is called, can be one of the slowest parts of the application because…

Demystifying Yarn – Parallelism in Hadoop2

With the introduction of YARN in Hadoop2, the resource scheduling engine got a massive overhaul. Gone are the days of the older slot-based system for the MapReduce framework, which has been replaced with a container-based system. Intended to be more flexible than its predecessor, this new system is far more dynamic and allows…

Small files.. Big problem !! – Using CombineTextInputFormat to optimize MapReduce & handle large number of small input files.

The problem I am going to address in this article is not limited to the Hadoop Distributed File System (the storage system of BigData); it’s prevalent across other platforms as well. When you have some data split up into a lot of smaller files, it puts unnecessary strain on the OS responsible for managing…

Hadoop on Windows : Install BigData environment and run WordCounter on your PC.

As a Big-data developer, one of the biggest challenges I faced initially was installing and configuring a development system for trying out the Hadoop framework and for running my Hadoop code. The whole BigData architecture is by nature distributed across multiple commodity hardware machines, with multiple services running and coordinating the cluster. If all this sounds…

Abstracting Exception propagation – Avoid throwing Model Layer SqlExceptions directly to Service layer.

Around 2 years ago, when I started my career as a Java developer, one of the first things I learnt was the MVC design approach for developing a web application. Without going into too much detail about the design paradigm, the basic idea here is to split an application into 3 parts, which is in line with…

Making Data Clustering 768x times faster !!

From 64 hours to 300 seconds !!! That’s the sort of runtime improvement I was able to achieve by using arrays as the primary data structure & reducing the time complexity from n^3*mC1 to n*m in a clustering algorithm I improved. Now this post is a bit lengthy & I do recommend going through the entire post to understand…
