MultiThreading in Applications .. featuring JDBC

Nearly all the developers who’ve dealt with database operations have at least once faced the the scenario where their code is essentially waiting for SQL queries to finish, which results in bit of a slowdown around these parts. Java Database Connectivity (JDBC) as it is called, can be one of the slowest parts of the application because it’s entirely dependent on the database I/O for its performance. So in this article I share my views to speed things up by using MultiThreading.

Now one of the most obvious ways to increase db performance is query optimization. But when working primarily with updates and in JDBC mode there is hardly anything left to optimize on the query end. We can however, leverage the multiple cores & threads offered by modern machines and work our way through the insert statements in parallel using multiple threads. Each thread will be given a separate Connection (JavaDoc – Pitfalls of sharing connection among threads) object to work with, and those connections will be supplied from a connection pool.

Let’s have a look at such one such practical example of application of this use case. We need to access data from the database and for each row, depending on its value, modify it and update the row back into the database. A single threaded application performs these operations (read one row at a time.. perform some operations on the data.. update it back) would look like this. Now on this single threaded example the application is literally doing nothing while the for sql queries are being executed in the background on the database side. This time can be better spent preparing & working on the next batch of updates. (which is exactly what we will see in a minute).


Horizontal vs Vertical Scalability.

Before I dive deep into the whole MultiThreading section, I just want to give a brief info on scalability and why it’s important especially from a design standpoint. We think about scaling because softwares evolve over time. As the data they need to process increases, the they become time bound and the simplest way to increase the speed is by scaling/upgrading the machine.

  • Vertical scaling
    We scale vertically by adding more computational power (CPU, RAM, Network) to an existing machine, often upgrading it in the process.
    The number of machines in the system does not change.
    Usually more expensive to buy a single powerful component than adding 2 smaller components, which will outperform the former when combined.
  • Horizontal scaling
    Here we scale horizontally by adding more machines into our pool of resources.
    The number of machines in the system increases but most of them are similar in computational power so as to avoid a heterogenous cluster.
    Usually more cost effective as infrastructure can be upgraded on-demand.

Planning for Scalability is very important as it will influence which MultiThreading model to chose from the below 2 and will dictate how the software will evolve.

The 2 Approaches to MultiThreading.

We can roughly classify Multi threaded programs in 2 types depending on their design.

Task parallelism is when the program is divided into multiple sections each doing a unique task, different than the other modules but usually on the same data. It resembles a real life Assembly line, where there are a lot of workers (Threads) working on an assembly line (Data pipeline). Each Worker after getting an input from the pipeline, does its job (Processing) and then puts the output back to the assembly line, which forms the input to the next worker.

Here, each thread is only responsible for doing one and only one job. Queue data structures are used as the pipelines through which the threads interact with the data. This design is usually used when there are parts of the code which perform significantly slower than the rest & also when the software is being written from the ground up as it involves carefully breaking the application’s logic into threads and then trying to work them in parallel.

Thus the Advantages of this approach is that in most cases the software is custom built keeping into consideration the complexity of the code (identifying slow parts of the code) & the computational power of the actual hardware machine onto which the app will be deployed. Experienced developers can thus squeeze every bit of performance out of existing hardware by controlling the thread count and tailoring the application and hence the performance offered by such an approach is top notch due to its custom nature. Also this approach can be used even when the data can’t be broken down/worked upon in parallel.

Disadvantages of such a design include fact that this model is like a double-edged sword & it offers top notch performance.. but only if executed properly. Due to the high level of data piping, this design needs a very high sync among the threads which act as workers and failure to achieve a proper coordination between then will increases the chances of running into thread sync issues like deadlock which threaten to bring down the entire assembly line to a halt, and in most cases, these issues are are notoriously hard to debug. Also since most of the code is custom written wrt. a particular machine, there is a Lack of Horizontal scalability as the software simply isn’t designed to utilize additional machines if allocated.

Data parallelism is when the total data needed to process is broken down and multiple threads of the application, each work on a subset of the data. The important distinction to note here is that, in this model the threads don’t share the data they process. There is no pipeline. Each thread works on its part of the data and performs all the required operations before committing the data to say a database.

Also, all the threads have the same logical function, which gets applied on the data which that thread is currently processing.
This approach has a lot of advantages.

  • Easy for porting existing applications.
    Unlike the complete redesign needed in Task parallelism, here one simply needs to shard/split the input data effectively across the worker threads. After that it’s a simple matter of wrapping the application method calls inside the thread’s run method, thus creating an encapsulated instance of the application’s logic (responsible for data processing). Now running multiple threads in parallel will cause the application logic to be parallelized. This can be done to easily multi-thread existing code.
  • Easy to re-trigger failures
    During data processing if any there occurs any Failures, will be localized to that specific worker thread and can be easily identified and re-run on that very subset of data which the thread was processing before it crashed using mechanisms like Thread.UncaughtExceptionHandler which can be used for rescheduling.
  • Failures won’t crash the entire system
    Since there is no data pipeline dependency where the output of a thread forms the input to another, errors which happen in worker threads will be localized and if handled appropriately, will not cause a cascading error which crashes the entire system. This is only possible because the threads work independently and there is no data sharing between them. (or very little/insignificant)

Google in 2004 took this approach to new level with its implementation of MapReduce which was later made public and gave rise to the Hadoop Framework as we now know it.

  • Heart of MapReduce
    Without going into too much detail about it, the whole idea of MapReduce is to “Map” a function to each and every shard of the data and then “Reduce” to produce the final output. Needless to say, this paradigm is heavily inspired from data parallelism and offers the advantages I mentioned along with new features like distributed computing and
  • Efficient Horizontal Scaling
    Hadoop programs are designed to take advantage of multiple machines in the cluster without needing any code rework. The framework automatically distributes the workload among newly added machines in the cluster. Compared to Task parallelism which favours Vertical scaling, Horizontal Scaling is much more cost effective due to its on-demand nature.

The better choice ?.. or lack thereof..

To be honest Hadoop is the de facto standard in terms of bulk data processing & warehousing, & for the right reasons as it offers a lot of benefits and allows people to focus more on the application logic itself & abstracts away the complexities of handling a distributed system in general. But in reality, not all projects need to leverage that power & most can simply use a significant performance boost by going the MultiThreaded route without needing to invest in the Hadoop ecosystem.

As I mentioned earlier, the choice of model is decided by which stage of software development we step into. During the inception we have complete control over the architecture/design and thus can go ahead with Task parallelism. If on the other hand, we are working with existing code, then Task parallel approach will need heavy code refactoring and due to that reason it’s better to go with the Data parallel approach.

Going by my personal experience in working with these 2 very distinct threading models, if given a choice I’ll mostly prefer the data parallel approach as it is the most effective way to gain a speed boost because due to the nature of the design, there is very little interoperability between the threads, and thus we are always skirting around the finicky parts of MultiThreading (sync issues, sharing of mutable data). This is harder to avoid in Task parallel approach.