Making Data Clustering 768x times faster !!

From 64 hours to 300 seconds !!!
That’s the sort of runtime improvement I was able to achieve by using Arrays as the primary data structure & reducing the time complexity from n3mC1  to n*in a clustering algorithm I improved. Now this post is a bit lengthy & do I recommend going through the entire post to understand the problem faced by me it’s solution. But if you are in a hurry, click here to directly jump to my algorithm / solution.

The Objective

So, before I get all technical about my algorithm, let me give a brief overview about the problem itself. My input data consisted of a huge collection of sets & each one of the those sets had a list of integers. Each integer value is used to distinctly identify one particular entity.

Example of Sample Input

For Eg, the actual meaning or value for the numbers can be

  1. Primary key of data in the table.
  2. IDs assigned to generated bi-grams.
  3. Line numbers of entries present in a text file.

So, the input can be thought of as a Collection<Set<Integer>>. This numeric representation of the data which is to be classified, is done because working with integers is fast & more memory efficient than working with Objects like Strings. My objective was to cluster the Sets based on the following criteria.

  1. Consider each Set in the collection.
  2. Compare it with every other set in this collection.
  3. If there exists any common elements (basically an intersection) between these 2 sets which are being compared, perform a distinct union of those 2 sets.
  4. Add that newly created parent set in the collection.
  5. Remove the 2 child sets which were combined.
  6. Keep repeating till new cluster formation stops. (No more possible intersections) I.e. there aren’t any 2 clusters in the collection which have common elements.

That’s it! Now compared to other clustering algorithms like K-means, this requirement is very simple.

The Problem

If we directly map it to code, the pseudo code will be resembling something like,

while (clusters keep forming)                               O(n)
    for (each set s1 : in collection)                       O(n)
        for (every other set s2 : in the collection)        O(n)
            for (every element e : of s1)                   O(m)
                if (element e present in s2)                O(1)
                    s1.addALL(s2)                           O(m) * O(1)
                    remove set s2
                    continue while loop                                            
    if (none of the sets share any common elements with other sets)
    break while loop;

Now, consider I have n sets in the collection, each consisting of m elements.

  • Comparing each element of set1 (with m1 elements) with set2 (with m2 elements) will have a complexity of O(m1) * O(1)
    O(1) is the cost of performing the isPresent (contains) operation on set2,  assuming that we use a hashed data structure like HashSet.
  • Set1 addAll set2 would again take time proportional to  O(m1) * O(1)
    O(1) is the cost of performing the put operation on set2, which is done O(m1) times as m1 elements of set1 is added into set2.
  • So the total cost for merging 2 sets if they have any common element is 2 * O(m1) * O(1).
  • This is just for comparing 2 sets. We have N sets !!
    So for comparing N sets with each other, we’ll need time proportional to O(n2) due to nested for-loop.
  • So the total cost required to check the formation of a single cluster is 2 * (n* m * O(1)).
  • Now, each new cluster formed will need to be compared with all the other clusters which are present in the collection as its possible that the newly formed cluster has common elements with a cluster which was previous compared but shared no common elements.

Let me give an example :
In the ideal scenario, all the consecutive sets have common elements and hence for such cases the while loop can be effectively eliminated.

But for most of the case (average case scenario), this is how the data was arranged,

Thus, for N cluster formations we’ll have to repeat the above process n times to not just cover the consecutive sets, but to cover all the possible combinations, which will further lead to a O(n) multiplied for each cluster formed. So now the total cost of the algorithm is n * (2 * (n* m * O(1))).

This can be approximated to  ~n3mC1 where C1 is the constant time needed for each constant operation like  the contains() and insert() operation of HashSet.

Now normally we don’t give significant importance to constant operations in the algorithm performance analysis. But my input dataset was huge & I had around 2 million sets (n was 2 million) and each set containing roughly 1 million integers (m was 1 million). So even constant time operations like the HashSet contains() and insert() were adding up to a LOT of time and hence I factored it into the approximation due to the sheer input data being fed to an algorithm whose time complexity ranges from O(n3) to almost O(n4) if m equals n.

The Solution

My solution revolved around the idea that I had to reduce the amount of loops / iterations going on over the input data which was slowing down the speed considerably. In other words, get the input clustered without making costly set union operations to the data structure each and every time a new cluster is formed. With this in mind, I started to think along the lines of Lazy evaluation and very loosely based on QuickUnion Algorithm & devised my algorithm which goes like,

  1. Iterate each and every the set one by one, considering all its elements at a time.
  2. Generate a unique pointer Pn (Here’s where Lazy Evaluation kicks in)  for each set.
    Where n is the current set being considered.
  3. Assign Pn to all elements of the current iteration (to signify that all elements belong to same set).
  4. If any element of the set already has a pointer Pm  assigned to it, change the pointer value and make it point to the unique pointer for the current set.
    ie Pm –> Pn
  5. By doing step 4, this one single operation now has effectively performed a union of all the elements of 2 sets & since we only change value of one pointer, this saved us the costly set union operation.
  6. In the end, calculate the final value of the pointer for all the elements and stop.

Let me demonstrate the working using the average case scenario, where the 2nd last & last cluster is going to cause a LOT of merge operations. What happens here is that for each set, I assign it a auto-incremented clusterID as the unique pointer, which for my implementation is integers, but for sake of brevity, lets consider them as alphabets. The ClusterMapping is an Array implementation of the pointer data structure for Java.

  • For the first set, which is { 1 , 3 , 5 } we iterate through all the elements and since they don’t have any assigned value in MainArray, they will be assigned clusterID – ‘A’.
  • Similarly all the elements in the sets { 2 , 4 , 6 } were encountered for the first time and were assigned clusterIDs ‘B’.
  • Things get interesting when the 3rd input { 1 , 2 , 7 } is parsed.
  • The current clusterID is ‘C’
  • 1 already has clusterID ‘A’ .. So change the pointer of ‘A’ to ‘C’
  • This is done by changing the value of index A in ClusterMapping and making it point to C. (QuickMerge — changing one entry combines all elements of set 1 and 3)
  • Similarly for 2, which points to B, we change the mapping of ‘B’ and ‘C’.
  • Lastly we process { 7 , 8 , 9 }
  • The current clusterID is ‘D’
  • The entry 7 changes mapping of ‘C’ to ‘D’. 
  • So now all entries pointing to C will now point to D (A –> C –> D and B –> C –> D).

The above operation took time proportional to O(m*n) because we iterated all the n elements of m sets. By the end of this step, all the input is processed.
The last step is to do the final one time calculation of the lazy evaluation which was done earlier, which is to iterate through the MainArray and for each element, find its final pointer. This in the worst case, would also take O(m*n) time. In the above example, all the elements of 4 sets have their pointers point to the same clusterID –> ‘D’. This indicates that they belong to the same cluster. This marks the end of the clustering operation which took less than 5 mins (280 seconds to be precise).

So, now by analyzing the time complexity, we can see that this approach only takes 2 * O(m*n) * O(1).
Where O(1) is the constant time taken for array accesses. Lets call that C2. So the final cost of the algorithm can be approximated to  ~n*mC2

Also for the sake of comparison, since Time needed C1  (performing HashMap isPresent() and put) is vastly greater than time needed for C2 (array access cost) we can say that compared to the original approach’s complexity of  ~n3mC1 , my new approach only has a complexity of ~n*m.

 The Impact

The significance of the time reduction is made apparent by the running time analysis. This gap between the 2 versions widen even more as the input data increases.

  • For input of consisting of  n –> 1 million & m –> 1  million
    Time taken is ; Old –> 64 hours ; New –> 300 seconds
  • For input of consisting of  n –> 8 million & m –> 1  million
    Time taken is ; Old –> 8 days ; New –> 2 hours

With that outta the way, lemme know what you people think about my approach using the comments. Wrapping things with a quote which I feel is perfect for this post.

“Sometimes, Some of the Biggest Problems Have the Simplest Solutions”