## Approximating Linear Cost Computation for Polynomial Cost Computation Problems

Polynomial cost computation simply does not scale, while linear cost computation usually does, depending on the slope of its growth. Fixed cost computation is almost always scalable regardless of the data size. For this reason linear cost computation is always preferable to polynomial cost computation, and fixed cost computation is by far preferable to either of the other two.

Unfortunately, for most inherently polynomial problems, computational complexity theory offers no known way to reduce them to linear cost or fixed cost algorithms. Anyone who found a general way to do so would probably win a Turing Award or become extremely rich.

**So how do we scale inherently polynomial cost algorithms?**

Well, the only way to do this is to think sideways: work around the problem to reduce the effect of the factors that cause polynomial computation cost. The end goal is to approach a reasonable linear computation cost for the average data size that the implementation is built for.

**Examples of polynomial cost algorithms:**

- Non-planar object graphs, and worse still, graphs requiring subgraph isomorphism checks
- Demographic usage statistics computation on massively concurrent applications and advertising platforms.
- Almost any kind of cluster analysis in domains such as data mining, semantic content analysis, etc.

**An approach for approximating linear cost computation for polynomial cost algorithms:**

The computational cost of polynomial cost algorithms is typically of the form (n * x)^k for some exponent k > 1, where:

n represents the cost of the algorithm components that contribute to the overall computational complexity.

x is the volume of data to be processed.

There is no way to get rid of the polynomial cost function itself. However, the values of both n and x can almost always be reduced. The final objective here is to reduce the computational cost such that it approximates a linear cost for the volume of data that the application is expected to handle. This is a two-pronged approach. By reducing the compute costs of the expensive components of the algorithm (lowering n), we allow for a larger data size before the cost begins to rise too steeply. By reducing the amount of data that actually needs to be processed for a given total data volume (lowering x), we achieve the same effect.
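To make the two-pronged approach concrete, here is a toy sketch in Python, assuming a simple quadratic cost model; the exponent and the numbers are illustrative, not taken from any real workload:

```python
# Illustrative only: a toy quadratic cost model cost(n, x) = (n * x) ** 2,
# where n is the per-component cost factor and x is the data volume.
def cost(n, x):
    return (n * x) ** 2

# Baseline: component cost n = 4, data volume x = 1000.
baseline = cost(4, 1000)

# Prong 1: halving the component cost n lets us process twice the
# data volume for the same total cost.
assert cost(2, 2000) == baseline

# Prong 2: halving the data x that actually needs processing has the
# same effect as halving the component cost n.
assert cost(4, 500) == cost(2, 1000)
```

Neither prong changes the shape of the curve, but together they push the steep part of the curve past the data sizes the application actually sees.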

**Background – What do Hash Tables have to offer:**

Wading through a large set of data is one of the primary contributors to computational cost when implementing polynomial cost algorithms.

Hash tables are almost a magic wand here. A hash table is, by construction, a structure that reduces an inherently non-scalable lookup cost to an almost fixed cost, even for very large data sizes.

The computational cost of sorting a list of items grows faster than linearly as the list grows, regardless of the sorting algorithm: even the best comparison sorts cost O(n log n). Similarly, the computational cost of finding an item within an unsorted list grows linearly with the size of the list: finding an item in a list of 1,000 takes on average 10 times as long as finding one in a list of 100. The hash table construct reduces the computational cost of finding an item within a large list to an almost fixed value.
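This contrast can be sketched in Python by counting comparisons, using a plain list for linear search and a dict as the hash table; the helper function and the sizes are illustrative:

```python
# A minimal sketch: counting comparisons for a linear search versus a
# single hash lookup in a Python dict (a hash table).
def linear_search(items, target):
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return comparisons
    return comparisons

small = list(range(100))
large = list(range(1000))

# Worst case: the linear search cost scales with the list size ...
assert linear_search(small, 99) == 100
assert linear_search(large, 999) == 1000  # 10x the list, 10x the work

# ... while a dict lookup is a single hash-and-probe, whatever the size.
index = {value: True for value in large}
assert 999 in index
```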

Any decent hash table should provide a very small, effectively fixed computation cost for data sizes up to 100,000 items, and only two or three times that cost up to 1,000,000 items.

Standard hash table implementations begin to break down as we approach the 1,000,000 item range, due to other constraints such as memory allocation. For example, on 32-bit platforms, crossing the 2GB mark for a single object requires memory address extensions. Even on 64-bit platforms, unless the machine has a great deal of memory, finding a contiguous block of over 1GB in main memory is not always possible.

For implementations that require over 1,000,000 items, consider using the Google SparseHash implementation or a custom segmented hash table (a hash table that internally uses a list of hash tables).
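A minimal sketch of such a segmented hash table, using Python dicts as the underlying segments; the class and method names here are ours, not from any particular library:

```python
# A hypothetical segmented hash table: a fixed list of smaller dicts,
# with the top-level hash of the key selecting the segment, so no
# single underlying table needs one huge contiguous allocation.
class SegmentedHashTable:
    def __init__(self, segments=64):
        self.segments = [dict() for _ in range(segments)]

    def _segment(self, key):
        # Route each key to one segment via its hash.
        return self.segments[hash(key) % len(self.segments)]

    def put(self, key, value):
        self._segment(key)[key] = value

    def get(self, key, default=None):
        return self._segment(key).get(key, default)

table = SegmentedHashTable()
table.put("user:42", {"visits": 7})
assert table.get("user:42") == {"visits": 7}
assert table.get("missing") is None
```

Lookups still cost two hash probes at most, but the memory pressure is spread across many smaller tables.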

**An Approach Using Hash Tables:**

As we have seen, hash tables give us what is, for all practical purposes, a fixed computational cost for finding a piece of data within a list.

When you have a polynomial cost implementation, inspect it to determine which components can be offloaded to hash table based alternatives. While this does not give us a linear computation cost for arbitrary data sizes, it allows us to approximate a linear cost for the average data size that the implementation is intended to process.

Three distinct examples follow:

**For Object Graphs:**

For a non-planar graph, or even a planar one, consider holding in main memory all vertex-edge-vertex combinations as hash tables of object pointers and/or object metadata. In addition, consider holding each independent set in another set of hash tables. The processing algorithms can then be modified to use the in-memory hash table representation to rapidly navigate the entire graph. Navigating a graph through this type of hash table representation is many orders of magnitude faster than navigating more natural representations such as doubly linked lists.
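A minimal Python sketch of this representation, using dicts and sets as the hash tables; the vertex names and edge labels are hypothetical:

```python
# Hash table representation of a graph: each (source, edge_label, target)
# triple is hashed for O(1) membership tests, and an adjacency index
# supports rapid navigation without traversing linked structures.
from collections import defaultdict

triples = set()
adjacency = defaultdict(set)

def add_edge(source, label, target):
    triples.add((source, label, target))
    adjacency[source].add((label, target))

add_edge("A", "cites", "B")
add_edge("B", "cites", "C")
add_edge("A", "links", "C")

# Edge existence check: one hash lookup, no traversal.
assert ("A", "cites", "B") in triples

# Navigation: all outgoing edges of "A" in one lookup.
assert adjacency["A"] == {("cites", "B"), ("links", "C")}
```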

**For Demographic Usage Statistics on Highly Concurrent Platforms:**

Here we can reduce an otherwise ridiculously expensive computational cost into a negligibly small, almost fixed cost computation. The slope of the linear cost using this approach is so low that for most purposes it can be considered fixed cost.

A series of hash tables can be used to create in-memory data buckets for all demographic attributes. This data can then be easily aggregated over a period of time and committed to the data store on a delayed cycle, once every few minutes or hours (depending on how live the stored data needs to be).
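A sketch of this bucketing approach in Python, assuming events arrive as attribute dicts; the flush function and the store it writes to are illustrative stand-ins:

```python
# In-memory demographic buckets: one counter increment per attribute,
# a near-fixed cost per event regardless of accumulated data volume.
from collections import Counter

buckets = Counter()

def record_event(event):
    # Bucket each (attribute, value) pair of the incoming event.
    for attribute, value in event.items():
        buckets[(attribute, value)] += 1

record_event({"country": "US", "age_band": "25-34"})
record_event({"country": "US", "age_band": "35-44"})

assert buckets[("country", "US")] == 2
assert buckets[("age_band", "25-34")] == 1

def flush_to_store(store):
    # Called on the delayed cycle (every few minutes or hours):
    # commit the aggregates and reset the in-memory buckets.
    store.update(dict(buckets))
    buckets.clear()
```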

**For Semantic Content Analysis:**

In these implementations many of the sub-components have a cost on the order of X * N, where X is the number of known data items and N is the number of data items in the content being analyzed. Using hash tables, we can reduce this cost to roughly N alone. For example, consider comparing known contextual content patterns against those extracted from the content being analyzed.

Known contextual content patterns and their metadata can be stored in a series of interlinked hash tables, and the content patterns within a piece of content to be analyzed in a separate series of interlinked hash tables. The otherwise expensive process of exhaustively comparing extracted content patterns against known content patterns then ceases to be expensive, because partial equality comparison between two sets of hash tables is extremely fast and efficient.
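A minimal sketch of this comparison in Python, using hash-backed sets to hold the patterns; the example patterns are invented:

```python
# Hash-based pattern comparison: known patterns and extracted patterns
# each live in a hash-backed set, so their overlap is computed without
# an exhaustive pairwise comparison.
known_patterns = {("buy", "now"), ("free", "shipping"), ("sign", "up")}
extracted = {("free", "shipping"), ("read", "more"), ("sign", "up")}

# Set intersection hashes each extracted pattern once:
# cost ~ O(len(extracted)) instead of O(len(extracted) * len(known_patterns)).
matches = extracted & known_patterns
assert matches == {("free", "shipping"), ("sign", "up")}
```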