Last weekend, a friend of mine had a peculiar request - to decipher SHA-256 hashes back to the original strings.
We had little to go on, except that the original strings are IP addresses. As most people in IT know,
the goal of hashing is to encode strings one-way, with one catch:
if the input space is small enough, you can generate all possible combinations and look up the hash.
Long story short, there are two ways to reverse hashing:
rainbow tables (using reduction functions, as per Philippe Oechslin's 2003 paper; usually ~99% probability of success)
a precomputed table of every possible input (100% probability, basically a reverse lookup)
Rainbow tables are a probabilistic approach and we need 100% certainty; I'll do a deep dive into their performance in a separate post.
We will focus on the "brute force" approach of generating SHA-256 hashes for the whole (finite) set of strings.
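To make the reverse-lookup idea concrete, here's a toy sketch over a handful of made-up addresses; the real table simply does the same thing for every valid IP:

from hashlib import sha256

# Toy reverse-lookup table over a tiny, made-up sample of addresses;
# the full table covers every valid IPv4 address.
table = {
    sha256(ip.encode()).hexdigest(): ip
    for ip in ('192.168.0.1', '10.0.0.1', '172.16.0.1')
}

unknown_hash = sha256('10.0.0.1'.encode()).hexdigest()
print(table.get(unknown_hash))  # -> 10.0.0.1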
In general, our chosen approach tends to be impractical - slow and quite expensive:
if we were to generate, say, all lowercase alphabetical strings of six characters ([a-z]), we'd be looking at 26^6 combinations - some 308M.
We have an important advantage, however: our namespace, length and string structure are all finite and predefined.
The total universe is 2^32 ≈ 4.3B addresses. According to the specification, we'll remove the following networks (a quick count of what's left follows the list):
0.0.0.0/8
255.0.0.0/8
224.0.0.0/4
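To sanity-check the scope, here's the arithmetic for what remains after those exclusions:

# Count what's left after excluding 0.0.0.0/8, 255.0.0.0/8 and 224.0.0.0/4 (first octets 224-239).
excluded_first_octets = {0, 255} | set(range(224, 240))
first_octets = [a for a in range(256) if a not in excluded_first_octets]

print(len(first_octets))             # 238 usable /8 networks
print(len(first_octets) * 256)       # 60928 /16 lists to generate
print(len(first_octets) * 256 ** 3)  # ~3.99 billion addresses to hash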
The approach I'm taking is as follows:
Generate SHA-256 hashes and export them to a widely available, text-based format (e.g. CSV)
Ingest the generated data into a database or key-value store for fast lookup. The suggested technology is Azure Table Storage.
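Step 2 is out of scope for this post, but for reference, a minimal ingestion sketch using the azure-data-tables package could look roughly like this (the connection string, table name and file name are placeholders):

import pandas as pd
from azure.data.tables import TableServiceClient

# Placeholders: swap in your own connection string and table name.
service = TableServiceClient.from_connection_string(conn_str="<your-connection-string>")
table = service.create_table_if_not_exists("sha256ip")

# Load one of the generated CSVs and upsert each row as an entity;
# the CSVs already carry the required PartitionKey and RowKey columns.
for row in pd.read_csv("1.0.csv", dtype=str).to_dict("records"):
    table.upsert_entity(row)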
Initial version
Let's warm up with this:
from hashlib import sha256
import pandas as pd

for a in range(1, 256):
    for b in range(256):
        ip_list = pd.DataFrame()
        for c in range(256):
            for d in range(256):
                ip = '.'.join([str(a), str(b), str(c), str(d)])
                digest = sha256(ip.encode()).hexdigest()
                row = {"RowKey": digest, "PartitionKey": digest[0:4], "ip": ip}
                # DataFrame.append copies the whole frame on every call - this is what makes it slow
                ip_list = ip_list.append(row, ignore_index=True)
        # one CSV per /16, named a.b.csv
        ip_list.to_csv('.'.join([str(a), str(b), 'csv']), index=False)
How anticlimactic - to generate a /16 network, the average time was 368 seconds (sample size: 40).
We'll need to optimize: we have a total of 60928 lists to generate, which would take ~212 days at our current rate!
Nested generators
Here's an improved version of the same piece, this time making use of (nested) generators:
from hashlib import sha256
from timeit import default_timer as timer
import pandas as pd

for a in range(1, 256):
    for b in range(256):
        start = timer()
        # build the whole /16 in one go by feeding a generator expression to the DataFrame constructor
        ip_list = pd.DataFrame(
            columns=["RowKey", "PartitionKey", "ip"],
            data=(
                {
                    "RowKey": sha256('.'.join([str(a), str(b), str(c), str(d)]).encode()).hexdigest(),
                    "PartitionKey": sha256('.'.join([str(a), str(b), str(c), str(d)]).encode()).hexdigest()[0:4],
                    "ip": '.'.join([str(a), str(b), str(c), str(d)])
                }
                for d in range(256) for c in range(256)
            )
        )
        ip_list.to_csv('.'.join([str(a), str(b), 'csv']), index=False)
        end = timer()
        print('Time to generate', '.'.join([str(a), str(b), 'X', 'X']), ':', round(end - start, 2), 'seconds')
Now, that's more like it - about half a second for a /16; hopefully we're on the right track here.
Using a triple-nested generator, the time to calculate a /8 SHA-256 table is, on average, 4 minutes (sample size of 40) - better than the result we previously got for a single /16.
That's a 350x improvement, and with larger datasets, it's precisely what we're gunning for.
But it's not enough.
Also, we've hit another issue - writing to disk. A set of 2^24 addresses (16M rows) takes 1.37 GB on disk - and roughly 25-35% of the 4 minutes.
We'll take that as a "natural" limitation which we'll ignore for two reasons:
The CSVs are not the final destination of our data;
Writing to disk is a topic of its own, and we'll assume limitations there are a given.
Threads and Multiprocessing in Python
Chances are you're looking for the good thread vs. multiprocessing stuff already. So what are the differences?
Well, the main difference is memory allocation: threads are "lighter" than processes and share the same memory space.
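A quick illustration of that difference (a minimal sketch; the __main__ guard is needed on platforms that spawn rather than fork worker processes):

import multiprocessing as mp
import threading

counter = {'value': 0}

def bump():
    counter['value'] += 1

if __name__ == '__main__':
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    print(counter['value'])  # 1 - the thread mutated the very same dict

    p = mp.Process(target=bump)
    p.start()
    p.join()
    print(counter['value'])  # still 1 - the child process only changed its own copy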
Ideally, you should choose a number of threads that corresponds to the number of virtual cores your computer has,
programmatically available through:
import multiprocessing as mp
print(mp.cpu_count())
Optimizing this parameter actually depends on a multitude of factors, so I suggest profiling and documenting multiple combinations
by running something like a Monte Carlo simulation to identify the best parameters for your machine (see the sketch below).
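A rough sketch of such a profiling loop, sweeping a few candidate worker counts over a small sample workload (the workload and the counts are placeholders; swap mp.Pool for multiprocessing.pool.ThreadPool to profile threads instead):

from hashlib import sha256
from timeit import default_timer as timer
import multiprocessing as mp

def sample_work(prefix):
    # a small, representative slice of the real job: hash one /24
    return [sha256(f'{prefix}.{d}'.encode()).hexdigest() for d in range(256)]

if __name__ == '__main__':
    prefixes = [f'10.0.{c}' for c in range(256)]
    for workers in (2, 4, 8, 16):            # candidate pool sizes to profile
        start = timer()
        with mp.Pool(workers) as pool:
            pool.map(sample_work, prefixes)
        print(workers, 'workers:', round(timer() - start, 2), 's')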
Please note: we're considering options for this parameter only in the context of threads. If we aim at multiple processes, we should
cap the number of processes at the core count shown above: two processes can't execute truly in parallel on the same core.
Another key observation: because of the GIL (Global Interpreter Lock), only one thread executes Python bytecode at any given time.
Therefore, multithreaded Python code is concurrent but not parallel.
Still, one might see improvements in IO-bound tasks, such as asynchronously downloading files, where most of the time is spent idling while waiting for the network response.
Using the threading module in Python (or threads in any other interpreted language with a GIL) can actually result in reduced performance for CPU-bound work.
Our code is performing a CPU-bound task, so using the threading module would only make execution slower, as the quick comparison below illustrates.
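Here's a minimal comparison of running the same CPU-bound hashing sequentially and in a thread pool (the worker count and address range are placeholders):

from concurrent.futures import ThreadPoolExecutor
from hashlib import sha256
from timeit import default_timer as timer

def hash_block(c):
    # hash every address in the (made-up) 10.0.<c>.0/24 block
    return [sha256(f'10.0.{c}.{d}'.encode()).hexdigest() for d in range(256)]

start = timer()
sequential = [hash_block(c) for c in range(256)]
print('sequential:', round(timer() - start, 2), 's')

start = timer()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(hash_block, range(256)))
print('threads:   ', round(timer() - start, 2), 's')  # expect no speedup - the GIL serializes the hashing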
For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.
Basic Multiprocessing
Here's our first multiprocessing version, which spreads the work across multiple cores:
from hashlib import sha256
from timeit import default_timer as timer
import multiprocessing as mp
import pandas as pd

def process(a):
    """Generate and export the SHA-256 table for the whole a.0.0.0/8 network."""
    start = timer()
    ip_list = pd.DataFrame(
        columns=["RowKey", "PartitionKey", "ip"],
        data=(
            {
                "RowKey": sha256('.'.join([str(a), str(b), str(c), str(d)]).encode()).hexdigest(),
                "PartitionKey": sha256('.'.join([str(a), str(b), str(c), str(d)]).encode()).hexdigest()[0:4],
                "ip": '.'.join([str(a), str(b), str(c), str(d)])
            }
            for d in range(256) for c in range(256) for b in range(256)
        )
    )
    end = timer()
    ip_list.to_csv('.'.join([str(a), 'csv']), index=False)
    end2 = timer()
    print('Time to generate', '.'.join([str(a), 'X', 'X', 'X']), ':', round(end - start, 2), 'seconds')
    print('Time to export', '.'.join([str(a), 'X', 'X', 'X']), ':', round(end2 - end, 2), 'seconds')

if __name__ == '__main__':                # guard so worker processes don't re-execute the pool setup
    cores = mp.cpu_count()
    with mp.Pool(int(cores / 2)) as pool:  # use half the cores, leaving the machine usable
        pool.map(process, range(1, 256))
You might notice that I'm running the code above on half of the available cores. The reason is that I want to use my computer for other activities
as well. The results:
with 6 cores, I'm able to generate 6 /8s in 10 minutes: roughly 2 minutes per /8
with 16 cores, 16 /8s take about 30 minutes, again a little less than 2 minutes per /8
In both cases, writing to disk takes 60-70s, and that's a limitation I'll have to live with.
In the end, we managed to cut our work from ~212 days down to 8 hours, a 650x improvement.
Further reading
If you are looking to share memory between cores or processes, a package called SharedArray can help.
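For what it's worth, here's a minimal sketch of SharedArray based on its documented create/attach/delete API (the array name and size are placeholders):

import SharedArray as sa

# Create a named array in shared memory; any process on this machine can attach to it.
progress = sa.create("shm://hash_progress", 256)
progress[:] = 0                           # e.g. one progress flag per /8

# In another process / interpreter:
attached = sa.attach("shm://hash_progress")
attached[42] = 1                          # the update is visible to every attached process

sa.delete("hash_progress")                # remove the segment once everyone is done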
You might want to share the work between multiple servers via a common queue. Consider using RQ.
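And a minimal sketch of the RQ route, assuming a reachable Redis instance and that the process(a) function above lives in an importable module (hashes.py here is a hypothetical name):

from redis import Redis
from rq import Queue

from hashes import process  # hypothetical module containing the process(a) function above

q = Queue(connection=Redis(host='redis.example.com'))  # placeholder Redis host
for a in range(1, 256):
    q.enqueue(process, a)   # any machine running `rq worker` against this Redis will pick up jobs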