Compressing DataFrames (in Parallel)

From DataFrame to CompressedDF

lzhw doesn't only work on lists; it also compresses pandas dataframes and saves them into compressed files that can be decompressed later.

import lzhw
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 1, 3, 4, 4],
                   "b": ["A", "A", "B", "B", "A", "C", "D", "D"]})
comp_df = lzhw.CompressedDF(df)
# 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2003.97it/s]
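
CompressedDF compresses the dataframe column by column, which is why the progress bar shows 2/2 here; comp_df.compressed holds one compressed object per column:

print(len(comp_df.compressed))
# 2, one compressed object per column of df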

Let's check the space saved by compression

from sys import getsizeof

comp_space = 0
for i in range(len(comp_df.compressed)):
    comp_space += comp_df.compressed[i].size()

print(comp_space, getsizeof(df))
# 296 712

## Test information loss
print(list(map(int, comp_df.compressed[0].decompress())) == list(df.a))
# True
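
The same check works for the string column as well; a small illustrative check, assuming the compressed objects follow the original column order and that values decompress back to strings:

## Check the string column too
print(list(comp_df.compressed[1].decompress()) == list(map(str, df.b)))
# True if the string column round-trips unchanged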

Saving and Loading Compressed DataFrames

With lzhw we can save a dataframe into a compressed file and then read it again, using the save_to_file method and the decompress_df_from_file function.

Let's try to decompress in parallel

## Save to file
comp_df.save_to_file("comp_df.txt")

## Load the file
original = lzhw.decompress_df_from_file("comp_df.txt", parallel = True)
# 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2004.93it/s]

print(original)
#   a  b
#0  1  A
#1  1  A
#2  2  B
#3  2  B
#4  1  A
#5  3  C
#6  4  D
#7  4  D
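
As a quick sanity check, mirroring the comparison we do for the bigger dataframe below, we can compare the reloaded dataframe against the original (values may come back as strings, hence the int cast):

## Verify the round trip
print(original.shape == df.shape)
print(list(map(int, original.a)) == list(df.a))
# both should print True if nothing was lost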

Compressing Bigger DataFrames

Let's try to compress a real-world dataframe, the german_credit.xlsx file from the UCI Machine Learning Repository [1].

The original text file is 219 KB on disk.

gc_original = pd.read_excel("examples/german_credit.xlsx") 
comp_gc = lzhw.CompressedDF(gc_original, parallel = True, n_jobs = -3) # n_jobs = -3, the default, uses all CPUs but 2
# 100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 257.95it/s]

## Compare sizes in Python:
comp_space = 0
for i in range(len(comp_gc.compressed)):
    comp_space += comp_gc.compressed[i].size()

print(comp_space, getsizeof(gc_original))
# 4488 548852

print(list(map(int, comp_gc.compressed[0].decompress())) == list(gc_original.iloc[:, 0]))
# True

A huge space saving of around 99%, with no information loss!
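
To make that figure concrete, here is the saving expressed as a percentage of the in-memory size, reusing comp_space and getsizeof from above:

print(round(100 * (1 - comp_space / getsizeof(gc_original)), 2))
# ≈ 99.18 given the 4488 and 548852 printed above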

Let's now write the compressed dataframe into a file and compare the sizes of the files.

comp_gc.save_to_file("gc_compressed.txt")

The compressed file is 38 KB on disk, meaning that in total we saved around 82%. Future versions will be optimized to save even more space.
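
We can confirm the on-disk numbers programmatically; the file name is the one used in this example and the 219 KB figure comes from above:

import os
print(os.path.getsize("gc_compressed.txt") / 1024)  # ≈ 38 KB
print(round(100 * (1 - 38 / 219), 1))               # ≈ 82.6% saved on disk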

Let's now check whether we lose any information when we reload the file.

## Load the file
gc_original2 = lzhw.decompress_df_from_file("gc_compressed.txt")
# 100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 259.46it/s]

print(list(map(int, gc_original2.iloc[:, 13])) == list(gc_original.iloc[:, 13]))
# True

print(gc_original.shape == gc_original2.shape)
# True

Perfect! There is no information loss at all.

We can also adjust the sliding window of the LZ77 algorithm

comp_gc512 = lzhw.CompressedDF(gc_original, sliding_window = 512)
# 100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 353.21it/s]

Adjusting the sliding window can be very useful when we want a more compressed output.
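
To see the effect of the larger window, we can total its compressed size the same way as before and compare it with comp_space from the default settings (the exact numbers depend on the data, so no output is shown here):

comp_space512 = 0
for i in range(len(comp_gc512.compressed)):
    comp_space512 += comp_gc512.compressed[i].size()

print(comp_space512, comp_space)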

Reference
    [1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.