(De)Compressing specific columns or rows from a dataframe (in Parallel)

(De)Compressing in Chunks

With lzhw you can choose which columns of a data frame you are interested in compressing. The CompressedDF class has a selected_cols argument for this.

import lzhw
import pandas as pd
gc_original = pd.read_excel("examples/german_credit.xlsx")
comp_gc = lzhw.CompressedDF(gc_original, selected_cols = [0, 3, 4, 7])
# 100%|███████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 401.11it/s]
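Since selected_cols takes positional indices here, it can help to map column names to positions first. A minimal sketch using plain pandas (the helper name cols_to_indices is ours, not part of lzhw):

```python
import pandas as pd

def cols_to_indices(df, names):
    """Map column names to positional indices, suitable for selected_cols."""
    return [df.columns.get_loc(n) for n in names]

# Small stand-in data frame for illustration.
df = pd.DataFrame({"Duration": [6, 48], "Amount": [1169, 5951],
                   "Age": [67, 22], "Class": ["Good", "Bad"]})
print(cols_to_indices(df, ["Duration", "Age"]))  # → [0, 2]
```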

Also, when you have a compressed file, you don't have to decompress all of it; you can choose specific columns and/or rows to decompress. This way you can deal with large compressed files, performing operations column by column quickly while avoiding memory errors. The decompress_df_from_file function has the same selected_cols argument.

gc_original2 = lzhw.decompress_df_from_file("gc_compressed.txt", selected_cols = [0, 4],
                                            parallel = True)
# 100%|████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 3348.53it/s]

gc_original2.head()
#   Duration    Age
#0         6    67
#1        48    22
#2        12    49
#3        42    45
#4        24    53

Let's compare this subset with the original df.

gc_original.iloc[:, [0, 4]].head()
#   Duration    Age
#0         6    67
#1        48    22
#2        12    49
#3        42    45
#4        24    53

Perfect!

We can also select columns by name:

gc_subset = lzhw.decompress_df_from_file("gc_compressed.txt", 
                                         selected_cols=["Age", "Duration"])
# 100%|████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 6220.92it/s]

print(gc_subset.head())
#   Duration    Age
#0         6    67
#1        48    22
#2        12    49
#3        42    45
#4        24    53

selected_cols defaults to "all", which decompresses every column.

decompress_df_from_file has another argument, n_rows, to specify how many rows to decompress.

Its default value is 0, which decompresses the whole data frame; when set, decompression runs from the start up to the desired number of rows.
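In pandas terms, decompressing with n_rows = k corresponds to keeping only the first k rows of the full data frame, i.e. df.iloc[:k]. A tiny self-contained illustration (plain pandas, no lzhw):

```python
import pandas as pd

df = pd.DataFrame({"Duration": [6, 48, 12, 42, 24, 36, 24],
                   "Age": [67, 22, 49, 45, 53, 35, 53]})

# Taking the first 6 rows, as n_rows = 6 would during decompression.
subset = df.iloc[:6]
print(subset.shape)  # → (6, 2)
```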

gc_original_subset = lzhw.decompress_df_from_file("gc_compressed.txt", n_rows = 6)
# 100%|████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 914.21it/s]

print(gc_original_subset.shape)
# (6, 62)

This can be very helpful when reading very big data in chunks of rows and columns, to avoid MemoryError and to apply operations and online algorithms faster.
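To make the column-by-column idea concrete, here is an illustrative sketch that computes a running (online) mean one column at a time, so only a single column is in memory at once. The load_column function below is a hypothetical stand-in for calling decompress_df_from_file with selected_cols; the streaming logic itself is plain Python:

```python
import pandas as pd

# Stand-in data; in practice each column would come from
# decompress_df_from_file("gc_compressed.txt", selected_cols=[i]).
table = pd.DataFrame({"Duration": [6, 48, 12, 42, 24],
                      "Amount": [1169, 5951, 2096, 7882, 4870]})

def load_column(i):
    """Hypothetical stand-in for decompressing a single column."""
    return table.iloc[:, [i]]

means = {}
for i in range(table.shape[1]):
    col = load_column(i)          # only one column in memory at a time
    name = col.columns[0]
    # Online running mean over the column's rows.
    running, count = 0.0, 0
    for value in col.iloc[:, 0]:
        count += 1
        running += (value - running) / count
    means[name] = running

print(means)  # → {'Duration': 26.4, 'Amount': 4393.6}
```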

gc_original_subset_smaller = lzhw.decompress_df_from_file("gc_compressed.txt", 
                                                  selected_cols = [1, 4, 8, 9], 
                                                  n_rows = 6)
# 100%|████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 3267.86it/s]

print(gc_original_subset_smaller.shape)
# (6, 4)

print(gc_original_subset_smaller)
#   Amount Age ForeignWorker Class
# 0   1169  67             1  Good
# 1   5951  22             1   Bad
# 2   2096  49             1  Good
# 3   7882  45             1  Good
# 4   4870  53             1   Bad
# 5   9055  35             1  Good