Handling a large tabular dataset using Vaex

When you have to handle a large tabular dataset (for example, 100 GB dataset with one billion rows), which cannot fit into the RAM on your computer, you can read the dataset by chunks using Pandas. You can filter a chunk of the data based on a condition and concatenate the filtered data later for your downstream analysis.

Alternatively, you can handle such a large dataset using Vaex, which is a Python library for lazy Out-of-Core DataFrames (similar to Pandas), to explore big tabular datasets.

import vaex

print(vaex.__version__)
{'vaex': '4.1.0', 'vaex-core': '4.1.0', 'vaex-viz': '0.5.0', 'vaex-hdf5': '0.7.0', 'vaex-server': '0.4.0', 'vaex-astro': '0.8.0', 'vaex-jupyter': '0.6.0', 'vaex-ml': '0.11.1'}

Using Vaex, you can read a file in chunks, and convert each chunk to a temporary HDF5 file on disk. All the temporary files are then converted into a single HDF5 file and deleted.

df_vaex = vaex.from_csv('dataset.csv', convert=True)

This conversion process takes time (for example, several hours depending on the size of the dataset and your computer). If you only need some of the columns, you can specify them for reducing the loading time and file size.

columns = ['patient_id', 'sex', 'age', 'diagnosis_date', 'icd_code', 'disease_name']
df = vaex.from_csv('dataset.csv', convert=True, usecols=columns)

Once you generated the HDF5 file, you can instantly start to work on the dataset the next time.

df = vaex.open('dataset.csv.hdf5')

Now you are ready to explore the dataset.

# Show the number of rows
print(len(df))

# Filter by age
df[df.age >= 65]

# Calculate the average age
df.age.mean()

# Calculate the the number of each sex
df.sex.value_counts()

# Calculate the average age in each sex
df.groupby(by='sex').agg({'age': 'mean'})

You can filter the dataset by a condition and convert to Pandas DataFrame for further analysis. In this example, assume that you have a list of ICD-10 codes of interest, and you want to filter the dataset by the ICD-10 codes.

# Create an empty Pandas DataFrame
df_out = pd.DataFrame(columns=['patient_id', 'sex', 'age', 'diagnosis_date', 'icd_code', 'disease_name'])

# You can specify the data types
df_out = pd.DataFrame({
    'patient_id': pd.Series([], dtype='int64'), 
    'sex': pd.Series([], dtype='int64'), 
    'age': pd.Series([], dtype='int64'), 
    'diagnosis_date': pd.Series([], dtype='object'), 
    'icd_code': pd.Series([], dtype='object'), 
    'disease_name': pd.Series([], dtype='object')  
})

# Convert the empty Pandas DataFrame to Vaex DataFrame
df_out = vaex.from_pandas(df_out)

# Filter the rows containing the ICD-10 codes of interest
# Here icd_codes is a list containing ICD-10 codes
for code in icd_codes:
    df_out = df_out.concat(df[df.icd_code.str.startswith(pat=code)])

# Convert the filtered Vaex DataFrame to Pandas DataFrame
df_pandas = df_out.to_pandas()

# Save the filtered Vaex DataFrame as the HDF5 and CSV format
df_out.export_hdf5('dataset_filtered.hdf5', progress=True)
df_out.export_csv('dataset_filtered.csv', progress=True)

Comments

Copied title and URL