Hello and welcome to the first blog of GSoC phase 2. Ever faced a time when there was way too much on your plate and you found it really hard to catch up on all the work? That is pretty much how the previous two weeks were for me. With the start of the on-campus internship drive, I was finding it really hard to manage the project. Somehow I was able to make some progress, but I am yet to complete the task.
I am basically trying to parse the .bz2 files of the HITEMP databases into HDF5 files in a Vaex-friendly format. Currently, the .bz2 files are parsed into HDF5 files with the help of high-level pandas functions. But as we already know, pandas can be very memory-consuming. So I am trying to write to HDF5 files with the h5py library and produce HDF5 files that are Vaex-friendly (column-based).
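To make "column-based" a bit more concrete, here is a minimal sketch of appending data to one resizable h5py dataset per column. It is only an illustration: the function name `append_columns`, the column names, and the root-level dataset layout are my assumptions, and the exact group layout that Vaex expects should be checked against the Vaex documentation.

```python
import h5py
import numpy as np


def append_columns(h5_path, columns, chunk_rows=100_000):
    """Append a dict of {column name: 1-D array} to per-column datasets.

    Hypothetical helper: each column lives in its own resizable 1-D dataset,
    so only one chunk of rows has to be held in memory at a time.
    """
    with h5py.File(h5_path, "a") as f:
        for name, values in columns.items():
            values = np.asarray(values)
            if name not in f:
                # Create an empty dataset that can grow along axis 0.
                f.create_dataset(
                    name,
                    shape=(0,),
                    maxshape=(None,),
                    dtype=values.dtype,
                    chunks=(chunk_rows,),
                )
            dset = f[name]
            start = dset.shape[0]
            # Grow the dataset, then write the new rows at the end.
            dset.resize(start + len(values), axis=0)
            dset[start:] = values
```

Calling `append_columns` once per chunk of rows keeps the peak memory tied to the chunk size rather than to the size of the whole database file.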
In order to do this, I am first converting the .bz2 files to .csv upon download -> mapping the datatypes of each of the columns -> writing to an HDF5 file with h5py. I am currently stuck at mapping the datatypes and am also trying to optimize the chunksize.
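The first two steps of that pipeline can be sketched roughly as follows, using only the Python standard library. This is not the final implementation: `COLUMN_TYPES`, `decompress_bz2`, and `iter_typed_chunks` are hypothetical names, the three columns are placeholders rather than the real HITEMP columns, and the default chunksize is an arbitrary starting point.

```python
import bz2
import csv
import shutil
from itertools import islice

# Hypothetical column-to-type mapping; the real HITEMP columns differ.
COLUMN_TYPES = {"id": int, "wav": float, "int": float}


def decompress_bz2(src_path, dst_path, block_size=1024 * 1024):
    """Stream-decompress src_path to dst_path, one block at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, block_size)


def iter_typed_chunks(csv_path, column_types, chunksize=100_000):
    """Yield {column name: list of typed values} for chunksize rows at a time."""
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        while True:
            rows = list(islice(reader, chunksize))
            if not rows:
                break
            # Apply the mapped type to every value of each column.
            yield {
                name: [cast(row[name]) for row in rows]
                for name, cast in column_types.items()
            }
```

Each yielded chunk is already organised by column, so it can be handed straight to whatever writes the per-column HDF5 datasets. A larger chunksize means fewer writes but more memory per chunk, which is the trade-off I am still tuning.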
This blog is just a quick update on the things that are happening currently. Once writing to the HDF5 files is completed, look out for a detailed tutorial on the same on Towards Data Science :)