How to process a DataFrame with billions of rows in seconds

By Medium - 2020-12-13

Description

Big Data Analysis in Python is having its renaissance. It all started with NumPy, which is also one of the building blocks behind the tool I am presenting in this article. In 2006, Big Data was a…

Summary

  • Yet another Python library for data analysis that you should know about (and no, it is not Spark or Dask).
  • Each month I find a new tool that I am eager to learn.
  • dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000) automatically creates an HDF5 file and persists it to disk.
  • dv.plot1d(dv.col2, figsize=(14, 7)) plots a histogram of a column. Virtual columns: when you add a new column, Vaex creates a virtual column, one that takes no main memory because it is computed on the fly.
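The convert/chunk_size bullet above describes reading a large CSV in fixed-size chunks and persisting the result on disk rather than loading it all at once. A minimal stdlib-only sketch of that chunked-read pattern (the data, function name, and aggregator are hypothetical; Vaex itself does this internally and writes HDF5):

```python
import csv
import io

def process_in_chunks(csv_text, chunk_size, agg):
    """Stream rows from a CSV in chunks of `chunk_size` instead of
    loading everything into memory; fold each chunk into an accumulator."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunk, acc = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            acc = agg(acc, chunk)   # process a full chunk, then drop it
            chunk = []
    if chunk:                       # flush the last partial chunk
        acc = agg(acc, chunk)
    return acc

# Hypothetical data: sum the col2 column, three rows at a time.
data = "col1,col2\n" + "\n".join(f"r{i},{i}" for i in range(10))
result = process_in_chunks(
    data, chunk_size=3,
    agg=lambda acc, ch: acc + sum(int(r["col2"]) for r in ch),
)
```

Only one chunk is ever held in memory at a time, which is what keeps peak memory flat no matter how large the input file grows.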
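The virtual-column bullet above is a lazy-evaluation idea: the column is an expression, not a materialized array. The mechanism can be sketched in plain Python (the class and method names are illustrative, not Vaex's implementation):

```python
class TinyFrame:
    """Toy frame: real columns hold data; virtual columns hold
    expressions that are evaluated on the fly, per row."""
    def __init__(self, **columns):
        self.columns = columns   # materialized data
        self.virtual = {}        # name -> callable(frame, row_index)

    def add_virtual(self, name, expr):
        # Store only the expression; no array of results is allocated.
        self.virtual[name] = expr

    def value(self, name, i):
        if name in self.virtual:
            return self.virtual[name](self, i)   # computed on demand
        return self.columns[name][i]

df = TinyFrame(col1=[1, 2, 3], col2=[10, 20, 30])
# A "virtual column": costs a closure, not a third list in memory.
df.add_virtual("ratio", lambda f, i: f.value("col2", i) / f.value("col1", i))
```

Reading df.value("ratio", i) recomputes the expression each time; that trade of CPU for memory is what lets a derived column on billions of rows cost essentially nothing to create.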


Topics

  1. Backend (0.34)
  2. Coding (0.18)
  3. Database (0.16)

Similar Articles

Reducing memory usage in pandas with smaller datatypes

By Medium - 2021-03-15

Managing large datasets with pandas is a pretty common issue. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library mentioned…
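The downcasting idea this article describes (pandas exposes it via astype and pd.to_numeric with downcast=...) can be illustrated with the stdlib array module; the variable names and data here are hypothetical:

```python
from array import array

values = list(range(256))        # every value fits in one unsigned byte

wide = array("q", values)        # 8-byte signed ints, like an int64 column
narrow = array("B", values)      # 1-byte unsigned ints, like a uint8 column

# Same values, roughly 1/8 of the buffer: per-item size drops 8x.
saving = wide.itemsize / narrow.itemsize
```

Choosing the narrowest dtype that still holds the column's value range is the whole trick; the risk to watch for is silent overflow if future data exceeds that range.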

15 Essential Steps To Build Reliable Data Pipelines

By Medium - 2020-12-01

If I learned anything from working as a data engineer, it is that practically any data pipeline fails at some point. Broken connection, broken dependencies, data arriving too late, or some external…