How to process a DataFrame with billions of rows in seconds

By Medium - 2020-12-13

Description

Big Data Analysis in Python is having its renaissance. It all started with NumPy, which is also one of the building blocks behind the tool I am presenting in this article. In 2006, Big Data was a…

Summary

  • Yet another Python library for data analysis that you should know about (and no, it is not Spark or Dask).
  • Each month I find a new tool that I am eager to learn.
  • dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000) automatically creates an HDF5 file and persists it to disk.
  • dv.plot1d(dv.col2, figsize=(14, 7)) plots a histogram of a column. Virtual columns: when you add a new column, Vaex creates a virtual column, one that takes no main memory because it is computed on the fly.
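The convert/chunk_size bullet above describes reading a large CSV in fixed-size chunks and persisting the result on disk rather than loading it all at once. A minimal stdlib-only sketch of that chunked-read pattern (the data, function name, and aggregator are hypothetical; Vaex itself does this internally and writes HDF5):

```python
import csv
import io

def process_in_chunks(csv_text, chunk_size, agg):
    """Stream rows from a CSV in chunks of `chunk_size` instead of
    loading everything into memory; fold each chunk into an accumulator."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunk, acc = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            acc = agg(acc, chunk)   # process a full chunk, then drop it
            chunk = []
    if chunk:                       # flush the last partial chunk
        acc = agg(acc, chunk)
    return acc

# Hypothetical data: sum the col2 column, three rows at a time.
data = "col1,col2\n" + "\n".join(f"r{i},{i}" for i in range(10))
result = process_in_chunks(
    data, chunk_size=3,
    agg=lambda acc, ch: acc + sum(int(r["col2"]) for r in ch),
)
```

Only one chunk is ever held in memory at a time, which is what keeps peak memory flat no matter how large the input file grows.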
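The virtual-column bullet above is a lazy-evaluation idea: the column is an expression, not a materialized array. The mechanism can be sketched in plain Python (the class and method names are illustrative, not Vaex's implementation):

```python
class TinyFrame:
    """Toy frame: real columns hold data; virtual columns hold
    expressions that are evaluated on the fly, per row."""
    def __init__(self, **columns):
        self.columns = columns   # materialized data
        self.virtual = {}        # name -> callable(frame, row_index)

    def add_virtual(self, name, expr):
        # Store only the expression; no array of results is allocated.
        self.virtual[name] = expr

    def value(self, name, i):
        if name in self.virtual:
            return self.virtual[name](self, i)   # computed on demand
        return self.columns[name][i]

df = TinyFrame(col1=[1, 2, 3], col2=[10, 20, 30])
# A "virtual column": costs a closure, not a third list in memory.
df.add_virtual("ratio", lambda f, i: f.value("col2", i) / f.value("col1", i))
```

Reading df.value("ratio", i) recomputes the expression each time; that trade of CPU for memory is what lets a derived column on billions of rows cost essentially nothing to create.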


Topics

  1. Backend (0.34)
  2. Coding (0.18)
  3. Database (0.16)

Similar Articles

Reducing memory usage in pandas with smaller datatypes

By Medium - 2021-03-15

Managing large datasets with pandas is a pretty common issue. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library mentioned…
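The downcasting idea this article describes (pandas exposes it via astype and pd.to_numeric with downcast=...) can be illustrated with the stdlib array module; the variable names and data here are hypothetical:

```python
from array import array

values = list(range(256))        # every value fits in one unsigned byte

wide = array("q", values)        # 8-byte signed ints, like an int64 column
narrow = array("B", values)      # 1-byte unsigned ints, like a uint8 column

# Same values, roughly 1/8 of the buffer: per-item size drops 8x.
saving = wide.itemsize / narrow.itemsize
```

Choosing the narrowest dtype that still holds the column's value range is the whole trick; the risk to watch for is silent overflow if future data exceeds that range.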

15 Essential Steps To Build Reliable Data Pipelines

By Medium - 2020-12-01

If I learned anything from working as a data engineer, it is that practically any data pipeline fails at some point. Broken connection, broken dependencies, data arriving too late, or some external…