How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020

By Spotify Engineering - 2021-02-11

Description

Spotify’s official technology blog

Summary

  • In this post we’ll discuss how Spotify optimized and sped up elements of our largest Dataflow job, Wrapped 2019, for Wrapped 2020, using a technique called Sort Merge Bucket (SMB) join (see the sketch after this summary).
  • Shuffle is the core building block for many big data transforms, such as a join, GroupByKey, or other reduce operations.
  • Additionally, the SMB read transform must match up bucket files across partitions as well as across different sources, while ensuring that the CoGbkResult output correctly groups data from all partitions of a source under the same TupleTag key.
  • As a nice side effect of the data being bucketed and sorted by key, we observed a ~50% reduction in storage from better compression, due to the colocation of similar records.
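To make the technique concrete: the core idea behind SMB is to pay the shuffle cost once at write time. Each source is hash-partitioned into the same fixed number of buckets by join key, and each bucket file is written sorted by that key. A later join can then stream bucket b of one source against bucket b of the other in lockstep, merging them with no shuffle at all. The following is a minimal, self-contained Python sketch of that idea; the bucket count and the names bucket_of, write_smb_source, and smb_join are illustrative assumptions, not Spotify’s Scio/Beam API.

```python
# Minimal sketch of a Sort Merge Bucket (SMB) join (illustrative, not Scio/Beam).
import hashlib
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative; production jobs use far more buckets


def bucket_of(key: str) -> int:
    """Deterministic hash-bucket assignment; must be identical across sources."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS


def write_smb_source(records):
    """'Write side': shuffle once -- bucket records by key hash, sort each bucket."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    return {b: sorted(rows) for b, rows in buckets.items()}


def merge_join_bucket(left, right):
    """Stream two sorted bucket files in lockstep; no shuffle needed at read time."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather all values sharing this key on each side,
            # analogous to how a CoGbkResult groups values per TupleTag.
            li = i
            while i < len(left) and left[i][0] == lk:
                i += 1
            rj = j
            while j < len(right) and right[j][0] == lk:
                j += 1
            yield lk, [v for _, v in left[li:i]], [v for _, v in right[rj:j]]


def smb_join(left_source, right_source):
    """Bucket b of one source only ever merges with bucket b of the other."""
    for b in range(NUM_BUCKETS):
        yield from merge_join_bucket(left_source.get(b, []), right_source.get(b, []))


if __name__ == "__main__":
    plays = write_smb_source([("user2", "songB"), ("user1", "songA"), ("user1", "songC")])
    users = write_smb_source([("user1", "US"), ("user2", "SE")])
    for key, play_vals, user_vals in smb_join(plays, users):
        print(key, play_vals, user_vals)
```

Because both sources use the same hash function and bucket count, matching keys are guaranteed to land in the same bucket index on each side, so the join never moves data between workers. The same colocation of similar, sorted records is what drives the compression win noted above.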

Topics

  1. Backend (0.4)
  2. UX (0.24)
  3. Database (0.19)

Similar Articles

15 Essential Steps To Build Reliable Data Pipelines

By Medium - 2020-12-01

If I learned anything from working as a data engineer, it is that practically any data pipeline fails at some point. Broken connection, broken dependencies, data arriving too late, or some external…

Microsoft Azure ADF - Dynamic Pipelines

By SQLServerCentral - 2021-03-09

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and…