How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020

By Spotify Engineering - 2021-02-11

Description

Spotify’s official technology blog

Summary

  • In this post we’ll discuss how Spotify optimized and sped up elements of our largest Dataflow job, Wrapped 2019, for Wrapped 2020, using a technique called Sort Merge Bucket (SMB) join (see the sketch after this summary).
  • Shuffle is the core building block for many big data transforms, such as a join, GroupByKey, or other reduce operations.
  • Additionally, the SMB read transform must match up bucket files across partitions as well as across different sources, while ensuring that the CoGbkResult output correctly groups data from all partitions of a source under the same TupleTag key.
  • As a nice side effect of the data being bucketed and sorted by key, we observed a ~50% reduction in storage from better compression, due to the colocation of similar records.
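To make the technique concrete: the core idea behind SMB is to pay the shuffle cost once at write time. Each source is hash-partitioned into the same fixed number of buckets by join key, and each bucket file is written sorted by that key. A later join can then stream bucket b of one source against bucket b of the other in lockstep, merging them with no shuffle at all. The following is a minimal, self-contained Python sketch of that idea; the bucket count and the names bucket_of, write_smb_source, and smb_join are illustrative assumptions, not Spotify’s Scio/Beam API.

```python
# Minimal sketch of a Sort Merge Bucket (SMB) join (illustrative, not Scio/Beam).
import hashlib
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative; production jobs use far more buckets


def bucket_of(key: str) -> int:
    """Deterministic hash-bucket assignment; must be identical across sources."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS


def write_smb_source(records):
    """'Write side': shuffle once -- bucket records by key hash, sort each bucket."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    return {b: sorted(rows) for b, rows in buckets.items()}


def merge_join_bucket(left, right):
    """Stream two sorted bucket files in lockstep; no shuffle needed at read time."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather all values sharing this key on each side,
            # analogous to how a CoGbkResult groups values per TupleTag.
            li = i
            while i < len(left) and left[i][0] == lk:
                i += 1
            rj = j
            while j < len(right) and right[j][0] == lk:
                j += 1
            yield lk, [v for _, v in left[li:i]], [v for _, v in right[rj:j]]


def smb_join(left_source, right_source):
    """Bucket b of one source only ever merges with bucket b of the other."""
    for b in range(NUM_BUCKETS):
        yield from merge_join_bucket(left_source.get(b, []), right_source.get(b, []))


if __name__ == "__main__":
    plays = write_smb_source([("user2", "songB"), ("user1", "songA"), ("user1", "songC")])
    users = write_smb_source([("user1", "US"), ("user2", "SE")])
    for key, play_vals, user_vals in smb_join(plays, users):
        print(key, play_vals, user_vals)
```

Because both sources use the same hash function and bucket count, matching keys are guaranteed to land in the same bucket index on each side, so the join never moves data between workers. The same colocation of similar, sorted records is what drives the compression win noted above.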

Topics

  1. Backend (0.4)
  2. UX (0.24)
  3. Database (0.19)

Similar Articles

15 Essential Steps To Build Reliable Data Pipelines

By Medium - 2020-12-01

If I learned anything from working as a data engineer, it is that practically any data pipeline fails at some point. Broken connection, broken dependencies, data arriving too late, or some external…

Microsoft Azure ADF - Dynamic Pipelines

By SQLServerCentral - 2021-03-09

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and…