Description
Spotify’s official technology blog
Summary
- In this post we’ll discuss how Spotify optimized and sped up elements from our largest Dataflow job, Wrapped 2019, for Wrapped 2020 using a technique called Sort Merge Bucket (SMB) join.
- Introduction Shuffle is the core building block for many big data transforms, such as a join, GroupByKey, or other reduce operations.
- Additionally, it must match up bucket files across partitions as well as across different sources, while ensuring that the CoGbkResult output correctly groups data from all partitions of a source into the same TupleTag key.
- As a nice side effect of data being bucketed and sorted by key, we observed ~50% reduction in storage from better compression due to collocation of similar records.