Spark it up a notch. Nitty-gritty details of Apache Spark

By Medium - 2021-03-13

Description

I’ve spent about a year learning and implementing the different subtleties associated with Spark. In this series, starting with this article, I’m going to attempt to document the different scenarios…

Summary

I’ve spent about a year learning and implementing the different subtleties associated with Spark.
Transformations are lazy by nature: Spark keeps track of which transformation is called on which record (using the DAG) and executes them only when an action is called on the data (for example, printing the top 5 lines of the dataset).
Note here that transformations return new RDDs since RDDs are immutable.
Wide Transformation: all the elements required to compute the records in a single partition may reside in many partitions of the parent RDD.
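
The three points in this summary (lazy transformations, immutable RDDs, and wide dependencies) can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark itself: ToyRDD, to_pair, and the CRC-based partitioner are hypothetical names invented for illustration, and the method names only loosely mirror PySpark's RDD.map, RDD.reduceByKey, and RDD.collect.

```python
import zlib
from functools import reduce


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD; this is not Spark's API."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions      # source data, never mutated
        self._lineage = tuple(lineage)     # recorded steps, not yet executed

    def _extend(self, step):
        # Transformations return a NEW ToyRDD; the parent is left unchanged.
        return ToyRDD(self._partitions, self._lineage + (step,))

    def map(self, fn):
        # Narrow: each output partition is built from exactly one parent partition.
        return self._extend(lambda parts: [[fn(x) for x in p] for p in parts])

    def reduce_by_key(self, fn):
        # Wide: values for one key may sit in many parent partitions,
        # so records are first regrouped ("shuffled") by key.
        def step(parts):
            shuffled = [{} for _ in parts]
            for part in parts:
                for key, value in part:
                    # Assumed partitioner: a deterministic hash of the key.
                    target = zlib.crc32(str(key).encode()) % len(parts)
                    shuffled[target].setdefault(key, []).append(value)
            return [[(k, reduce(fn, vs)) for k, vs in bucket.items()]
                    for bucket in shuffled]
        return self._extend(step)

    def collect(self):
        # Action: only now is the recorded lineage actually executed.
        parts = self._partitions
        for step in self._lineage:
            parts = step(parts)
        return [x for p in parts for x in p]


calls = []  # records when the mapped function really runs


def to_pair(word):
    calls.append(word)
    return (word, 1)


words = ToyRDD([["spark", "rdd"], ["spark", "dag"], ["rdd", "spark"]])
pairs = words.map(to_pair)                          # nothing runs yet
counts = pairs.reduce_by_key(lambda a, b: a + b)    # still nothing runs
calls_before_action = len(calls)                    # transformations are lazy
result = dict(counts.collect())                     # the action triggers execution
```

In this sketch, to_pair is not invoked until collect() is called, words is untouched by the later transformations, and the count for "spark" has to pull values from all three parent partitions, which is what makes the reduce step wide.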

Topics

Backend (0.26)
Database (0.14)
Machine_Learning (0.07)

Similar Articles

How to get started with the new Graph Data Science Library of Neo4j

By Medium - 2020-11-02

Big changes to the way graph data science is managed in Neo4j present big opportunity

7 Ways Your Data Is Telling You It’s a Graph

By Neo4j Graph Database Platform - 2015-12-23

Watch (or read) Senior Project Manager Karen Lopez’s GraphConnect presentation on the signs that your data is actually a graph and needs a graph database.

Data Science Learning Roadmap for 2021

By freeCodeCamp.org - 2021-01-12

Although nothing really changes but the date, a new year fills everyone with the hope of starting things afresh. If you add in a bit of planning, some well-envisioned goals, and a learning roadmap, yo ...

Choosing cloud-native Bigtable to save data warehouse costs

By Google Cloud Blog - 2021-01-22

See how ecommerce company Ricardo.ch chose Cloud Bigtable as its database to complement its data warehouse and save costs with scalability.

Google Cloud DLP can modify data to protect it

By Google Cloud Blog - 2021-03-12

Among the best ways to prevent data loss are to modify, delete, or never collect the data in the first place.

15 Essential Steps To Build Reliable Data Pipelines

By Medium - 2020-12-01

If I learned anything from working as a data engineer, it is that practically any data pipeline fails at some point. Broken connection, broken dependencies, data arriving too late, or some external…

Feedback

Let us know what you think about this newsletter, or if you want to add new topics or keywords.

contact@velasticity.com

Bookmarks

Latest Readings in NLP

By Medium - 2021-03-16

Stunning Tables using bokeh and svg

By KDnuggets - 2021-03-16

Natural Language Processing Pipelines, Explained

By KDnuggets - 2021-03-16

2019 Best Masters in Data Science and Analytics – Online

By Medium - 2021-03-16

Collecting, transforming and cleaning JSTOR metadata in Python

By Medium - 2021-03-13

Storage & Compute for Machine Learning

By congress - 2021-03-15

Text - H.R.1019 - 117th Congress (2021-2022): E-BIKE Act

By KDnuggets - 2021-03-16

Top 10 Best Podcasts on AI, Analytics, Data Science, Machine Learning

By datasciencecentral - 2021-03-16

How to become a Digital Strategy Leader

By datasciencecentral - 2021-03-16

All about Use Of Data Science

By Medium - 2021-02-15

10 Hyper-parameter Tuning Libraries

By GitHub - 2021-03-16

doc.noun_chunks is not supported for Chinese language, how to figure this out? · Issue #7436 · explosion/spaCy

By Medium - 2021-03-15

Gaussian Process Regression From First Principles

By datasciencecentral - 2021-03-16

5 tasks You Can Automate in Business Intelligence (BI) and Analytics

By GitHub - 2021-03-15

Avoiding accidental errors with sanity checks · Discussion #5053 · allenai/allennlp

By KDnuggets - 2021-03-14

Emotion and Sentiment Analysis: A Practitioner’s Guide to NLP

By Medium - 2021-03-16

How Data Science Can Give Further Understanding on Urban Poverty

By Electronic Frontier Foundation - 2021-03-03

Google’s FLoC Is a Terrible Idea

By datasciencecentral - 2021-03-16

7 Key Benefits of Integrating Asset Monitoring in the Water Sector

By Medium - 2021-03-16

Why Machines Will Never Feel Empathy: A Q&A With MIT’s Sherry Turkle

By datasciencecentral - 2021-03-16

Clustering with Scikit with GIFs

By Coursera - 2021-03-14

Numerical Methods for Engineers

By Medium - 2021-03-16

Introduction to Bootstrapping in Data Science — part

By SearchEnterpriseAI - 2021-03-15

The power and limitations of enterprise AI

By GitHub - 2021-03-14

aajanki/spacy-fi

By Selbstmanagement - 2021-03-14

By KDnuggets - 2021-03-14

Feature Store as a Foundation for Machine Learning

By Medium - 2021-03-11

Lowri Williams on How to Connect Your Academic Training to Real-World Challenges

By huggingface - 2021-03-15

elgeish/wav2vec2-large-xlsr-53-arabic · Hugging Face

By Medium - 2021-03-16

The All-time Best Guides to Data Science Writing

By datasciencecentral - 2021-03-16

Can Gerrymandering Be Ended via Machine Learning?

By KDnuggets - 2021-03-14

Naïve Bayes Algorithm: Everything you need to know

By Medium - 2021-03-09

Weekly Awesome Tricks And Best Practices From Kaggle | Towards Dev

By huggingface - 2021-03-16

Hugging Face – On a mission to solve NLP, one commit at a time.

By datasciencecentral - 2021-03-16

Google is Rethinking its Business – What About You?

By datasciencecentral - 2021-03-16

Media and Entertainment: How This Industry is Impacted by Big Data

By KDnuggets - 2021-03-14

Introduction to Data Engineering

By KDnuggets - 2021-03-16

Metric Matters, Part 1: Evaluating Classification Models

By datasciencecentral - 2021-03-16

Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy

By datasciencecentral - 2021-03-15

Data Analytics Perks