Learning PySpark

Building and deploying data-intensive applications at scale using Python and Apache Spark

3.95 (184 reviews)
Platform: Udemy
Language: English
Category: Databases
Students: 620
Content: 2.5 hours
Last update: Apr 2018
Regular price: $44.99

What you will learn

Learn about Apache Spark and the Spark 2.0 architecture.

Understand schemas for RDDs, lazy execution, and transformations.

Explore sorting and saving the elements of an RDD.

Build and interact with Spark DataFrames using Spark SQL.

Create and explore various APIs to work with Spark DataFrames.

Learn how to change the schema of a DataFrame programmatically (see the short sketch after this list).

Explore how to aggregate, transform, and sort data with DataFrames.
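
For example, here is a minimal sketch of changing a DataFrame's schema programmatically by casting a string column to an integer; the session name, column names, and data are illustrative, not taken from the course:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

# A toy DataFrame whose 'age' column arrives as a string.
df = spark.createDataFrame([("Alice", "34"), ("Bob", "45")], ["name", "age"])

# Change the schema programmatically by casting the column to an integer type.
df = df.withColumn("age", col("age").cast("int"))
df.printSchema()  # 'age' is now reported as an integer column
```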

Why take this course?

Apache Spark is an open-source distributed engine for querying and processing data. This tutorial provides a brief overview of Spark and its stack, and presents effective, time-saving techniques for leveraging the power of Python in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark.
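
As a minimal sketch of what that setup looks like (assuming pyspark has been installed, e.g. with pip install pyspark; the master and application name are illustrative):

```python
from pyspark.sql import SparkSession

# The SparkSession is the single entry point in Spark 2.0+; it wraps the
# older SparkContext, which is still reachable as spark.sparkContext.
spark = (SparkSession.builder
         .master("local[*]")           # run locally on all available cores
         .appName("learning-pyspark")
         .getOrCreate())

sc = spark.sparkContext               # the 'sc' handle used for RDD work
print(spark.version)
```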

You'll learn about the different techniques for collecting data and distinguish between (and understand) the techniques for processing it. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We give examples of reading data from files and from HDFS, and of specifying schemas by reflection or programmatically (in the case of DataFrames). The concept of lazy execution is described, and we outline the various transformations and actions specific to RDDs and DataFrames.
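
A brief sketch of the two schema approaches and of lazy execution; the HDFS path is a made-up placeholder, and the data is illustrative:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Schema by reflection: Spark infers the column types from Row objects.
df_inferred = spark.createDataFrame(
    sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=45)]))

# Schema specified programmatically with StructType.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_explicit = spark.read.csv("hdfs:///data/people.csv", schema=schema)  # illustrative path

# Reads and transformations are lazy; only an action such as .count()
# triggers actual execution.
print(df_explicit.filter(df_explicit.age > 40).count())
```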

Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will have learned how to process data using Spark DataFrames and mastered data collection techniques for distributed data processing.
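
For instance, a minimal sketch of the SQL workflow (the view name and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")

spark.sql("SELECT name, age FROM people WHERE age >= 35 ORDER BY age DESC").show()
```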

About the Author

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.

Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research. During his time in Sydney, he worked as a Data Analyst for Beyond Analysis Australia and as a Senior Data Analyst/Data Scientist for Vodafone Hutchison Australia among others. He has also published scientific papers, attended international conferences, and served as a reviewer for scientific journals.

In 2015, he relocated to Seattle to begin his work for Microsoft. While there, he has worked on numerous projects that involve solving problems in high-dimensional feature spaces.

Content

A Brief Primer on PySpark

The Course Overview
Brief Introduction to Spark
Apache Spark Stack
Spark Execution Process
Newest Capabilities of PySpark 2.0+
Cloning GitHub Repository

Resilient Distributed Datasets

Brief Introduction to RDDs
Creating RDDs
Schema of an RDD
Understanding Lazy Execution
Introducing Transformations – .map(…)
Introducing Transformations – .filter(…)
Introducing Transformations – .flatMap(…)
Introducing Transformations – .distinct(…)
Introducing Transformations – .sample(…)
Introducing Transformations – .join(…)
Introducing Transformations – .repartition(…)
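
To give a flavor of this section, here is a minimal sketch chaining several of these transformations (the data is made up):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
lines = sc.parallelize(["to be or not to be", "that is the question"])

words = (lines
         .flatMap(lambda line: line.split())  # split each line into words
         .map(lambda w: (w, 1))               # pair every word with a 1
         .filter(lambda kv: len(kv[0]) > 2)   # keep words longer than two letters
         .distinct())                         # drop duplicate pairs

# All of the above are lazy transformations; this action runs the job.
print(words.take(5))
```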

Resilient Distributed Datasets and Actions

Introducing Actions – .take(…)
Introducing Actions – .collect(…)
Introducing Actions – .reduce(…) and .reduceByKey(…)
Introducing Actions – .count()
Introducing Actions – .foreach(…)
Introducing Actions – .aggregate(…) and .aggregateByKey(…)
Introducing Actions – .coalesce(…)
Introducing Actions – .combineByKey(…)
Introducing Actions – .histogram(…)
Introducing Actions – .sortBy(…)
Introducing Actions – Saving Data
Introducing Actions – Descriptive Statistics
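
A brief sketch of several methods covered in this section (strictly speaking, .reduceByKey(…) and .coalesce(…) are transformations rather than actions, but the course groups them here; the data and output path are illustrative):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
nums = sc.parallelize([1, 2, 3, 4, 5], 2)

print(nums.count())                      # 5
print(nums.reduce(lambda a, b: a + b))   # 15
print(nums.take(3))                      # [1, 2, 3]
print(nums.histogram([0, 3, 6]))         # ([0, 3, 6], [2, 3])
print(nums.stats())                      # count, mean, stdev, min, max

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # [('a', 4), ('b', 2)], order may vary

nums.coalesce(1).saveAsTextFile("/tmp/nums-out")  # illustrative output path
```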

DataFrames and Transformations

Introduction
Creating DataFrames
Specifying Schema of a DataFrame
Interacting with DataFrames
The .agg(…) Transformation
The .sql(…) Transformation
Creating Temporary Tables
Joining Two DataFrames
Performing Statistical Transformations
The .distinct(…) Transformation
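
A small sketch of the operations named in this section: .agg(…), a temporary table queried with spark.sql(…), and a join of two DataFrames (column names and data are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("north", 10.0), ("south", 7.5), ("north", 3.0)], ["region", "amount"])
regions = spark.createDataFrame(
    [("north", "Alice"), ("south", "Bob")], ["region", "manager"])

# Aggregate with .agg(...), then join the two DataFrames on 'region'.
totals = (sales.groupBy("region")
               .agg(F.sum("amount").alias("total"))
               .join(regions, on="region"))
totals.show()

# The same aggregation through a temporary table and spark.sql(...).
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```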

Data Processing with Spark DataFrames

Schema Changes
Filtering Data
Aggregating Data
Selecting Data
Transforming Data
Presenting Data
Sorting DataFrames
Saving DataFrames
Pitfalls of UDFs
Repartitioning Data
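
A compact sketch touching most of these steps, including the classic UDF pitfall (a Python UDF is opaque to the Catalyst optimizer and serializes every row to a Python worker, so built-in functions are preferred when one exists); names, data, and paths are illustrative:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)], ["name", "age"])

result = (df.filter(F.col("age") > 30)                              # filter rows
            .withColumn("decade", (F.col("age") / 10).cast("int"))  # transform / change schema
            .orderBy(F.col("age").desc()))                          # sort
result.write.mode("overwrite").parquet("/tmp/people")               # save (illustrative path)

# UDF pitfall: the Python lambda round-trips each row through a Python worker.
shout = F.udf(lambda s: s.upper(), StringType())
df.select(shout("name")).show()     # slow: Python serialization per row
df.select(F.upper("name")).show()   # fast: the built-in stays in the JVM
```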


Reviews

Maria
October 23, 2021
It's a good course, very quick and condensed. I wish we had learned how to set up a Spark context/session and the difference between them. The course starts using a Spark context (sc) without having introduced it.
John
August 10, 2021
Halfway update: for illustrating concepts this is OK. I'm learning enough that I'll likely finish the course. Also, this section is better at brief introductions. Not including the sample data in the git repo was a mistake; at this point, the available data from whatever website () has a different number of records and doesn't match the video examples (not a deal breaker, just irritating). I'm going to have to find a separate class/book that covers the architecture of Spark, how pyspark works with py4j, and how to do logging from the driver and workers. I'd specifically like a walkthrough of how to troubleshoot stack traces; I find it difficult to isolate what the original Python code was when a transform fails. Original review: burning the first 30 seconds of this section's videos repeatedly explaining "this video is an overview, and you can skip it if you want" was unnecessary. Thirty seconds doesn't sound like much, so think of it as 25% of the video's content. As for the intro message, that would be better covered in the title/video description.
Luis
March 16, 2021
50% of the video duration is just agenda description: "in this video we will do ABC...", "in the next video we will do DEF...", "in this video we will do DEF..."
Admiral
January 19, 2021
Having a little prior experience with pyspark, the videos helped me expand my basic knowledge and also cleared up some doubts I had!
Gentian
January 15, 2021
The git repo does not even closely resemble the documents presented in the course. Also, too much time is spent on RDDs when most of modern Spark is DataFrames. Basically, for me it was mostly a waste of time.
Nambi
November 27, 2020
Really worthwhile course. I would recommend this site for the best online courses. Truly value for your money and time.
Pragya
September 20, 2020
Overall, the content is not high quality. The resources given were not correct; thanks to that one comment, I could find the actual code.
Harshita
July 29, 2020
DO NOT join this course. It has no quality content. I gained zero understanding from the initial sections. I am requesting a refund for the same.
Sudheer
July 22, 2020
The course was good, but I felt it was very high-level, and with each video lasting less than 3 minutes, it was all in bits and pieces.
Pranjal
July 19, 2020
Brief and concise course, which could have added some self-practice exercises (with answers).
Sam
June 28, 2020
The content is very elementary... a more descriptive title would be "A fast tour of pyspark basics". Also, the GitHub repo does not exactly match the course material.
Gertjan
September 18, 2019
This introduction to Spark is too high-level. The examples and explanations of transformation and action methods are great. The provided notebook will also help you. It is different from the one explained in the videos, which is great because you must work through it with different data.
Paloma
July 5, 2019
Yes, all the videos are quite interesting. In my opinion, some of them seem quite short and could be a few minutes longer.
Abhishek
May 12, 2019
Good for neophytes to understand precise information, although there could be many more of the use cases we encounter in practice. But for a start, I recommend it to everyone.
Dipin
January 31, 2019
The content was too basic, and the course lacked assignments and quizzes to test your knowledge. The course does not discuss streaming data or the machine learning API. It's good for someone who is just a beginner with pyspark.


Udemy ID: 1594214
Course created: 3/13/2018
Course indexed: 6/22/2020
Course submitted by: Bot