Taming Big Data with Apache Spark and Python - Hands On!

PySpark tutorial with 20+ hands-on examples of analyzing large data sets on your desktop or on Hadoop with Python!

Rating: 4.46 (15,738 reviews)
Platform: Udemy
Language: English
Category: Data Science
Students: 98,128
Content: 7 hours
Last update: Mar 2024
Regular price: $109.99

What you will learn

Use DataFrames and Structured Streaming in Spark 3

Use the MLlib machine learning library to answer common data mining questions

Understand how Spark Streaming lets you process continuous streams of data in real time

Frame big data analysis problems as Spark problems

Use Amazon's Elastic MapReduce service to run your job on a cluster with Hadoop YARN

Install and run Apache Spark on a desktop computer or on a cluster

Use Spark's Resilient Distributed Datasets to process and analyze large data sets across many CPUs

Implement iterative algorithms such as breadth-first-search using Spark

Understand how Spark SQL lets you work with structured data

Tune and troubleshoot large jobs running on a cluster

Share information between nodes on a Spark cluster using broadcast variables and accumulators

Understand how the GraphX library helps with network analysis problems

Description

New! Updated for Spark 3, more hands-on exercises, and a stronger focus on DataFrames and Structured Streaming.

"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark, and specifically PySpark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think.

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course. You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.


  • Learn the concepts of Spark's DataFrames and Resilient Distributed Datasets

  • Develop and run Spark jobs quickly using Python and PySpark

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes' worth of information – in the cloud – in a matter of minutes.

This course uses the familiar Python programming language; if you'd rather use Scala to get the best performance out of Spark, see my "Apache Spark with Scala - Hands On with Big Data" course instead.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you'd like in the process! We'll analyze a social graph of superheroes, and learn who the most "popular" superhero is – and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to The Incredible Hulk? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Wrangling big data with Apache Spark is an important skill in today's technical world. Enroll now!


  • "I studied 'Taming Big Data with Apache Spark and Python' with Frank Kane, and it helped me build a great platform for Big Data as a Service for my company. I recommend the course!" - Cleuton Sampaio De Melo Jr.

Content

Getting Started with Spark

Introduction
How to Use This Course
Udemy 101: Getting the Most From This Course
[Activity] Getting Set Up: Installing Python, a JDK, Spark, and its Dependencies.
[Activity] Installing the MovieLens Movie Rating Dataset
[Activity] Run your first Spark program! Ratings histogram example.

Spark Basics and Simple Examples

What's new in Spark 3?
Introduction to Spark
The Resilient Distributed Dataset (RDD)
Ratings Histogram Walkthrough
Key/Value RDDs, and the Average Friends by Age Example
[Activity] Running the Average Friends by Age Example
Filtering RDDs, and the Minimum Temperature by Location Example
[Activity] Running the Minimum Temperature Example, and Modifying it for Maximums
[Activity] Running the Maximum Temperature by Location Example
[Activity] Counting Word Occurrences using flatMap()
[Activity] Improving the Word Count Script with Regular Expressions
[Activity] Sorting the Word Count Results
[Exercise] Find the Total Amount Spent by Customer
[Exercise] Check your Results, and Now Sort them by Total Amount Spent.
Check Your Sorted Implementation and Results Against Mine.

Advanced Examples of Spark Programs

[Activity] Find the Most Popular Movie
[Activity] Use Broadcast Variables to Display Movie Names Instead of ID Numbers
Find the Most Popular Superhero in a Social Graph
[Activity] Run the Script - Discover Who the Most Popular Superhero is!
Superhero Degrees of Separation: Introducing Breadth-First Search
Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
[Activity] Superhero Degrees of Separation: Review the Code and Run it
Item-Based Collaborative Filtering in Spark, cache(), and persist()
[Activity] Running the Similar Movies Script using Spark's Cluster Manager
[Exercise] Improve the Quality of Similar Movies

Running Spark on a Cluster

Introducing Elastic MapReduce
[Activity] Setting up your AWS / Elastic MapReduce Account and Setting Up PuTTY
Partitioning
Create Similar Movies from One Million Ratings - Part 1
[Activity] Create Similar Movies from One Million Ratings - Part 2
Create Similar Movies from One Million Ratings - Part 3
Troubleshooting Spark on a Cluster
More Troubleshooting, and Managing Dependencies

SparkSQL, DataFrames, and DataSets

Introducing SparkSQL
Executing SQL commands and SQL-style functions on a DataFrame
Using DataFrames instead of RDDs

Other Spark Technologies and Libraries

Introducing MLlib
[Activity] Using MLlib to Produce Movie Recommendations
Analyzing the ALS Recommendations Results
Using DataFrames with MLlib
Spark Streaming
[Activity] Structured Streaming in Python
GraphX

You Made It! Where to Go from Here.

Learning More about Spark and Data Science
Bonus Lecture: More courses to explore!


Reviews

Bryce
November 10, 2023
I learned so much about RDDs and DFs! Will be very applicable at my full time job when I leverage PySpark code
HEMANT
October 11, 2023
Good course! Explains the topics in detail and explains the code line by line. Very good course for beginners.
Daniel
October 9, 2023
This is a good course, but it could be much more helpful if the teacher guided us through each detail of what happens in each function. (I have only just finished section 2, though, so I might be wrong about this.)
Deepak
October 7, 2023
Frank is a good instructor, but this course somehow assumes that you are familiar with Spark. The API in the code can be confusing, and some of the code also throws errors. There is also no section on Databricks, which is a good platform to work with Spark. Overall it is a good course, but only if you have a prior idea of Spark and are looking for a refresher...
Ritu
October 5, 2023
The course was good and engaging. The only issue is that it seems a little outdated in places; otherwise, it is the best course on Udemy for Spark.
Harshavardhan
September 5, 2023
I am completely new to PySpark, so this course is very useful for understanding the basics. Also, the script resources helped save a lot of time.
Luis
September 5, 2023
The course is not that good. The instructor just pastes the code and explains it; there is nothing you can do if you do not have previous experience. Also, some code is run in the shell and some in Spyder, and he does not explain why the difference.
Sudhanshu
August 21, 2023
The course content is old and not covered in detail. Some free Udemy courses are better than this. You can find the content covered in this course on YouTube or other platforms easily.
Paolo
August 15, 2023
The software setup section could have been done better by including, for example, a description of the setup on linux platforms
Nneoma
August 12, 2023
The instructions to download Spark were wrong. I had to go to a YouTube video to get the correct instructions. But other than that, it was a good video.
samad
August 10, 2023
The content is very good for someone new to Spark; however, please update the installation section on the 6th slide and how the Python path should be added. It took me some time to figure out what the issue was, and I had to consult different websites.
Divyansh
April 6, 2023
It is an amazing course to practice PySpark. I recommend it to anyone who wishes to get hands-on with real-time data.
Sreekanth
April 3, 2023
I bought this course about a year ago on an impulse to learn Apache Spark, but never touched it. Once I started the course, though, I couldn't take my eyes off it. I wanted to finish it in a single sitting. I know that's not possible, but the course is engaging, and I want to complete it as quickly as possible and apply this knowledge to some complex problems.
Kuntal
April 1, 2023
Excellent content and process-oriented approach. The structure and layer-by-layer deep dive into topics are also good. A good number of examples are covered as well.
Bogsatchio
March 10, 2023
This course is pretty outdated as to methods. The content itself is decent, but the IDEs used as well as the presentation of concepts were executed poorly.

Udemy ID: 622414
Course created date: 9/25/2015
Course indexed date: 8/7/2019
Course submitted by: Bot