Apache Spark 3 for Data Engineering & Analytics with Python

Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja

Rating: 4.47 (569 reviews)
Platform: Udemy
Language: English
Category: Other
Students: 7,123
Content: 8.5 hours
Last update: May 2022
Regular price: $69.99

What you will learn

Learn the Spark Architecture

Learn Spark Execution Concepts

Learn Spark Transformations and Actions using the Structured API

Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API

Learn how to set up your own local PySpark Environment

Learn how to interpret the Spark Web UI

Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution

Learn the RDD (Resilient Distributed Datasets) API (Crash Course)

Learn the Spark DataFrame API (Structured APIs)

Learn Spark SQL

Learn Spark on Databricks

Learn to Visualize (Graphs and Dashboards) Data on Databricks

Why take this course?

The key objectives of this course are as follows:

  • Learn the Spark Architecture

  • Learn Spark Execution Concepts

  • Learn Spark Transformations and Actions using the Structured API

  • Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API

  • Learn how to set up your own local PySpark Environment

  • Learn how to interpret the Spark Web UI

  • Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution

  • Learn the RDD (Resilient Distributed Datasets) API (Crash Course)

    • RDD Transformations

    • RDD Actions

  • Learn the Spark DataFrame API (Structured APIs)

    • Create Schemas and Assign DataTypes

    • Read and Write Data using the DataFrame Reader and Writer

    • Read Semi-Structured Data such as JSON

    • Create and Add New Data Columns to the DataFrame using Expressions

    • Filter the DataFrame using the "Filter" and "Where" Transformations

    • Ensure that the DataFrame has unique rows

    • Detect and Drop Duplicates

    • Augment the DataFrame by Adding New Rows

    • Combine 2 or More DataFrames

    • Order the DataFrame by Specific Columns

    • Rename and Drop Columns from the DataFrame

    • Clean the DataFrame by Detecting and Removing Missing or Bad Data

    • Create User-Defined Spark Functions

    • Read from and Write to Parquet Files

    • Partition the DataFrame and Write to Parquet File

    • Aggregate the DataFrame using Spark SQL functions (count, countDistinct, max, min, sum, sumDistinct, avg)

    • Perform Aggregations with Grouping

  • Learn Spark SQL and Databricks

    • Create a Databricks Account

    • Create a Databricks Cluster

    • Create Databricks SQL and Python Notebooks

    • Learn Databricks shortcuts

    • Create Databases and Tables using Spark SQL

    • Use DML, DQL, and DDL with Spark SQL

    • Use Spark SQL Functions

    • Learn the differences between Managed and Unmanaged Tables

    • Read CSV Files from the Databricks File System

    • Learn to write Complex SQL

    • Create Visualisations with Databricks

    • Create a Databricks Dashboard
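
As a rough illustration of several DataFrame objectives listed above (an explicit schema, duplicate removal, missing-data cleanup, a user-defined function, and a grouped aggregation), here is a minimal sketch. The column names `region` and `amount`, the helper `normalize_region`, and the function `summarize_sales` are assumptions made for illustration only; they are not taken from the course material. Running `summarize_sales` needs a local PySpark installation.

```python
def normalize_region(raw):
    """Trim and title-case a region name; return None for missing or blank input."""
    if raw is None:
        return None
    cleaned = raw.strip()
    return cleaned.title() if cleaned else None


def summarize_sales(rows):
    """Aggregate hypothetical (region, amount) tuples per region with the DataFrame API.

    Amounts must be Python floats (or None) to match DoubleType.
    PySpark is imported lazily so normalize_region can be used without Spark.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = (
        SparkSession.builder.master("local[*]").appName("DataFrameSketch").getOrCreate()
    )
    try:
        # Explicit schema instead of relying on inference
        schema = StructType([
            StructField("region", StringType(), True),
            StructField("amount", DoubleType(), True),
        ])
        df = spark.createDataFrame(rows, schema)

        # Wrap the plain Python helper as a user-defined function
        normalize_udf = F.udf(normalize_region, StringType())

        result = (
            df.dropDuplicates()                     # keep unique rows only
              .dropna(subset=["amount"])            # drop rows with a missing amount
              .withColumn("region", normalize_udf(F.col("region")))
              .where(F.col("region").isNotNull())   # drop rows with a blank region
              .groupBy("region")
              .agg(F.count("*").alias("orders"), F.sum("amount").alias("total"))
              .orderBy(F.col("total").desc())
        )
        return [row.asDict() for row in result.collect()]
    finally:
        spark.stop()
```

A call such as `summarize_sales([(" north ", 10.0), ("North", 5.0), (None, 3.0)])` would group the two "North" rows together and discard the row without a region.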


The Python Spark projects that we are going to do together:

Sales Data

  • Create a Spark Session

  • Read a CSV file into a Spark Dataframe

  • Learn to Infer a Schema

  • Select data from the Spark Dataframe

  • Produce analytics that show the top sales orders per Region and Country


Convert Fahrenheit to Degrees Centigrade

  • Create a Spark Session

  • Read and Parallelize data using the Spark Context into an RDD

  • Create a Function to Convert Fahrenheit to Degrees Centigrade

  • Use the Map Function to convert data contained within an RDD

  • Filter temperatures greater than or equal to 13 degrees Celsius
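
The steps of this project can be sketched roughly as follows. This is only an illustration under assumptions: the function names `fahrenheit_to_celsius` and `convert_and_filter`, the application name, and the sample input are hypothetical, and the course's own code may differ. The Spark part needs a local PySpark installation.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5.0 / 9.0


def convert_and_filter(temps_f, threshold_c=13.0):
    """Convert Fahrenheit readings on an RDD and keep those >= threshold_c.

    PySpark is imported lazily so fahrenheit_to_celsius can be used without Spark.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("TempConversion").getOrCreate()
    )
    try:
        # Distribute the local list as an RDD through the Spark Context
        rdd = spark.sparkContext.parallelize(list(temps_f))
        return (
            rdd.map(fahrenheit_to_celsius)              # transformation: convert each value
               .filter(lambda c: c >= threshold_c)      # transformation: keep warm readings
               .collect()                               # action: bring results to the driver
        )
    finally:
        spark.stop()
```

For example, `convert_and_filter([59.0, 50.0, 68.0])` would keep the readings that convert to 15 and 20 degrees Celsius and drop the one that converts to 10.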


XYZ Research

  • Create a set of RDDs that hold Research Data

  • Use the union transformation to combine RDDs

  • Learn to use the subtract transformation to remove values from an RDD

  • Use the RDD API to answer the following questions

    • How many research projects were initiated in the first three years?

    • How many projects were completed in the first year?

    • How many projects were completed in the first two years?


Sales Analytics

  • Create the Sales Analytics DataFrame from a set of CSV Files

  • Prepare the DataFrame by applying a Structure

  • Remove bad records from the DataFrame (Cleaning)

  • Generate New Columns from the DataFrame

  • Write a Partitioned DataFrame to a Parquet Directory

  • Answer the following questions and create visualizations using Seaborn and Matplotlib

    • What was the best month in sales?

    • What city sold the most products?

    • What time should the business display advertisements to maximize the likelihood of customers buying products?

    • What products are often sold together in the state "NY"?
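
One possible way to approach the last question (products often sold together in a given state) is to count product pairs per order. The sketch below uses the RDD API for brevity rather than the DataFrame API that the project description mentions, and it assumes a hypothetical input layout of `(order_id, state, product)` tuples; the function names `product_pairs` and `top_pairs` are also illustrative and not from the course. The Spark part needs a local PySpark installation.

```python
from itertools import combinations


def product_pairs(products):
    """Return every unordered pair of distinct products in one order.

    Products are de-duplicated and sorted so the same pair always
    produces the same key, regardless of the order of the input.
    """
    unique = sorted(set(products))
    return list(combinations(unique, 2))


def top_pairs(order_lines, state="NY", limit=5):
    """Count product pairs bought in the same order within one state.

    order_lines: iterable of (order_id, state, product) tuples (assumed layout).
    PySpark is imported lazily so product_pairs can be used without Spark.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("PairsSketch").getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(list(order_lines))
        return (
            rdd.filter(lambda line: line[1] == state)       # keep one state
               .map(lambda line: (line[0], line[2]))        # (order_id, product)
               .groupByKey()                                # products per order
               .flatMap(lambda kv: product_pairs(kv[1]))    # pairs within each order
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b)             # count each pair
               .takeOrdered(limit, key=lambda kv: -kv[1])   # most frequent first
        )
    finally:
        spark.stop()
```

The result would be a list of `((product_a, product_b), count)` tuples, which could then be passed to Seaborn or Matplotlib for plotting.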

Technology Spec

  1. Python

  2. Jupyter Notebook

  3. Jupyter Lab

  4. PySpark (Spark with Python)

  5. Pandas

  6. Matplotlib

  7. Seaborn

  8. Databricks

  9. SQL

Reviews

Can
May 10, 2023
Great course with a lot of practical application! I was also able to pass the Databricks Certified Developer for Apache Spark 3.0 exam upon finishing this course. Can highly recommend to anyone aiming for the certification.
Gergely
March 19, 2023
There's a bit too much verbal repetition of fundamental things for my taste, like import statements, and other code snippets. Instead of such fillers the instructor could have explained some more about concepts and techniques, pros and cons, or even some of his decisions. The lessons could be shorter and yet much richer by cutting out or replacing the unnecessary repetition. Our time is precious after all. It is a good course anyway. Thanks!
Chris
February 22, 2023
Good content. Some stuff outdated or done incorrectly and you have to go back and fix it later in the course.
Lalit
August 3, 2022
Good course for absolute beginners, covers basics of Spark and RDD in depth. Was expecting more detail or depth for a course of almost 9 hours.
Dron
July 8, 2022
Great Course, the material is great and very greatly delivered, the entire work from the concepts to finally implementing them in a real world use case is great
Nick
April 21, 2022
Good intro course, goes over basics of use cases, general architecture and APIs. Could definitely use more explanation in underlying concepts and could contain more information on more intermediate/advanced topics. Installation videos for Mac were not very helpful and had to figure that out myself, which took a while. Instructor does not seem very responsive to other comments.
Edén
March 25, 2022
The course is very interesting and contains many practical examples. The only weak point is version management, because PySpark has evolved very quickly in recent years, which means old commands do not work in newer versions or vice versa.
Prabhat
January 19, 2022
Very Good Course. Did learn a lot about dataframe and spark sql command. Love project after every concept. Thanks
printf("%s
October 27, 2021
The course is pretty basic. The instructor seems to have a pattern of either not responding to questions or responding in an untimely manner. Differences and use-cases of methods are not always clear. Challenges require the student to step-up and self-learn a bit too much, relative to what is presented in the course, in my opinion. You will learn the basics of Spark, Pyspark, Spark SQL, and Databricks, but this course could be better in my opinion.
Min-Ming
September 23, 2021
Examples are good, but the why is not well explained. Need a lot of self research to get into each concept.
Manoj
September 21, 2021
I think this is a good course. It is at a good pace. I was expecting him to dive into PySpark more and spend time on the tool later.
Parry
August 13, 2021
This is one of the best courses I have come across for Spark on Udemy. I have bought courses by Jose Portilla, Frank Kane, etc. This instructor goes slowly and explains every single line of code and I appreciated that. I read some comments by other course takers that he goes too slow and is beginner level. Well, maybe that is what some people need, like me. If you are an intermediate level user, this may be basic for you, but for someone like me who has never used Spark before this was perfect!
Allan
July 7, 2021
Excellent, the instructor explains clearly and has good content: Apache Spark, PySpark, SQL, Databricks, as well as the basic concepts of these technologies / tools.
Thomas
June 12, 2021
Many steps omitted that are needed to get everything to work. Could use a couple updates. Needed to throw the hadoop dll file in the bin to get the write function to work
Anitha
April 27, 2021
This course is really awesome. First time getting introduced to the spark and tutor is really great in explaining things. He kept the course so simple any person without prior knowledge can understand easily. I recommend this course to all.

Udemy ID: 3592114
Course created date: 10/25/2020
Course indexed date: 11/9/2020
Course submitted by: Bot