Troubleshooting Apache Spark

Quick, simple solutions to common development issues, and debugging techniques for Apache Spark.

Rating: 3.35 (10 reviews)
Platform: Udemy
Language: English
Category: Web Development
Students: 93
Content: 1.5 hours
Last update: Dec 2018
Regular price: $19.99

What you will learn

Solve long-running computation problems by leveraging lazy evaluation in Spark

Avoid memory leaks by understanding the internal memory management of Apache Spark

Fix pipelines that fail to scale out by using partitions

Debug and create user-defined functions that enrich the Spark API

Choose a proper join strategy depending on the characteristics of your input data

Troubleshoot the join APIs: DataFrames or Datasets

Write code that minimizes object creation using the proper API

Troubleshoot real-time pipelines written in Spark Streaming

Description

Apache Spark has been around for quite some time, but do you really know how to solve the development issues and problems you face with it? This course opens up new possibilities and covers many aspects of Apache Spark, some you may know and some you probably never knew existed. If learning Spark and performing tasks on it takes you a lot of time, you cannot leverage its full capabilities and features, and you hit a roadblock in your development journey. Common problems and bugs keep you from optimizing your development process, and you end up looking for techniques that help you avoid pitfalls and common errors during development. With this course you'll learn to implement practical, proven, well-researched techniques to improve particular aspects of Apache Spark.

You need to understand the common problems and issues Spark developers face, collate them, and build simple solutions for them. One way to find common issues is to look at Stack Overflow questions. This is a high-quality troubleshooting course that highlights issues developers face at different stages of application development and provides simple, practical solutions to them. Beyond supplying solutions to these problems and challenges, the course also focuses on discovering new possibilities with Apache Spark. By the end of this course, you will be able to solve your Spark problems without any hassle.

About the Author

Tomasz Lelek is a software engineer who programs mostly in Java and Scala. He is a fan of microservice architectures and functional programming, and he dedicates considerable time and effort to getting better every day. He is passionate about nearly everything associated with software development and believes that we should always consider different solutions and approaches before solving a problem. Recently he spoke at conferences in Poland, Confitura and JDD (Java Developers Day), and also at the Krakow Scala User Group. He has also conducted a live coding session at the Geecon Conference.

Content

Common Problems and Troubleshooting the Spark Distributed Engine

The Course Overview
Eager Computations: Lazy Evaluation
Caching Values: In-Memory Persistence
Unexpected API Behavior: Picking the Proper RDD API
Wide Dependencies: Using Narrow Dependencies

Distributed DataFrames Optimization Pitfalls

Making Computations Parallel: Using Partitions
Defining Robust Custom Functions: Understanding User-Defined Functions
Logical Plans Hiding the Truth: Examining the Physical Plans
Slow Interpreted Lambdas: Code Generation Spark Optimization

Distributed Joins in Cluster

Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume
Slow Joins: Choosing an Execution Plan for Join
Distributed Joins Problem: DataFrame API
TypeSafe Joins Problem: The Newest DataSet API

Solving Problems with Non-Efficient Transformations

Minimizing Object Creation: Reusing Existing Objects
Iterating Transformations – The mapPartitions() Method
Slow Spark Application Start: Reducing Setup Overhead
Performing Unnecessary Recomputation: Reusing RDDs

Troubleshooting Real-Time Processing Jobs in Spark Streaming

Repeating the Same Code in Stream Pipeline: Using Sources and Sinks
Long Latency of Jobs: Understanding Batch Internals
Fault Tolerance: Using Data Checkpointing
Maintaining Batch and Streaming: Using Structured Streaming Pros

Udemy ID: 2064755
Course created date: 12/3/2018
Course indexed date: 12/28/2020
Course submitted by: Bot