

Managing Big Data on Google's Cloud Platform

The Second Course in a Series for Attaining the Google Certified Data Engineer

4.10 (91 reviews)



1.5 hours


Last Update: Jul 2017

What you will learn

At the end of the course you'll understand Cloud Dataproc.

You'll also know how to craft machine learning projects at scale on GCP.

You'll also know how to integrate Cloud Dataproc with other core services like BigQuery.

Additionally, you'll learn how to migrate on-premise Hadoop and Spark jobs to Cloud Dataproc.


Welcome to Managing Big Data on Google's Cloud Platform. This is the second course in a series designed to help you attain the coveted Google Certified Data Engineer certification.

Additionally, the series will show you the role of the data engineer on the Google Cloud Platform.

At this juncture, the Google Certified Data Engineer certification is the only real-world certification for data and machine learning engineers.

NOTE: This is NOT a course on Big Data. This is a course on a specific cloud service called Google Cloud Dataproc. The course was designed to be part of a series for those who want to become data engineers on Google's Cloud Platform.

This course is all about Google's Cloud and migrating on-premise Hadoop jobs to GCP. In reality, Big Data is simply about unstructured data. There are two core types of data in the real world. The first is structured data, the kind of data found in a relational database. The second is unstructured data, such as a file sitting on a file system. Approximately 90% of all data in the enterprise is unstructured, and our job is to give it structure.
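As a minimal sketch of what "giving structure" to unstructured data can look like, the Python snippet below turns raw text lines into records with named fields. The log format, the field names, and the sample lines are assumptions made up for illustration; they are not taken from the course material.

```python
import re
from typing import Optional

# Hypothetical line format, assumed for this example only:
# "<timestamp> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return a structured record for one raw line, or None if it does not fit."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Unstructured input: plain lines of text, as they might sit in a file.
raw_lines = [
    "2017-07-01T12:00:00Z INFO job started",
    "2017-07-01T12:00:05Z ERROR disk full on node-3",
    "malformed line without fields",
]

# Structured output: one dict per line that matched the assumed format.
records = [rec for rec in map(parse_line, raw_lines) if rec is not None]

for record in records:
    print(record)
```

Once the lines are records with named fields, they can be filtered, aggregated, or loaded into a table for analysis. On a real cluster the same idea would typically be expressed as a Spark job rather than a local loop.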

Why do we want to give it structure? We want to give it structure so we can analyze it. Recall that 99% of all applied machine learning is supervised learning. That simply means we have a labeled data set and we train our machine learning models on that data set in order to gain insight into the data.

In the course we will spend much of the time working in Cloud Dataproc. This is Google’s managed Hadoop and Spark platform. 

Recall the end goal of big data is to get that data into a state where it can be analyzed and modeled. Therefore, we are also going to cover how to work on machine learning projects with big data at scale.

Please keep in mind this course alone will not give you the knowledge and skills to pass the exam. The course will provide you with the big data knowledge you need for working with Cloud Dataproc and for moving existing projects to the Google Cloud Platform. 

*Five Reasons to Take this Course*

1) The Top Job in the World

The data engineer role is the single most needed role in the world. Many believe that it's the data scientist, but several studies have broken down the job descriptions and found that the most needed position is that of the data engineer.

2) Google's the World Leader in Data

Amazon's AWS is the most used cloud and Azure has the best UI, but no cloud vendor in the world understands data like Google. They are the world leader in open source artificial intelligence. You can't be the leader in AI without being the leader in data.

3) 90% of all Organizational Data is Unstructured

The study of big data is the study of unstructured data. As the data in companies grows, most will need to scale to unprecedented levels. Without the cloud, this won't be possible unless a company makes a significant investment in infrastructure and talent.

4) The Data Revolution is Now

We are in a data revolution. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. We have moved beyond the ubiquitous spreadsheet and graduated from the RDBMS (which will always have a place in the data stack) to NoSQL and Big Data technologies.

5) Data is the Foundation

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data. 

Thank you for your interest in Managing Big Data on Google's Cloud Platform and we will see you in the course!!




Is this Course for You?

Instructor Course Q&A

Unstructured Data

Data Sources



Why Cloud Dataproc

Why Use GCP for Big Data?

On-Premise Hadoop Build

Scaling up or Scaling Out

Zones and Regions

Separating Compute and Storage

Cloud Dataproc Architecture



Cloud Dataproc in Action

Create Cluster Screen

Create Dataproc Cluster in GCP Console

Create a Cluster using the Shell

The Three Dataproc Configurations

Using Preemption on Cloud Dataproc

How GCP Handles Preemption

Image Version Options

Scaling Clusters

Creating a Custom Image

Cluster Customization

3 Steps to Install Additional Software on Clusters

Initialization Actions

High Availability



Submitting Jobs

The Submit Jobs Screen

Submitting Spark Job - Console

Submitting Spark Job - Google Cloud Shell

Submitting PySpark Job - SSH

Moving from On-Premise to Google Cloud Dataproc

Python and Scala Code Reference Change



You're the Data Engineer

White Boarding: Difference between On-prem and Cloud Dataproc

White Boarding: Moving Jobs to GCP

White Boarding: Data Near Clusters

White Boarding: Defining Preemptibles

White Boarding: On-Premise Architecture to GCP

White Boarding: Add Software to Nodes



fernando, 15 March 2021

The course gives an overview of the GCP architecture for Dataproc, but some details are not covered in depth for those who work with Hadoop. It is not just a matter of an architecture change; there are also changes regarding data consumption and data ingestion. The course only covers the process pipeline.

John, 16 August 2019

Great and well planned course from start to finish - especially the last section on Whiteboarding. I just wish it had been a little longer. Thank you for a great course.

Ievgenii, 11 August 2019

10% education, 90% advertisement, which would be fine had I not paid for the privilege of listening to it.

John, 31 July 2019

I thought the class was very informative on how to set up and manage Dataproc for Hadoop clusters. As someone without a lot of exposure to Hadoop and Spark, I only wish there had been a little more detailed and hands on information on the types of jobs and things you can actually do with it.

Arthur, 9 April 2019

This is going through really basic info that any even semi-technical Hadoop engineer knows. It isn't very valuable so far.

Conroy, 13 March 2019

A good intro to the topic. But to pass the exam you need to dive much deeper into every single detail, including troubleshooting and choosing the best option out of several possible options.

Caio, 8 December 2018

The course is very good at introducing topics to the student. However, as with most Udemy courses, the explanations are shallow, and most of the time you have to look up the concepts in more depth on your own.

Kenneth, 17 October 2018

Not a good value. Superficial coverage of a small number of topics. Looney Corn videos are a much better value. Much more content.

Roberto, 4 May 2018

The course is quite succinct and comprehensive; I just think it lacks printed material (PDF) with more detail for later study.

Orhan, 17 February 2018

Nice explanations, but slower talking would be much better. In addition, I expected more visuals, graphs or written important points during the lessons. Sometimes it was difficult to follow. The summary at the end of each chapter was a plus point.

Ramon, 30 November 2017

The topics are interesting and well explained; the only problem I noticed is that some subtitles are incorrect with respect to what is said in the presentation.

Valentina, 22 September 2017

The courses are explained with a lot of detail that includes the history of "why" the tool was created to the "present" of how it is used. Emphasizing the concepts with images scaled the learning curve and made understanding the concepts much easier. As well, the rewind feature is helpful. Finally, the summary page makes the learning comprehensive.

Raj, 25 August 2017

Good continuation to GCP Overview. The benefits of moving from on-premise Hadoop to GCP are well explained.

Anand, 21 August 2017

I found that the lectures make it very easy to understand Cloud Dataproc, which is a managed Hadoop service in GCP.

Julie, 21 July 2017

This is an amazing and first-ever course on Udemy about Big Data on Google Cloud Platform that will support all your Google Cloud needs. Mike is a very good and talented instructor who presented this quality course. Lastly, thank you, Mike West, for this wonderful course. You are the best and this course is worth any price.

