Data Engineering on Google Cloud

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.

Objectives

In this course, participants will learn the following skills:
  • Design and build data processing systems on Google Cloud.
  • Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
  • Derive business insights from extremely large datasets using BigQuery.
  • Leverage unstructured data using Spark and ML APIs on Dataproc.
  • Enable instant insights from streaming data.
  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.

Audience

This class is intended for developers who are responsible for:

  • Extracting, loading, transforming, cleaning, and validating data.
  • Designing pipelines and architectures for data processing.
  • Integrating analytics and machine learning capabilities into data pipelines.
  • Querying datasets, visualizing query results, and creating reports.

Prerequisites

To get the most out of this course, participants should have:

  • Completed the Google Cloud Fundamentals Big Data and Machine Learning course or have equivalent experience.
  • Basic proficiency with a common query language such as SQL.
  • Experience with data modeling and ETL (extract, transform, load) activities.
  • Experience with developing applications using a common programming language such as Python.
  • Familiarity with machine learning and/or statistics.

Duration

4 days

Investment

Check the next open public class on our enrollment page. If you are interested in a private training class for your company, contact us.
Data Engineering on Google Cloud dependencies with other courses and certifications

Course Outline

The course includes presentations, demonstrations, and hands-on labs.
  • Explore the role of a data engineer
  • Analyze data engineering challenges
  • Introduction to BigQuery
  • Data lakes and data warehouses
  • Transactional databases versus data warehouses
  • Partner effectively with other data teams
  • Manage data access and governance
  • Build production-ready pipelines
  • Review Google Cloud customer case study
  • Introduction to data lakes
  • Data storage and ETL options on Google Cloud
  • Building a data lake using Cloud Storage
  • Securing Cloud Storage
  • Storing all sorts of data types
  • Cloud SQL as a relational data lake
  • The modern data warehouse
  • Introduction to BigQuery
  • Getting started with BigQuery
  • Loading data
  • Exploring schemas
  • Schema design
  • Nested and repeated fields
  • Optimizing with partitioning and clustering
  • EL, ELT, ETL
  • Quality considerations
  • How to carry out operations in BigQuery
  • Shortcomings
  • ETL to solve data quality issues
  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimize Dataproc
  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregating with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates
  • Dataflow SQL
  • Building batch data pipelines visually with Cloud Data Fusion
  • Components
  • UI overview
  • Building a pipeline
  • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
  • Apache Airflow environment
  • DAGs and operators
  • Workflow scheduling
  • Monitoring and logging
  • Introduction to Pub/Sub
  • Pub/Sub push versus pull
  • Publishing with Pub/Sub code
  • Process streaming data
  • Streaming data challenges
  • Dataflow windowing
  • Streaming into BigQuery and visualizing results
  • High-throughput streaming with Cloud Bigtable
  • Optimizing Cloud Bigtable performance
  • Analytic window functions
  • Using WITH clauses
  • GIS functions
  • Performance considerations
  • What is AI?
  • From ad hoc data analysis to data-driven decisions
  • Options for ML models on Google Cloud
  • Unstructured data is hard
  • ML APIs for enriching data
  • What's a notebook?
  • BigQuery magic and ties to pandas
  • Ways to do ML on GCP
  • Kubeflow
  • AI Hub
  • BigQuery ML for quick model building
  • Supported models
  • Why AutoML?
  • AutoML Vision
  • AutoML NLP
  • AutoML Tables