Beginning with foundations, this training explains how Apache Beam and Dataflow work together to meet your data processing needs without the risk of vendor lock-in. The section on developing pipelines covers how you convert your business logic into data processing applications that can run on Dataflow. This training culminates with a focus on operations, which reviews the most important lessons for operating a data application on Dataflow, including monitoring, troubleshooting, testing, and reliability.
Objectives
In this course, participants will learn the following skills:
- Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.
- Summarize the benefits of the Beam Portability Framework and enable it for your Dataflow pipelines.
- Enable Shuffle and Streaming Engine, for batch and streaming pipelines respectively, for maximum performance.
- Enable Flexible Resource Scheduling for more cost-efficient performance.
- Select the right combination of IAM permissions for your Dataflow job.
- Implement best practices for a secure data processing environment.
- Select and tune the I/O of your choice for your Dataflow pipeline.
- Use schemas to simplify your Beam code and improve the performance of your pipeline.
- Develop a Beam pipeline using SQL and DataFrames.
- Perform monitoring, troubleshooting, testing and CI/CD on Dataflow pipelines.
Audience
This training is intended for big data practitioners who want to further their understanding of Dataflow in order to advance their data processing applications, including:
- Data Engineers
- Data Analysts and Data Scientists aspiring to develop Data Engineering skills
Prerequisites
To get the most out of this course, participants should have:
- Completed the Google Cloud Fundamentals: Big Data and Machine Learning course, or equivalent experience
- Basic proficiency with a common query language such as SQL
- Experience with data modeling and extract, transform, load (ETL) activities
- Experience developing applications using a common programming language such as Python
- Completed “Building Batch Data Pipelines” and “Building Resilient Streaming Analytics Systems”, or Data Engineering on Google Cloud
Duration
~ 24 hours (~ 3 days)
Investment
See the current price and upcoming dates for open classes on our registration page.
If you are interested in a private class for your company, contact us.
Course outline
The course includes presentations and hands-on labs.
- Course Introduction
- Beam and Dataflow Refresher
- Beam Portability
- Runner v2
- Container Environments
- Cross-Language Transforms
- Dataflow
- Dataflow Shuffle Service
- Dataflow Streaming Engine
- Flexible Resource Scheduling
- IAM
- Quota
- Data Locality
- Shared VPC
- Private IPs
- CMEK
- Beam Basics
- Utility Transforms
- DoFn Lifecycle
- Windows
- Watermarks
- Triggers
- Sources and Sinks
- Text IO and File IO
- BigQuery IO
- PubSub IO
- Kafka IO
- Bigtable IO
- Avro IO
- Splittable DoFn
- Beam Schemas
- Code Examples
- State API
- Timer API
- Summary
- Schemas
- Handling Unprocessable Data
- Error Handling
- AutoValue Code Generator
- JSON Data Handling
- Utilize DoFn Lifecycle
- Pipeline Optimizations
- Dataflow and Beam SQL
- Windowing in SQL
- Beam DataFrames
- Beam Notebooks
- Job List
- Job Info
- Job Graph
- Job Metrics
- Metrics Explorer
- Logging
- Error Reporting
- Troubleshooting Workflow
- Types of Troubles
- Pipeline Design
- Data Shape
- Source, Sinks, and External Systems
- Shuffle and Streaming Engine
- Testing and CI/CD Overview
- Unit Testing
- Integration Testing
- Artifact Building
- Deployment
- Introduction to Reliability
- Monitoring
- Geolocation
- Disaster Recovery
- High Availability
- Classic Templates
- Flex Templates
- Using Flex Templates
- Google-provided Templates
- Summary
- Quick recap of training topics