Data Engineering on Google Cloud
(GC-DEGC)
Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.
What you'll learn
- Design and build data processing systems on Google Cloud.
- Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
- Derive business insights from extremely large datasets using BigQuery.
- Leverage unstructured data using Spark and ML APIs on Dataproc.
- Enable instant insights from streaming data.
- Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.
Target Audience
This class is intended for developers who are responsible for:
- Extracting, loading, transforming, cleaning, and validating data.
- Designing pipelines and architectures for data processing.
- Integrating analytics and machine learning capabilities into data pipelines.
- Querying datasets, visualizing query results, and creating reports.
Prerequisites
- To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience.
- Participants should also have:
  - Basic proficiency with a common query language such as SQL.
  - Experience with data modeling and ETL (extract, transform, load) activities.
  - Experience with developing applications using a common programming language such as Python.
  - Familiarity with machine learning and/or statistics.
Products
- BigQuery
- Cloud Bigtable
- Cloud Storage
- Cloud SQL
- Cloud Spanner
- Dataproc
- Dataflow
- Cloud Data Fusion
- Cloud Composer
- Pub/Sub
- Vertex AI
- Cloud ML APIs
Course Modules
Module 1: Introduction to Data Engineering
- Explore the role of a data engineer
- Analyze data engineering challenges
- Introduction to BigQuery
- Data lakes and data warehouses
- Transactional databases versus data warehouses
- Partner effectively with other data teams
- Manage data access and governance
- Build production-ready pipelines
- Review Google Cloud customer case study
Module 2: Building a Data Lake
- Introduction to data lakes
- Data storage and ETL options on Google Cloud
- Building a data lake using Cloud Storage
- Securing Cloud Storage
- Storing all sorts of data types
- Cloud SQL as a relational data lake
Module 3: Building a Data Warehouse
- The modern data warehouse
- Introduction to BigQuery
- Getting started with BigQuery
- Loading data
- Exploring schemas
- Schema design
- Nested and repeated fields
- Optimizing with partitioning and clustering
Module 4: Introduction to Building Batch Data Pipelines
- EL, ELT, ETL
- Quality considerations
- How to carry out operations in BigQuery
- Shortcomings
- ETL to solve data quality issues
Module 5: Executing Spark on Dataproc
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimize Dataproc
Module 6: Serverless Data Processing with Dataflow
- Introduction to Dataflow
- Why customers value Dataflow
- Dataflow pipelines
- Aggregating with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
- Dataflow SQL
Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
- Building batch data pipelines visually with Cloud Data Fusion
  - Components
  - UI overview
  - Building a pipeline
  - Exploring data using Wrangler
- Orchestrating work between Google Cloud services with Cloud Composer
  - Apache Airflow environment
  - DAGs and operators
  - Workflow scheduling
  - Monitoring and logging
Module 8: Introduction to Processing Streaming Data
- Process streaming data
Module 9: Serverless Messaging with Pub/Sub
- Introduction to Pub/Sub
- Pub/Sub push versus pull
- Publishing with Pub/Sub code
Module 10: Dataflow Streaming Features
- Streaming data challenges
- Dataflow windowing
Module 11: High-Throughput BigQuery and Bigtable Streaming Features
- Streaming into BigQuery and visualizing results
- High-throughput streaming with Cloud Bigtable
- Optimizing Cloud Bigtable performance
Module 12: Advanced BigQuery Functionality and Performance
- Analytic window functions
- Use WITH clauses
- GIS functions
- Performance considerations
Module 13: Introduction to Analytics and AI
- What is AI?
- From ad-hoc data analysis to data-driven decisions
- Options for ML models on Google Cloud
Module 14: Prebuilt ML Model APIs for Unstructured Data
- Unstructured data is hard
- ML APIs for enriching data
Module 15: Big Data Analytics with Notebooks
- What’s a notebook?
- BigQuery magic and ties to Pandas
Module 16: Production ML Pipelines
- Ways to do ML on Google Cloud
- Vertex AI Pipelines
- AI Hub
Module 17: Custom Model Building with SQL in BigQuery ML
- BigQuery ML for quick model building
- Supported models
Module 18: Custom Model Building with AutoML
- Why AutoML?
- AutoML Vision
- AutoML NLP
- AutoML Tables