Data Engineering on Google Cloud (GC-DEGC)

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.


What you'll learn

  • Design and build data processing systems on Google Cloud.
  • Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
  • Derive business insights from extremely large datasets using BigQuery.
  • Leverage unstructured data using Spark and ML APIs on Dataproc.
  • Enable instant insights from streaming data.
  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.


Target Audience

This class is intended for developers who are responsible for:

  • Extracting, loading, transforming, cleaning, and validating data.
  • Designing pipelines and architectures for data processing.
  • Integrating analytics and machine learning capabilities into data pipelines.
  • Querying datasets, visualizing query results, and creating reports.


Prerequisites

  • To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience.
  • Participants should also have:
    • Basic proficiency with a common query language such as SQL.
    • Experience with data modeling and ETL (extract, transform, load) activities.
    • Experience with developing applications using a common programming language such as Python.
    • Familiarity with machine learning and/or statistics.


Products

  • BigQuery
  • Cloud Bigtable
  • Cloud Storage
  • Cloud SQL
  • Cloud Spanner
  • Dataproc
  • Dataflow
  • Cloud Data Fusion
  • Cloud Composer
  • Pub/Sub
  • Vertex AI
  • Cloud ML APIs


Course Modules

Module 1: Introduction to Data Engineering

  • Explore the role of a data engineer
  • Analyze data engineering challenges
  • Introduction to BigQuery
  • Data lakes and data warehouses
  • Transactional databases versus data warehouses
  • Partner effectively with other data teams
  • Manage data access and governance
  • Build production-ready pipelines
  • Review a Google Cloud customer case study

Module 2: Building a Data Lake

  • Introduction to data lakes
  • Data storage and ETL options on Google Cloud
  • Building a data lake using Cloud Storage
  • Securing Cloud Storage
  • Storing all sorts of data types
  • Cloud SQL as a relational data lake

Module 3: Building a Data Warehouse

  • The modern data warehouse
  • Introduction to BigQuery
  • Getting started with BigQuery
  • Loading data
  • Exploring schemas
  • Schema design
  • Nested and repeated fields
  • Optimizing with partitioning and clustering

Module 4: Introduction to Building Batch Data Pipelines

  • EL, ELT, ETL
  • Quality considerations
  • How to carry out operations in BigQuery
  • Shortcomings
  • ETL to solve data quality issues

Module 5: Executing Spark on Dataproc

  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimize Dataproc

Module 6: Serverless Data Processing with Dataflow

  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregating with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates
  • Dataflow SQL

Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

  • Building batch data pipelines visually with Cloud Data Fusion
  • Components
  • UI overview
  • Building a pipeline
  • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
  • Apache Airflow environment
  • DAGs and operators
  • Workflow scheduling
  • Monitoring and logging

Module 8: Introduction to Processing Streaming Data

  • Process streaming data

Module 9: Serverless Messaging with Pub/Sub

  • Introduction to Pub/Sub
  • Pub/Sub push versus pull
  • Publishing with Pub/Sub code

Module 10: Dataflow Streaming Features

  • Streaming data challenges
  • Dataflow windowing

Module 11: High-Throughput BigQuery and Bigtable Streaming Features

  • Streaming into BigQuery and visualizing results
  • High-throughput streaming with Cloud Bigtable
  • Optimizing Cloud Bigtable performance

Module 12: Advanced BigQuery Functionality and Performance

  • Analytic window functions
  • Use WITH clauses
  • GIS functions
  • Performance considerations

Module 13: Introduction to Analytics and AI

  • What is AI?
  • From ad-hoc data analysis to data-driven decisions
  • Options for ML models on Google Cloud

Module 14: Prebuilt ML Model APIs for Unstructured Data

  • Unstructured data is hard
  • ML APIs for enriching data

Module 15: Big Data Analytics with Notebooks

  • What’s a notebook?
  • BigQuery magic and ties to Pandas

Module 16: Production ML Pipelines

  • Ways to do ML on Google Cloud
  • Vertex AI Pipelines
  • AI Hub

Module 17: Custom Model Building with SQL in BigQuery ML

  • BigQuery ML for quick model building
  • Supported models

Module 18: Custom Model Building with AutoML

  • Why AutoML?
  • AutoML Vision
  • AutoML NLP
  • AutoML Tables