Google Cloud Certified Professional Data Engineer - 2019 Updated exam // Dmitri Lerko

Starting on the 29th of March, Google Cloud professionals attempting the Data Engineer certification will be subject to the latest version of the exam. The most recognisable difference is the removal of case studies. I was lucky enough to be challenged with, and to pass, the updated version of the Data Engineer certification on the 2nd of April. In my preparation I used materials and an exam guide aimed at v1 of the exam, so calling myself underprepared is an understatement. Data Engineering is not something I’d consider a large part of my day-to-day toolkit, although I do use BigQuery, Dataproc, Data Studio, GCS, PubSub, Datastore, Cloud SQL and related IAM. The topics of Big Data and Machine Learning are therefore fascinating, novel and challenging to me.

2019 Updated exam guide.

My GCP experience to date

I’ve been learning about GCP since around February 2018 and became fully committed to it from May onwards. I chose GCP over AWS as it was better suited to current and future company needs at Loveholidays.com. We’ve successfully migrated from an environment of physical servers running KVM VMs to GCP, where we run most of our workloads inside GKE. Since May I’ve worked with HTTP(S), TCP and Internal load balancers, firewall rules, regular and hosted/service network VPCs, GKE (including VPC-Native alias IP clusters), VPNs, NAT Instances, Cloud Routers, Cloud Armor, Hybrid Connectivity with Interconnect, GCE instances with private IPs, network tagging, static routes, VPC peering, Stackdriver and a few more technologies.

Exam preparation

While hands-on experience is invaluable, you can miss the bigger picture of the available infrastructure when you work with only a subset of the available GCP technologies. Here is what I’ve used:

David das Neves: Awesome GCP Certifications
Coursera: Data Engineering on Google Cloud Platform Specialization - not a quick course, but it covers the material well and gives you an opportunity to practice what you’ve learned in Qwiklabs.
LinuxAcademy: Google Cloud Certified Professional Data Engineer
Qwiklabs: Data Engineering
YouTube: Introduction to Google Cloud Machine Learning (Google Cloud Next ‘17)
YouTube: Auto-awesome: advanced data science on Google Cloud Platform (Google Cloud Next ‘17)
YouTube: Lifecycle of a machine learning model (Google Cloud Next ‘17)

An actual exam

Two hours, 50 questions. The first pass took me 1 hour, and I had 21 out of 50 questions marked for review (which shows the amount of self-doubt this exam will inflict upon you). I finished the entire exam in 1 hour 15 minutes and was presented with a much-doubted “pass”.

Preparation suggestions

I do not intend to share any of the actual questions, as this goes against the certification’s mission. The topics covered below are the ones that I found harder or felt less prepared for after taking all of the training above.

Key topics: BigQuery, BigTable, Dataflow, PubSub

BigQuery

Streaming data into BigQuery - know it well.
High-rate streaming
Serving large datasets to a BI dashboard (focus on data freshness and cost efficiency)
Benefits of partitions
From the point of view of a BigQuery administrator, know the best practices for giving each team access to its own datasets without cross-team access.
Methods to increase the number of concurrent slots
How to verify that ETL migrated to BigQuery produced equal results
Point-in-time snapshots in BigQuery
Integration with BigQuery ML
UDFs
Understand the pros and cons of denormalised data in the context of BigQuery
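
One way to think about the ETL verification item above is to compare the old and new outputs as order-independent fingerprints. The sketch below is plain Python over in-memory rows rather than a BigQuery API, and the names `table_fingerprint` and `same_results` are hypothetical. In practice the same idea would more likely be expressed as an aggregate query run against both tables.

```python
import hashlib
from collections import Counter


def table_fingerprint(rows):
    """Return (row count, multiset of row hashes), independent of row order."""
    hashes = Counter()
    for row in rows:
        # Sort the column names so that column order does not affect the hash.
        canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        hashes[digest] += 1  # a Counter keeps duplicate rows distinguishable
    return len(rows), hashes


def same_results(legacy_rows, migrated_rows):
    """True when both outputs contain exactly the same rows, in any order."""
    return table_fingerprint(legacy_rows) == table_fingerprint(migrated_rows)
```

Comparing only row counts would miss changed values, and comparing ordered output would raise false alarms, which is why the sketch hashes whole rows and ignores ordering.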

BigTable

Understand the architecture and the key reasons for its high performance
Know Key Visualizer
Know when to scale BigTable
Know performant key/schema design
Scaling up BigTable
If you need to double your reads for a prolonged period, what can you do to guarantee the same read latency?
Dev to Prod cluster promotion
HDD to SSD data migration
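
As an illustration of the key design item above: Bigtable keeps rows sorted by row key, so keys that begin with a timestamp tend to send all new writes to the same place. A common alternative is to lead with a high-cardinality identifier and append a reversed timestamp so that the newest row for each identifier sorts first. The sketch below assumes a hypothetical `device_id#reversed_timestamp` layout and an arbitrary upper bound `MAX_TS_MS`; it only builds strings and does not call any Bigtable client.

```python
MAX_TS_MS = 10**13 - 1  # assumed upper bound, keeps reversed timestamps 13 digits wide


def row_key(device_id: str, event_ts_ms: int) -> str:
    """Build '<device_id>#<reversed timestamp>' so newer events sort first per device."""
    if not 0 <= event_ts_ms <= MAX_TS_MS:
        raise ValueError("timestamp outside the supported range")
    reversed_ts = MAX_TS_MS - event_ts_ms
    # Zero-padding makes lexicographic order match numeric order.
    return f"{device_id}#{reversed_ts:013d}"
```

With this layout a prefix scan on `device_id#` returns the most recent events first, while writes from different devices spread across the key space.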

Dataflow

Understand Apache Beam building blocks - Pipeline, PCollection, PTransform, ParDo
Know Side Inputs
Exactly once processing of PubSub messages
Handling invalid inputs
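
For the invalid inputs item above, a frequently used approach is a dead-letter output: records that fail parsing are set aside with the error instead of failing the whole pipeline. In Beam this would typically live in a ParDo with an additional output. The sketch below is plain Python and not the Beam API, and the function name and the required `user_id` field are made up for illustration.

```python
import json


def parse_or_reject(raw_messages):
    """Split raw JSON strings into parsed records and a dead-letter list."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            # 'user_id' is a hypothetical required field.
            if not isinstance(record, dict) or "user_id" not in record:
                raise ValueError("missing user_id")
            valid.append(record)
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the reason so it can be inspected later.
            dead_letter.append({"raw": raw, "error": str(err)})
    return valid, dead_letter
```

Keeping the raw payload next to the error message means rejected records can be replayed once the parsing logic or the upstream producer is fixed.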

PubSub

Migrate from Kafka to PubSub
Know potential reasons for PubSub ingesting applications being busier than initially planned
What PubSub metrics are available in Stackdriver and how to debug producers/consumers
Ordering messages
Dealing with duplicate messages
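
On the duplicate messages item above: with at-least-once delivery a subscriber can receive the same message more than once, so the consumer has to be idempotent or has to filter repeats itself. The sketch below is a minimal in-memory filter keyed on a message ID. The `Deduplicator` class, its `max_ids` bound and the `handle` helper are hypothetical and do not use the Pub/Sub client library; a real deployment would more likely keep the seen IDs in a shared store.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen message IDs, evicting the oldest beyond max_ids."""

    def __init__(self, max_ids=10000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def first_time(self, message_id):
        """Return True the first time an ID is seen, False for a repeat."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # drop the oldest remembered ID
        return True


def handle(messages, dedup, process):
    """Call process(payload) once per distinct message ID."""
    for message_id, payload in messages:
        if dedup.first_time(message_id):
            process(payload)
```

The bounded cache is a trade-off: it caps memory use, but a duplicate that arrives after its ID has been evicted will be processed again.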

Data migrations

Know when to use Data Transfer Appliance. Hint - slow network, huge dataset, no in-between refreshes.
Know when to use Transfer Service and what its limitations are.
Know the cost of storage and availability for various products: BigQuery, BigTable, Cloud SQL, GCS to be able to find the cheapest product for a set of availability/durability criteria.
How Dedicated Interconnect impacts your data transfer decisions
How to continuously sync data between on-prem and GCP
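
The Transfer Appliance hint above comes down to simple arithmetic: how long would the dataset take to send over the network you actually have? The sketch below estimates that, assuming decimal terabytes and a hypothetical `utilisation` factor for the share of the link you can really use.

```python
def transfer_days(dataset_tb, link_mbps, utilisation=0.8):
    """Estimate the days needed to send dataset_tb terabytes over a link_mbps link."""
    bits = dataset_tb * 1e12 * 8                      # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * utilisation
    return bits / usable_bits_per_second / 86400      # seconds to days
```

Under these assumptions, `transfer_days(100, 100, utilisation=1.0)` works out to roughly 93 days, while the same 100 TB over a 10 Gbps link takes under a day. That gap is the reason link speed matters so much when choosing between an appliance and an online transfer.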

Dataproc

ML

Know best practices for training ML models (training, test, overfitting detection)
Speeding up TensorFlow applications
Know Cloud Data Loss Prevention API
Know Cloud Natural Language API
Know Cloud Vision API
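
For the overfitting detection item above, one simple signal is a validation loss that stops improving while training continues. The sketch below is a framework-free early-stopping check. The function name and the `patience` parameter are illustrative and not tied to TensorFlow or any other library.

```python
def best_epoch_before_overfit(val_losses, patience=2):
    """Return the epoch with the best validation loss once it has failed to
    improve for `patience` consecutive epochs, or None if that never happens."""
    best_epoch, best_val = 0, float("inf")
    for epoch, val in enumerate(val_losses):
        if val < best_val:
            best_epoch, best_val = epoch, val
        elif epoch - best_epoch >= patience:
            return best_epoch  # validation loss has stalled, likely overfitting
    return None
```

The validation losses must come from data held out from training, otherwise the check cannot tell a model that is learning from one that is memorising.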

IAM

How to allow cross team data access to BigQuery and GCS in a large organisation

Misc

Know how to back up and migrate Datastore
Know Cloud Composer well