Starting on the 29th of March, Google Cloud professionals attempting the Data Engineer certification are subject to the latest version of the certification exam. The most recognisable difference is the removal of case studies. I was lucky enough to be challenged with, and pass, the updated version of the Data Engineer certification on the 2nd of April. In my preparation I used materials and an exam guide aimed at v1 of the exam, so to say I was underprepared for the certification is an understatement. While Data Engineering is not something I'd consider a large part of my day-to-day toolkit, I do use BigQuery, Dataproc, Data Studio, GCS, PubSub, Datastore, Cloud SQL and related IAM. The topics of Big Data and Machine Learning are therefore fascinating, novel and challenging to me.
My GCP experience to date
I’ve been learning about GCP since around February 2018 and got fully committed to it from May onwards. I chose GCP over AWS as it was better suited for the current and future needs of the company, Loveholidays.com. We’ve successfully migrated from a physical-server environment with KVM VMs to GCP, where we run most of our workloads inside GKE. Since May I’ve worked with HTTP(S), TCP and Internal load balancers, firewall rules, regular and hosted/service network VPCs, GKE (including VPC-Native alias IP clusters), VPNs, NAT instances, Cloud Routers, Cloud Armor, Hybrid Connectivity with Interconnect, GCE instances with private IPs, network tagging, static routes, VPC peering, Stackdriver and a few more technologies.
While hands-on experience is invaluable, you sometimes miss the bigger picture of the available infrastructure when you find yourself working with only a subset of GCP’s technologies. Here are the study materials I used:
- David das Neves: Awesome GCP Certifications
- Coursera: Data Engineering on Google Cloud Platform Specialization - not a quick course, but it covers the material well and gives you an opportunity to practise what you’ve learned in Qwiklabs.
- LinuxAcademy: Google Cloud Certified Professional Data Engineer
- Qwiklabs: Data Engineering
- YouTube: Introduction to Google Cloud Machine Learning (Google Cloud Next ‘17)
- YouTube: Auto-awesome: advanced data science on Google Cloud Platform (Google Cloud Next ‘17)
- YouTube: Lifecycle of a machine learning model (Google Cloud Next ‘17)
The actual exam
Two hours, 50 questions. The first pass took me an hour, and I had 21 out of 50 questions marked for review (which shows the amount of self-doubt this exam will inflict upon you). I finished the entire exam in 1 hour 15 minutes and was presented with a much-doubted “pass”.
I do not intend to share any of the actual questions, as that would go against the certification’s mission. The topics below are the ones I found harder, or felt less prepared for, even after taking all of the training above.
Key topics: BigQuery, BigTable, Dataflow, PubSub
- Streaming data into BigQuery - know it well.
- High-rate streaming
- Serving large datasets to a BI dashboard (focus on data freshness and cost efficiency)
- Benefits of partitions
- From the point of view of a BigQuery administrator, know the best practices for giving various teams access to team-specific datasets without cross access.
- Methods to increase the number of concurrent slots
- How to verify that an ETL pipeline migrated to BigQuery produces identical results
- Point-in-time snapshots in BigQuery
- Integration with BigQuery ML
- Understand the pros and cons of denormalised data in the context of BigQuery
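Point-in-time reads are worth having at your fingertips: standard SQL lets you query a table as it existed at an earlier timestamp with `FOR SYSTEM_TIME AS OF` (within the travel window, up to seven days). A minimal sketch of building such a query - the project/dataset/table name is made up, and the resulting string would be passed to the `bq` CLI or a client library:

```python
from datetime import datetime, timedelta, timezone

def snapshot_query(table: str, hours_ago: int) -> str:
    """Build a standard-SQL query that reads a BigQuery table
    as it existed `hours_ago` hours in the past (max 7 days)."""
    ts = datetime.now(timezone.utc) - timedelta(hours=hours_ago)
    return (
        f"SELECT * FROM `{table}` "
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{ts:%Y-%m-%d %H:%M:%S}+00'"
    )

# Hypothetical table name, purely for illustration
print(snapshot_query("my-project.sales.orders", hours_ago=1))
```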
- Understand BigTable’s architecture and the key reasons for its high performance
- Know Key Visualizer
- Know when to scale BigTable
- Know performant key/schema design
- Scaling up BigTable
- If you need to double your reads for a prolonged period, what can you do to guarantee the same read latency?
- Dev to Prod cluster promotion
- HDD to SSD data migration
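On key design: a common anti-pattern is leading the row key with a timestamp, which sends all writes to one node. One pattern I found useful to remember is field promotion plus a reversed timestamp, so writes spread across the keyspace and the newest rows for an entity sort first. A sketch (the field names are illustrative, not from any particular schema):

```python
import datetime

# Any constant larger than epoch-millis for the foreseeable future
MAX_TS = 10**13

def row_key(device_id: str, event_time: datetime.datetime) -> str:
    """Compose a BigTable row key that avoids hotspotting:
    lead with a high-cardinality field (device id) rather than a
    timestamp, and reverse the timestamp so newer rows sort first."""
    millis = int(event_time.timestamp() * 1000)
    return f"{device_id}#{MAX_TS - millis}"
```

Because the key starts with the device id, sequential events for many devices land on different tablets instead of hammering a single one.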
- Understand the Apache Beam building blocks: Pipeline, PCollection, PTransform, ParDo
- Know Side Inputs
- Exactly once processing of PubSub messages
- Handling invalid inputs
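For invalid inputs, the pattern to know is the dead-letter output: route unparseable records to a side output for later inspection instead of failing the whole pipeline (in Beam this would be a ParDo with a tagged output). The idea, sketched in plain Python rather than Beam, with made-up record contents:

```python
import json

def partition_records(raw_records):
    """Dead-letter pattern: collect records that fail parsing into a
    separate output instead of crashing on the first bad record."""
    valid, dead_letter = [], []
    for raw in raw_records:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError:
            dead_letter.append(raw)
    return valid, dead_letter

good, bad = partition_records(['{"id": 1}', 'not json', '{"id": 2}'])
```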
- Migrate from Kafka to PubSub
- Know the potential reasons for PubSub-ingesting applications being busier than initially planned
- What PubSub metrics are available in Stackdriver and how to debug producers/consumers
- Ordering messages
- Dealing with duplicate messages
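The key fact behind the duplicates question is that PubSub guarantees at-least-once delivery, so consumers must be idempotent. A minimal sketch of deduplicating by message ID - in production the `seen` set would live in a fast external store (e.g. Datastore or Redis), not in process memory:

```python
def deduplicate(messages, seen=None):
    """Drop messages whose ID has already been processed.
    `messages` is a list of (message_id, payload) pairs here;
    the real PubSub client exposes message_id on each message."""
    seen = set() if seen is None else seen
    unique = []
    for message_id, payload in messages:
        if message_id not in seen:
            seen.add(message_id)
            unique.append((message_id, payload))
    return unique
```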
- Know when to use the Data Transfer Appliance. Hint: slow network, huge dataset, no in-between refreshes.
- When to use the Storage Transfer Service and what its limitations are.
- Know the cost of storage and availability for various products (BigQuery, BigTable, Cloud SQL, GCS) so you can find the cheapest product for a given set of availability/durability criteria.
- How does Dedicated Interconnect impact your data transfer decisions?
- How to continuously sync data between on-prem and GCP
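A back-of-the-envelope calculation makes the appliance-vs-network decision concrete. The link speed and dataset size below are example numbers, not from any exam question:

```python
def transfer_days(dataset_tb: float, link_mbps: float,
                  utilisation: float = 0.8) -> float:
    """Estimate days needed to copy `dataset_tb` terabytes over a
    `link_mbps` link at the given sustained utilisation."""
    bits = dataset_tb * 8 * 10**12            # decimal TB -> bits
    seconds = bits / (link_mbps * 10**6 * utilisation)
    return seconds / 86400

# 100 TB over a 100 Mbps link at 80% utilisation: roughly 116 days,
# which is why a shipped appliance wins for huge, static datasets.
days = transfer_days(100, 100)
```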
- Know best practices for training ML models (training, test, overfitting detection)
- Speeding up TensorFlow applications
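On the training side, the concept to internalise is the train/validation split and what overfitting looks like: training loss keeps falling while validation loss turns upward. A toy sketch of spotting that turning point (the loss values and `patience` threshold are invented for illustration, in the spirit of early stopping):

```python
def overfitting_epoch(val_loss, patience=2):
    """Return the epoch at which validation loss has failed to improve
    for `patience` consecutive epochs (a classic overfitting signal,
    commonly used as an early-stopping trigger), or None."""
    best = float("inf")
    stale = 0
    for epoch, v in enumerate(val_loss):
        if v < best:
            best = v
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Validation loss improves, then rises for two epochs in a row
epoch = overfitting_epoch([0.8, 0.6, 0.55, 0.6, 0.7])  # -> 4
```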
- Know Cloud Data Loss Prevention API
- Know Cloud Natural Language API
- Know Cloud Vision API
- How to allow cross team data access to BigQuery and GCS in a large organisation