Challenge 9: Monitoring the workflow
Introduction
When we run our Cloud Composer DAG manually, it's easy to see whether it has failed, but we've configured it to run automatically every day; what if something goes wrong tomorrow or the day after? To make sure things stay healthy we need continuous monitoring that creates incidents when something goes wrong and notifies the responsible people. This is where Cloud Monitoring comes into play. In this challenge we'll introduce a very basic method of monitoring failed Cloud Composer workflows using Cloud Monitoring.
Description
Create a new Alerting Policy for Failed DAG runs that will be triggered if there's at least 1 failed DAG run for the workflow from the previous challenge, over a rolling window of 10 minutes using the `delta` function, and use an Email Notification Channel that sends an email to your personal account(s). Configure a Policy user label with the key `workflow-name` and the name of the monitored workflow as the value. Set the Policy Severity Level to `Critical` and use sensible values for the other fields.
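If you prefer to script the policy instead of clicking through the console, here is a minimal sketch using the `google-cloud-monitoring` Python client. It assumes the standard Composer metric `composer.googleapis.com/workflow/run_count` on the `cloud_composer_workflow` resource; the project ID, workflow name, and notification channel ID are placeholders you would replace with your own values.

```python
# Sketch only: creates a "Failed DAG runs" alert policy programmatically.
# YOUR_PROJECT_ID, YOUR_WORKFLOW_NAME and YOUR_CHANNEL_ID are placeholders.
from datetime import timedelta

from google.cloud import monitoring_v3

PROJECT = "projects/YOUR_PROJECT_ID"
WORKFLOW = "YOUR_WORKFLOW_NAME"  # the DAG from the previous challenge
CHANNEL = f"{PROJECT}/notificationChannels/YOUR_CHANNEL_ID"  # the email channel

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Failed DAG runs",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Only count failed runs of the monitored workflow.
        filter=(
            'metric.type="composer.googleapis.com/workflow/run_count" '
            'AND resource.type="cloud_composer_workflow" '
            'AND metric.label.state="failed" '
            f'AND resource.label.workflow_name="{WORKFLOW}"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=timedelta(minutes=10),  # 10-minute rolling window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,               # i.e. at least 1 failed run
        duration=timedelta(seconds=0),   # fire as soon as the condition is met
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Failed DAG runs",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    notification_channels=[CHANNEL],
    user_labels={"workflow-name": WORKFLOW},
    severity=monitoring_v3.AlertPolicy.Severity.CRITICAL,  # requires a recent client library version
)

created = client.create_alert_policy(name=PROJECT, alert_policy=policy)
print("Created policy:", created.name)
```

Creating the policy through the console UI achieves the same result; the sketch is only meant to make the individual settings (filter, window, aligner, threshold, label, severity) explicit.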
Create a new `broken.csv` file with the following content (the idea is to have multiple `csv` files with irreconcilable schemas to cause an error):

foo,bar
xx,yy

Upload it to the landing bucket for one of the entities (next to a `data.csv` file, do not overwrite it!) and re-run the Cloud Composer DAG.
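The upload can be done from the console or with `gsutil`; if you prefer to script it, a small sketch with the Cloud Storage Python client is shown below. The bucket name and the `entity/` prefix are hypothetical and should match your own landing bucket layout.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("YOUR_LANDING_BUCKET")  # landing bucket from the earlier challenges

# Upload broken.csv next to the existing data.csv; a different object name
# means nothing gets overwritten.
bucket.blob("entity/broken.csv").upload_from_filename("broken.csv")
```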
Note If you had DAG failures in the previous challenge, the configured alert might be triggered before you re-run the DAG. Ignore that, go ahead and upload the broken file, and re-run the DAG.
When you receive an email for the incident, follow the link to view the incident and then Acknowledge it.
Note This might take ~10 minutes, as Airflow will retry the failing tasks multiple times before giving up and marking the DAG run as failed.
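For context, the delay comes from Airflow's task retry settings, which are typically configured through `default_args` on the DAG. The values below are purely illustrative (do not modify the DAG for this challenge):

```python
# Illustrative only -- do not change the DAG code for this challenge.
# Each failing task is retried with a delay before the run is marked failed,
# which is why the incident takes a while to appear.
from datetime import timedelta

default_args = {
    "retries": 2,                         # each task attempted up to 3 times in total
    "retry_delay": timedelta(minutes=2),  # wait between attempts
}
```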
Success Criteria
- There’s a new failed DAG run.
- There’s a new Cloud Monitoring incident related to the failed DAG run, that’s Acknowledged.
- No code was modified.