Challenge 9: Monitoring the workflow
Introduction
When we run our Cloud Composer DAG manually, it's easy to see whether it has failed, but we've configured it to run automatically every day; what if something goes wrong tomorrow or the day after? To make sure things stay healthy we need continuous monitoring that creates incidents when something goes wrong and notifies the responsible people. This is where Cloud Monitoring comes into play. In this challenge we'll introduce a very basic method of monitoring failed Cloud Composer workflows using Cloud Monitoring.
Description
Create a new Alerting Policy for Failed DAG runs that will be triggered if there's at least 1 failed DAG run for the workflow from the previous challenge, over a rolling window of 10 minutes using the `delta` function, and use an Email Notification Channel that sends an email to your personal account(s). Configure a Policy user label with the key `workflow-name` and the name of the monitored workflow as the value. Set the Policy Severity Level to `Critical` and use sensible values for the other fields.
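If you prefer to script the policy instead of clicking through the console, here is a minimal sketch using the `google-cloud-monitoring` Python client. It assumes the standard Composer metric `composer.googleapis.com/workflow/run_count` on the `cloud_composer_workflow` resource; the project ID, workflow name, and notification channel ID are placeholders you would replace with your own values.

```python
# Sketch only: creates a "Failed DAG runs" alert policy programmatically.
# YOUR_PROJECT_ID, YOUR_WORKFLOW_NAME and YOUR_CHANNEL_ID are placeholders.
from datetime import timedelta

from google.cloud import monitoring_v3

PROJECT = "projects/YOUR_PROJECT_ID"
WORKFLOW = "YOUR_WORKFLOW_NAME"  # the DAG from the previous challenge
CHANNEL = f"{PROJECT}/notificationChannels/YOUR_CHANNEL_ID"  # the email channel

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Failed DAG runs",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Only count failed runs of the monitored workflow.
        filter=(
            'metric.type="composer.googleapis.com/workflow/run_count" '
            'AND resource.type="cloud_composer_workflow" '
            'AND metric.label.state="failed" '
            f'AND resource.label.workflow_name="{WORKFLOW}"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=timedelta(minutes=10),  # 10-minute rolling window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,               # i.e. at least 1 failed run
        duration=timedelta(seconds=0),   # fire as soon as the condition is met
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Failed DAG runs",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    notification_channels=[CHANNEL],
    user_labels={"workflow-name": WORKFLOW},
    severity=monitoring_v3.AlertPolicy.Severity.CRITICAL,  # requires a recent client library version
)

created = client.create_alert_policy(name=PROJECT, alert_policy=policy)
print("Created policy:", created.name)
```

Creating the policy through the console UI achieves the same result; the sketch is only meant to make the individual settings (filter, window, aligner, threshold, label, severity) explicit.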
Create a new `broken.csv` file with the following content (the idea is to have multiple `csv` files with irreconcilable schemas to cause an error):

foo,bar
xx,yy

Upload it to the landing bucket for one of the entities (next to a `data.csv` file, do not overwrite it!) and re-run the Cloud Composer DAG.
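The upload can be done from the console or with `gsutil`; if you prefer to script it, a small sketch with the Cloud Storage Python client is shown below. The bucket name and the `entity/` prefix are hypothetical and should match your own landing bucket layout.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("YOUR_LANDING_BUCKET")  # landing bucket from the earlier challenges

# Upload broken.csv next to the existing data.csv; a different object name
# means nothing gets overwritten.
bucket.blob("entity/broken.csv").upload_from_filename("broken.csv")
```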
Note If you had DAG failures in the previous challenge, the configured alert might be triggered before you re-run the DAG. Ignore that, go ahead and upload the broken file, and re-run the DAG.
When you receive an email for the incident, follow the link to view the incident and then Acknowledge it.
Note This might take ~10 minutes, as Airflow will retry the failing tasks multiple times before giving up and marking the DAG run as failed.
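For context, the delay comes from Airflow's task retry settings, which are typically configured through `default_args` on the DAG. The values below are purely illustrative (do not modify the DAG for this challenge):

```python
# Illustrative only -- do not change the DAG code for this challenge.
# Each failing task is retried with a delay before the run is marked failed,
# which is why the incident takes a while to appear.
from datetime import timedelta

default_args = {
    "retries": 2,                         # each task attempted up to 3 times in total
    "retry_delay": timedelta(minutes=2),  # wait between attempts
}
```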
Success Criteria
- There’s a new failed DAG run.
- There’s a new Cloud Monitoring incident related to the failed DAG run, that’s Acknowledged.
- No code was modified.