Alerting when a GCP instance fails

Alberto Marchetti

Today I experienced a weird issue with a GCP virtual machine. At first glance it looked like the service crashed, but something about it didn’t seem quite right. I was running a very stable piece of software after all and the metrics didn’t indicate anything that could explain the crash. So what could have caused it to fail?

After inspecting the logs on the machine, there was in fact no trace of a crash. But I did notice a journalctl entry containing the line -- Reboot --. Wait. What just happened? Who is meddling with my machine?

Digging deeper within GCP brought me to the Stackdriver Logging area. I typed the name of my instance, and an intriguing log line popped up.

A-ha!

The keyword here is compute.instances.hostError. Google is telling us that the VM suffered a hardware or software failure and needed to be rebooted.
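Stripped down to the fields that matter for us (real entries carry many more), such a log entry looks roughly like this — the shape is an assumption based on the fields we'll filter on:

```json
{
  "resource": { "type": "gce_instance" },
  "jsonPayload": {
    "event_type": "GCE_OPERATION_DONE",
    "event_subtype": "compute.instances.hostError"
  }
}
```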

Aha, so it wasn’t a devious overlord messing with my machine. Quite the opposite: the automatic failure recovery is a very nice feature. But what would make it even better is getting a warning that my machine had failed, before growing grey hairs hunting for the root cause of the issue.

GCP Cloud Functions to the rescue!

After a little searching I found a neat way to use GCP Cloud Functions to directly monitor GCP logs and create notifications whenever entries containing e.g. compute.instances.hostError appear.

To set up the alerts we’ll need to do the following:

  • Create a Cloud Function that listens to a Pub/Sub topic and sends us a notification whenever a message is published to that topic. In this example it will issue a web request to the notification service Notify17.
  • Set up a logs export stream: whenever a log entry matching our filter is seen by Stackdriver Logging, it is immediately published to the Pub/Sub topic.

I’ll explain how to set this up in detail, so bear with me as it’s going to get a little long. But the process is simple enough, promise.

The notification service

To receive notifications we need a service that will generate them for us.

We’ll use Notify17, a service that accepts many kinds of payloads and lets you process them with templates. It’s ideal for this use case, as logs don’t have a uniform structure.

In case you don’t have a Notify17 account, it only takes a few seconds to set up:

  • Navigate to the Notify17 dashboard and sign in with your preferred method.
  • Enter your new encryption password.

Once you have logged in:

  • Navigate to the notification templates page.
  • Click on the Import button, paste the following template, and import the template by pressing OK:
  • Save the template by clicking the Save button.
  • Copy the Template URL; we'll be using it in our Cloud Function.

The Cloud Function

To create our Cloud Function:

  • Navigate to the Cloud Functions console area, and enable the Cloud Functions API if it’s not already active.
  • Click the CREATE FUNCTION button (top of the screen).
  • Select Cloud Pub/Sub as Trigger.
  • Select Create new topic under the Topic dropdown menu and give the topic a name (e.g. test-logs-topic).
  • If not already selected, choose Node.js 8 as Runtime.
  • In the code editor, paste the following code under the index.js tab, and replace REPLACE_WITH_TEMPLATE_API_KEY with the previously copied Notify17's Template URL:
  • Still in the code editor, paste the following code under the package.json tab:
  • Type notify17 in the Function to execute field. This is the name of the function exported by our JavaScript code.
  • Save the Cloud Function by clicking the Create button.
  • Wait until the Cloud Function is marked as deployed (a green-tick icon represents a successful deployment).

At this stage, every JSON payload you send to this Pub/Sub topic will generate a notification in Notify17! You can try this out from the CLI (replace YOUR_TOPIC_NAME with the previously created topic name):
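Assuming the gcloud CLI is installed and authenticated, publishing a test message could look like this — the payload content is arbitrary, anything your template can render:

```shell
# Publish a throwaway JSON payload to the topic; this should trigger
# the Cloud Function and, in turn, a Notify17 notification.
gcloud pubsub topics publish YOUR_TOPIC_NAME \
  --message='{"textPayload": "Hello from the CLI!"}'
```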

You should receive a Notify17 notification in a few seconds. In case something is not working correctly, you can inspect your Cloud Function logs by navigating to Stackdriver logging and typing the name of your Cloud Function in the search filter.

Now we’re ready to trigger this function with our logs.

Logs export stream

Before the last step we’re going to take a short detour: a simple log monitoring example in which we get notified whenever a GCE instance is created or deleted. The following steps are not strictly necessary, but they let us trigger the logs manually, giving us an easy way to test the feature.

The logging filter we want is:

resource.type="gce_instance"
jsonPayload.event_type="GCE_OPERATION_DONE"

A little explanation:

  • resource.type="gce_instance" -> Only target instance-related logs.
  • jsonPayload.event_type="GCE_OPERATION_DONE" -> Lets us know that the create/delete operation has completed.

Now we have to create an exporter for this event.

  • Navigate to Stackdriver logs viewer.
  • Select the Convert to advanced filter entry on the right menu of the search field.
  • Paste the previous filter in the filter bar and click on the Submit filter button to enable it.
  • Press the CREATE EXPORT button.
  • In the Edit export panel:
      • Give this export a name, e.g. n17-instance-creation.
      • Select Pub/Sub under the Sink service menu.
      • Select our previously created Pub/Sub topic under the Sink destination menu.
  • Save the export by clicking the Create sink button.

Finally, we’re all set up and we can test this bad boy.

Launch a new GCP instance from the dashboard, or using the CLI:
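For example, assuming a default zone is configured for your project, a throwaway instance can be created with:

```shell
# Create a small test instance; its name matches the delete command used later.
gcloud compute instances create test-logs-instance
```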

As soon as the instance creation is completed, you should receive a notification with the content of the related log line!

You can delete the instance with:

gcloud compute instances delete test-logs-instance

You should receive another notification, this time because of the instance deletion.

To tear down the logs export:

  • Navigate to the Logs Exports page.
  • Click on the menu for your previously created export, and click on “Delete sink”.

Ok ok, I’ll admit this wasn’t a short detour, but the point is that we can get a notification for any log line we’re interested in! So, going back to the original problem, all we need to do is create a logs export with the following filter:

resource.type="gce_instance"
jsonPayload.event_type="GCE_OPERATION_DONE"
(
jsonPayload.event_subtype="compute.instances.automaticRestart" OR
jsonPayload.event_subtype="compute.instances.hostError"
)

This will tell us whenever an instance is having trouble because of a hardware/software failure, and when GCP automatically restarts it. Woot!
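As an alternative to clicking through the console, the same sink can be sketched with the gcloud CLI — PROJECT_ID and the topic name are placeholders for your own values:

```shell
# Create a log sink that publishes matching entries to our Pub/Sub topic.
# The writer service account printed by this command must be granted the
# Pub/Sub Publisher role on the topic for messages to flow.
gcloud logging sinks create n17-host-error \
  'pubsub.googleapis.com/projects/PROJECT_ID/topics/test-logs-topic' \
  --log-filter='resource.type="gce_instance" AND jsonPayload.event_type="GCE_OPERATION_DONE" AND (jsonPayload.event_subtype="compute.instances.automaticRestart" OR jsonPayload.event_subtype="compute.instances.hostError")'
```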

If you made it this far, kudos to you and thanks for your time!
