Today I experienced a weird issue with a GCP virtual machine. At first glance it looked like the service crashed, but something about it didn’t seem quite right. I was running a very stable piece of software after all and the metrics didn’t indicate anything that could explain the crash. So what could have caused it to fail?
After inspecting the logs on the machine, I found no trace of a crash. But I did notice a journalctl entry with the line ---REBOOT---. Wait, what just happened? Who is meddling with my machine?
Digging deeper within GCP brought me to the Stackdriver Logging area. I typed the name of my instance, and an intriguing log line popped up. The keyword in it: compute.instances.hostError. Google tells us that the VM had a hardware or software failure and needed to be rebooted.
Aha, so it wasn’t a devious overlord messing with my machine. Quite the opposite: the automatic failure recovery is actually a very nice feature. But what would make it even better is getting a warning that my machine had failed, before I grow grey hairs looking for the root cause of the issue.
GCP Cloud Functions to the rescue!
After a little searching I found a neat way to use GCP Cloud Functions to directly monitor GCP logs and create notifications whenever entries containing e.g. compute.instances.hostError appear.
To set up the alerts we’ll need to do the following:
- Create a Cloud Function, which will listen to a Pub/Sub topic and send us a notification whenever a message is published to that topic. In this example it will issue a web request to the notification service Notify17.
- Set up a logs export stream: whenever a log entry matching our filter is seen by Stackdriver Logging, it will immediately be published to the Pub/Sub topic.
I’ll explain how to set this up in detail, so bear with me as it’s going to get a little long. But the process is simple enough, promise.
The notification service
To receive notifications we need a service that will generate them for us.
We’ll use Notify17, a service which accepts many types of payloads and lets you parse them using templates. It’s ideal for this use case as logs don’t have a uniform structure.
In case you don’t have a Notify17 account, it only takes a few seconds to set up:
- Navigate to Notify17 dashboard and sign in with your preferred method.
- Enter your new encryption password.
Once you have logged in:
- Navigate to the notification templates page.
- Click on the Import button, paste the following template, and import it by pressing OK:
- Save the template by clicking the Save button.
- Copy the Template URL; we’ll be using it in our Cloud Function.
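The original template isn’t reproduced here. As a purely illustrative sketch (the placeholder syntax and field names are assumptions on my part; the real syntax is defined by Notify17’s template documentation), a template could map log fields onto a notification title and body along these lines:

```
Title: GCP alert: {{jsonPayload.event_subtype}}
Content: Instance {{resource.labels.instance_id}} reported {{jsonPayload.event_type}}
```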
The Cloud Function
To create our Cloud Function:
- Navigate to the Cloud Functions console area. If it’s not active, enable the Cloud Functions API.
- Click the CREATE FUNCTION button (top of the screen).
- Select Cloud Pub/Sub as Trigger.
- Select Create new topic under the Topic dropdown menu and give the topic a name (e.g. test-logs-topic).
- If not already selected, choose Node.js 8 as Runtime.
- In the code editor, paste the following code under the index.js tab, and replace REPLACE_WITH_TEMPLATE_API_KEY with the previously copied Notify17 Template URL:
- Still in the code editor, paste the following code under the package.json tab:
- Type notify17 in the Function to execute field. This is the name of the function exported by our JavaScript code.
- Save the Cloud Function by clicking the Create button.
- Wait until the Cloud Function is marked as deployed (a green tick icon indicates a successful deployment).
At this stage, every JSON payload published to this Pub/Sub topic will generate a notification in Notify17! You can try this out from the CLI (replace YOUR_TOPIC_NAME with the previously created topic name):
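The original command isn’t shown here; with the gcloud CLI it looks like the following (the sample message body is arbitrary, but note it must be valid JSON, since the function parses it as such):

```shell
# Publish a test JSON message to the topic (replace YOUR_TOPIC_NAME).
gcloud pubsub topics publish YOUR_TOPIC_NAME \
  --message '{"textPayload": "Hello from Pub/Sub!"}'
```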
You should receive a Notify17 notification in a few seconds. In case something is not working correctly, you can inspect your Cloud Function logs by navigating to Stackdriver logging and typing the name of your Cloud Function in the search filter.
Now we’re ready to trigger this function with our logs.
Logs exports stream
Before the last step we’re going to take a short detour with a simple log monitoring example, in which we will get notified whenever a GCE instance is created or deleted. The following steps are not strictly necessary, but this way we can trigger the logs manually, giving us an easy way to test the feature.
The logging filter we want is:

resource.type="gce_instance"
jsonPayload.event_type="GCE_OPERATION_DONE"

A little explanation:

- resource.type="gce_instance" -> only target instance-related logs.
- jsonPayload.event_type="GCE_OPERATION_DONE" -> tells us that the create/delete operation has completed.
Now we have to create an exporter for this event.
- Navigate to Stackdriver logs viewer.
- Select the Convert to advanced filter entry in the menu on the right of the search field.
- Paste the previous filter in the filter bar and click the Submit filter button to enable it.
- Press the CREATE EXPORT button.
- In the Edit export panel:
- Give this export a name, e.g. n17-instance-creation.
- Select Pub/Sub under the Sink service menu.
- Select our previously created Pub/Sub topic under the Sink destination menu.
- Save the export by clicking the Create sink button.
Finally, we’re all set up and we can test this bad boy.
Launch a new GCP instance from the dashboard, or using the CLI:
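The original command isn’t reproduced here; based on the delete command later in the article, the instance is named test-logs-instance, so the creation would look like this (any machine type and zone work, using your project defaults):

```shell
# Create a throwaway test instance; its creation triggers the exported log.
gcloud compute instances create test-logs-instance
```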
As soon as the instance creation is completed, you should receive a notification with the content of the related log line!
You can delete the instance with:
gcloud compute instances delete test-logs-instance

You should receive another notification, this time because of the instance deletion.
To tear down the logs export:
- Navigate to the Logs Exports page.
- Click on the menu for your previously created export, and click on “Delete sink”.
Ok ok, I’ll admit this wasn’t a short detour, but the point is that we can get a notification for any log line we’re interested in! So, going back to the original problem, all we need is to create a logs export with the following filter:
resource.type="gce_instance"
jsonPayload.event_type="GCE_OPERATION_DONE"
(
jsonPayload.event_subtype="compute.instances.automaticRestart" OR
jsonPayload.event_subtype="compute.instances.hostError"
)

This will tell us whenever an instance is having trouble because of a hardware or software failure, and when GCP automatically restarts it. Woot!
If you made it this far, kudos to you and thanks for your time!