How to Monitor Databricks Jobs: API-Based Dashboard

Large companies use various orchestrators, such as Airflow, to manage scheduled scripts. But if you're a small company, team, or Data Mesh domain where Airflow is overkill in terms of cost and complexity, you can get by with the orchestrator built into Databricks (Jobs & Pipelines).

The UI is quite user-friendly, but:

  • The first problem we encounter is that there are no "folders": all jobs are stored in one large table. If teams don't use prefixes or tags, it's very difficult to figure out whose job is whose.
  • The second problem is that there is no summary view of failed jobs or any other aggregate information.

But in reality, all the necessary information can be extracted from system tables (see the query sketch after this list) and used to build your own dashboards or reports for different kinds of monitoring, for example:

  • a board for monitoring whether jobs are configured with the correct parameters;
  • a board for tracking errors in scheduled scripts;
  • a board for tracking costs and abnormal resource consumption.
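
A minimal sketch of such a query, assuming system tables are enabled in your metastore and you have been granted access to the system.lakeflow schema (the exact columns may differ between releases):

# List current job definitions from the jobs system table,
# keeping only the latest change record per job
jobs_overview = spark.sql("""
    SELECT job_id, name, creator_id, change_time
    FROM system.lakeflow.jobs
    QUALIFY ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY change_time DESC) = 1
""")
display(jobs_overview)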

In this article, we'll look at one example. It's important to fix problems as they arise, not months later.

Script for tracking job parameters

Let's start from the beginning: in this article, we'll build a board for monitoring whether the necessary and important settings I wrote about in the previous article, 11 Common Databricks Mistakes Beginners Make: Best Practices for Data Management and Coding, are actually configured. Specifically:

  • Presence of a timeout in the job/run.
  • Configured alerts/notifications about crashes.
  • Using job clusters instead of shared interactive clusters.
  • Correct permissions: access is granted not only to one person, but to a team.

The script and configuration below can easily be put together and tested on the free edition of Databricks, which has everything you need. In a real environment you may not have a cluster available for the dashboard, so you may need to request one from your administrator.

Writing a script and validation logic via API

First, we need to create a Notebook. Let's import and initialize the Workspace Client:

# Imports for the Databricks SDK and PySpark helpers
# (not all of them are needed for the snippets shown below)
import pytz, requests, pyarrow as pa
from databricks.sdk import WorkspaceClient
from graphlib import TopologicalSorter
from datetime import datetime, timezone
from pprint import pprint
from pyspark.sql import Row
from collections import defaultdict
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Inside a Databricks notebook, WorkspaceClient() authenticates automatically
wc = WorkspaceClient()

Next, we can get a list of all jobs in JSON format and understand which fields we need:

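A minimal sketch, assuming a recent databricks-sdk where the returned job objects expose as_dict() for inspecting the raw JSON:

# List every job in the workspace and inspect the raw structure of the first one
jobs = list(wc.jobs.list())
pprint(jobs[0].as_dict())

The jobs list is reused below when fetching the full definition of each job.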

For each job, you need to extract the configuration (job_id, job_name, timeout, alerts, cluster, etc.) and save it in a table format.

# Fetch the full definition of every job (jobs.get returns the complete settings)
job_details = {j.job_id: wc.jobs.get(j.job_id) for j in jobs}

df1 = spark.createDataFrame([
    Row(
        job_id=str(job_detail.job_id) if job_detail.job_id is not None else "",
        job_name=str(settings.name) if settings.name else "",
        # Job-level timeout; "0" means no timeout is configured
        timeout_seconds=str(settings.timeout_seconds) if hasattr(settings, "timeout_seconds") and settings.timeout_seconds else "0",
        # Recipients of failure notifications ("[]" means no alerts are configured)
        email_on_failure=str(getattr(settings.email_notifications, "on_failure", []) or []),
        # Value of the RUN_DURATION_SECONDS health rule, if one is set
        timeout_job_health=str(next(
            (rule.value
             for rule in (settings.health.rules if settings.health and settings.health.rules else [])
             if getattr(rule.metric, "name", None) == "RUN_DURATION_SECONDS"),
            "")),
    )
    for job_id, job_detail in job_details.items()
    for settings in [job_detail.settings]
])

df1.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("unity.test.jr_jobs")

Using CASE WHEN, we perform the necessary checks and save the result to another table. For example:

df2 = spark.sql("""
    SELECT
        *,
        CASE
            -- timeout_job_health is stored as a string, so cast it safely
            WHEN TRY_CAST(timeout_job_health AS INT) > 0 THEN "OK"
            ELSE CONCAT("timeout ", timeout_job_health, " sec")
        END AS check_job_timeout
    FROM unity.test.jr_jobs
""")

df2.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("unity.test.jr_jobs2")
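
The remaining checks from the list above follow the same pattern. As a hedged sketch (df3 and the exact conditions are illustrative, not part of the original script):

# Flag jobs with no failure alerts and no job-level timeout
df3 = spark.sql("""
    SELECT
        *,
        CASE
            WHEN email_on_failure = '[]' THEN "no failure alerts"
            ELSE "OK"
        END AS check_alerts,
        CASE
            WHEN TRY_CAST(timeout_seconds AS INT) > 0 THEN "OK"
            ELSE "no job timeout"
        END AS check_timeout_seconds
    FROM unity.test.jr_jobs
""")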

Databricks Dashboards make it easy to build a simple dashboard and display all the necessary fields. For convenience, you can open it through Databricks One. The basic features are sufficient for creating an informative dashboard.
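
For instance, the dashboard's dataset can be a plain query over the table produced above (the filter shown here is just one possible choice):

# Preview the dataset for a dashboard widget:
# only the jobs that fail the timeout check
display(spark.sql("""
    SELECT job_id, job_name, timeout_seconds, check_job_timeout
    FROM unity.test.jr_jobs2
    WHERE check_job_timeout <> 'OK'
"""))

The same SELECT can be pasted into the dashboard's dataset editor.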

We had a lot of jobs that were set up "as needed," and mistakes were only caught after the fact. Now:

  • The team administrator or technician has a simple monitoring center.
  • You can quickly identify vulnerable jobs and systematically bring them up to best practices.
  • Minimal entry barrier: a little code + a basic dashboard instead of a heavy orchestration infrastructure.

What's next?

This was a simple example to demonstrate the capabilities. If you liked the idea of basic monitoring, in the next article I'll describe how to build a board for tracking errors in scheduled scripts.

Teams also often migrate between workspaces, and settings aren't always transferred correctly. You can also build a dashboard that shows the progress of such a migration. The possibilities are quite broad, and everything depends on your ideas and needs.

Companies often neglect resource monitoring in the form of DBUs (Databricks Units) and overpay because expensive resources get launched by mistake. It's also possible to create a board to monitor spend and anomalies, but this requires more extensive permissions.
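
As a starting point, a hedged sketch of such a query, assuming access to the billing system tables has been granted (column names follow the system.billing.usage schema and may differ between accounts):

# DBU consumption per job over the last 30 days
display(spark.sql("""
    SELECT
        usage_metadata.job_id,
        sku_name,
        SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
      AND usage_metadata.job_id IS NOT NULL
      AND usage_unit = 'DBU'
    GROUP BY usage_metadata.job_id, sku_name
    ORDER BY dbus DESC
"""))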