Databricks System Table Workspace Health SQL Toolkit


20 System Table queries to understand Databricks usage across the organization

Sync Computing


As data engineers, we need to understand the intricacies of our Databricks environment. You can’t optimize performance, manage budgets, or ensure efficient resource allocation without it. Thankfully, Databricks gives you a behind-the-scenes look at how your workspace is running through system tables. Everything from query performance to job execution and cluster activity is in those tables.

But raw system data can be tricky to navigate, and sometimes you just need a quick answer to that burning question. That’s why we’ve created a Databricks Health SQL Toolkit — a free set of queries you can copy, paste, and run to get instant insights into your environment. Whether you’re debugging slow queries, tracking costs, or just curious about what’s happening under the hood, these queries should help you get there faster.

If you find these queries useful, you can download our 1-click dashboard for Databricks to visualize all of them in a single, shareable dashboard. Wow the team with insights into your Jobs, SQL warehouses, All-Purpose Compute (APC) clusters, and DLT usage.

General Usage

Let’s start with the basics. Here are a few queries to help you get a quick overview of how your organization is using Databricks and determine your next steps.

Who submits the most queries?


Knowing who your most active users are by SQL query count sheds valuable light on workspace utilization and helps you:

  • Identify power users who might benefit from additional training or resources
  • Detect unusual activity patterns that could indicate security issues or inefficient practices
  • Better understand how your organization uses Databricks

This is essential for understanding how to optimize your Databricks environment to meet the needs and usage patterns of your team. It is also useful for cost allocation and user management.

Go here for the query
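
If you want a quick starting point before grabbing the full version, a minimal sketch against the system.query.history table might look like the following (column names assume the standard query history schema):

    -- Top query submitters over the last 30 days (illustrative sketch)
    SELECT
      executed_by,
      COUNT(*) AS query_count
    FROM system.query.history
    WHERE start_time >= current_timestamp() - INTERVAL 30 DAYS
    GROUP BY executed_by
    ORDER BY query_count DESC
    LIMIT 20;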

Who owns the most expensive clusters?


Pinpointing the users who own and run your most expensive Jobs and All-Purpose Compute clusters is the next logical step on the road to cutting costs.

These insights enable you to:

  • Identify the owners of your most expensive clusters
  • Attribute costs to users and groups
  • Detect anomalies and potential misuse

Use this query to identify the users to follow up with to understand why costs are so high and devise a plan to reduce them.

Go here for the query
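
As a rough illustration, a sketch along these lines joins billing usage to cluster metadata and list prices. It assumes the standard system.billing.usage, system.compute.clusters, and system.billing.list_prices schemas, and uses list prices, which ignore any negotiated discounts; the linked query is more complete.

    -- Approximate 30-day list cost per cluster owner (illustrative sketch)
    WITH latest_clusters AS (
      -- system.compute.clusters keeps a row per cluster change; keep the latest
      SELECT cluster_id, cluster_name, owned_by
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) AS rn
        FROM system.compute.clusters
      ) c WHERE rn = 1
    ),
    prices AS (
      -- latest list price per SKU
      SELECT sku_name, pricing.default AS unit_price
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) AS rn
        FROM system.billing.list_prices
      ) lp WHERE rn = 1
    )
    SELECT
      c.owned_by,
      c.cluster_name,
      ROUND(SUM(u.usage_quantity * p.unit_price), 2) AS approx_cost
    FROM system.billing.usage u
    JOIN latest_clusters c ON u.usage_metadata.cluster_id = c.cluster_id
    LEFT JOIN prices p ON u.sku_name = p.sku_name
    WHERE u.usage_date >= date_sub(current_date(), 30)
    GROUP BY c.owned_by, c.cluster_name
    ORDER BY approx_cost DESC
    LIMIT 20;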

How many clusters are currently active, and how were they created?


Understanding your cluster creation activity is crucial for resource management and cost optimization.

This enables you to:

  • Ensure cluster creation practices are in line with policies
  • Understand cluster creation trends
  • Identify rogue clusters created in alternative ways

Knowing where cluster creation originates can help optimize your workflows and ensure better governance.

Go here for the query
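
A minimal sketch of this kind of check against system.compute.clusters (which keeps one row per cluster change, so we take the latest record and filter out deleted clusters) could look like this:

    -- Active clusters grouped by how they were created (illustrative sketch)
    WITH latest AS (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      cluster_source,              -- e.g. UI, API, JOB, PIPELINE
      COUNT(*) AS active_clusters
    FROM latest
    WHERE rn = 1
      AND delete_time IS NULL      -- still active (not deleted)
    GROUP BY cluster_source
    ORDER BY active_clusters DESC;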

Which markets (Spot, On-Demand) do my clusters use?


Market usage data helps you balance the cost and reliability of your infrastructure. While Spot instances may offer up to 90% lower costs, this comes at the expense of reliability: cloud providers can reclaim Spot nodes at any moment, with little to no notice, causing interruptions and even job failures.

This insight will enable you to:

  • Understand spot vs on-demand usage across the organization
  • Identify opportunities to leverage Spot
  • Ensure efficient use of Spot with on-demand fallback

Go here for the query
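
For an AWS workspace, a rough sketch could read the market from each cluster’s cloud attributes; this assumes aws_attributes.availability in system.compute.clusters mirrors the Clusters API (SPOT, ON_DEMAND, SPOT_WITH_FALLBACK), so adapt it for azure_attributes or gcp_attributes as needed.

    -- Cluster count by market / availability type on AWS (illustrative sketch)
    WITH latest AS (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      aws_attributes.availability AS market,   -- assumption: AWS workspace
      COUNT(*) AS clusters
    FROM latest
    WHERE rn = 1
      AND delete_time IS NULL
    GROUP BY aws_attributes.availability
    ORDER BY clusters DESC;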

How “sticky” is my workspace?


Databricks stickiness is a critical metric for understanding user engagement. It is calculated by dividing daily active users (DAU) by monthly active users (MAU) and helps you:

  • Gauge the extent of adoption across the organization
  • Identify trends and changes in usage over time
  • Alert on usage anomalies and potential misuse

A high stickiness ratio means that users are consistently using the workspace, which indicates that you have successfully adopted Databricks within your organization.

Go here for the query
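
As a rough sketch, stickiness can be approximated from SQL activity in system.query.history by dividing distinct daily users by distinct calendar-month users; a trailing 30-day window, or audit-log-based activity, would be more precise.

    -- Daily stickiness = DAU / MAU, using query activity as a proxy (illustrative sketch)
    WITH daily AS (
      SELECT DATE(start_time) AS activity_date,
             COUNT(DISTINCT executed_by) AS dau
      FROM system.query.history
      GROUP BY DATE(start_time)
    ),
    monthly AS (
      SELECT DATE_TRUNC('MONTH', start_time) AS activity_month,
             COUNT(DISTINCT executed_by) AS mau
      FROM system.query.history
      GROUP BY DATE_TRUNC('MONTH', start_time)
    )
    SELECT
      d.activity_date,
      d.dau,
      m.mau,
      ROUND(d.dau / m.mau, 3) AS stickiness
    FROM daily d
    JOIN monthly m
      ON DATE_TRUNC('MONTH', d.activity_date) = m.activity_month
    ORDER BY d.activity_date DESC;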

Cost Optimization

Understanding usage is nice, but cutting costs is the ultimate goal. Here are a few useful queries to help you pinpoint where to focus your optimization efforts for the most impact.

What are the current compute costs associated with my Databricks workspace?


Keeping track of your Databricks compute costs is fundamental for budget management. With so many options (classic, serverless, DBSQL; APC vs. on-demand), it’s hard to know what to use when.

These insights help you:

  • Pinpoint high spend targets
  • Identify over- and under-used options
  • Detect unexpected spend or over-consumption

Go here for the query
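
A minimal sketch of a spend breakdown joins system.billing.usage to system.billing.list_prices; note that list prices ignore discounts, so treat the result as an approximation.

    -- Approximate 30-day list cost by SKU (illustrative sketch)
    WITH prices AS (
      SELECT sku_name, pricing.default AS unit_price
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) AS rn
        FROM system.billing.list_prices
      ) lp WHERE rn = 1
    )
    SELECT
      u.sku_name,
      SUM(u.usage_quantity) AS dbus,
      ROUND(SUM(u.usage_quantity * p.unit_price), 2) AS approx_list_cost
    FROM system.billing.usage u
    LEFT JOIN prices p ON u.sku_name = p.sku_name
    WHERE u.usage_date >= date_sub(current_date(), 30)
    GROUP BY u.sku_name
    ORDER BY approx_list_cost DESC;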

What are the product costs associated with my Databricks workspace?


Knowing which Databricks products you are spending the most on allows you to implement targeted cost-saving measures to ensure you’re getting the most out of your investment.

These insights let you:

  • Identify high cost targets for optimization
  • Get a bird’s eye view of how the org spends on Databricks
  • Detect trends and changing patterns

Go here for the query

Which jobs are most likely over-provisioned?


Identifying the workloads with the lowest average CPU usage is a quick way to detect over-provisioned clusters. This information helps you:

  • Reduce spend from over-provisioned resources
  • Detect misconfigurations
  • Better allocate resources
  • Understand which jobs are the best candidates for Gradient to auto-optimize

This query is a great place to start your waste reduction efforts.

Go here for the query
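
One way to sketch this is to average CPU utilization per cluster from system.compute.node_timeline and tie it back to jobs through usage_metadata.job_id in system.billing.usage (DBUs stand in for cost here):

    -- Job clusters with the lowest average CPU utilization over the last 30 days (illustrative sketch)
    WITH cpu AS (
      SELECT cluster_id,
             AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_percent
      FROM system.compute.node_timeline
      WHERE start_time >= current_timestamp() - INTERVAL 30 DAYS
      GROUP BY cluster_id
    )
    SELECT
      u.usage_metadata.job_id AS job_id,
      c.cluster_id,
      ROUND(c.avg_cpu_percent, 1) AS avg_cpu_percent,
      SUM(u.usage_quantity) AS dbus
    FROM system.billing.usage u
    JOIN cpu c ON u.usage_metadata.cluster_id = c.cluster_id
    WHERE u.usage_metadata.job_id IS NOT NULL
      AND u.usage_date >= date_sub(current_date(), 30)
    GROUP BY u.usage_metadata.job_id, c.cluster_id, c.avg_cpu_percent
    ORDER BY avg_cpu_percent ASC
    LIMIT 20;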

Which jobs are most likely under-provisioned?


Identifying the data pipelines with the highest average memory utilization is an easy way to detect potential bottlenecks and under-provisioned resources. These insights enable you to:

  • Identify jobs with potential bottlenecks
  • Improve performance with increased memory allocation
  • Detect anomalies and trends over time
  • Understand which jobs are the best candidates for Gradient to auto-optimize

Detecting bottlenecks helps you improve performance by identifying the workloads that will benefit from additional resources.

Go here for the query

What are my most expensive jobs over the past 30 days?


Knowing which jobs are costing you the most is crucial for optimizing your data infrastructure. This information helps you:

  • Identify the best candidates for optimization
  • Implement cost-saving for high-spend workloads (or use Gradient for automated Jobs cluster tuning)
  • Better understand your Databricks workload spend

Keeping tabs on your most expensive jobs every month is a great way to anticipate spend and detect costly workloads before they incur too many expenses.

Go here for the query
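
For illustration, a sketch of job-level cost attribution might join usage_metadata.job_id to list prices and to system.lakeflow.jobs for job names (schemas assumed from the standard system tables; list prices ignore discounts):

    -- Most expensive jobs by approximate list cost over the last 30 days (illustrative sketch)
    WITH prices AS (
      SELECT sku_name, pricing.default AS unit_price
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) AS rn
        FROM system.billing.list_prices
      ) lp WHERE rn = 1
    ),
    jobs AS (
      -- system.lakeflow.jobs keeps a row per job change; keep the latest name
      SELECT workspace_id, job_id, name
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) AS rn
        FROM system.lakeflow.jobs
      ) j WHERE rn = 1
    )
    SELECT
      u.usage_metadata.job_id AS job_id,
      j.name AS job_name,
      ROUND(SUM(u.usage_quantity * p.unit_price), 2) AS approx_cost
    FROM system.billing.usage u
    LEFT JOIN prices p ON u.sku_name = p.sku_name
    LEFT JOIN jobs j ON u.workspace_id = j.workspace_id AND u.usage_metadata.job_id = j.job_id
    WHERE u.usage_metadata.job_id IS NOT NULL
      AND u.usage_date >= date_sub(current_date(), 30)
    GROUP BY u.usage_metadata.job_id, j.name
    ORDER BY approx_cost DESC
    LIMIT 20;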

What jobs are growing the most in costs week-over-week?


Workload costs can grow gradually over time. Understanding which of your workloads are growing the most in cost each week enables you to:

  • Be the first to identify jobs that are getting increasingly expensive
  • Pinpoint workloads for optimization
  • Detect anomalies and potential problems

By following week-over-week increases in spend, you can identify jobs that are becoming more expensive, even when spend grows slowly over time.

Go here for the query
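
A minimal sketch uses a window function to compare each job’s weekly DBU consumption to the prior week (DBUs stand in for cost here; multiply by list prices for dollar figures):

    -- Jobs with the largest week-over-week DBU growth (illustrative sketch)
    WITH weekly AS (
      SELECT
        usage_metadata.job_id AS job_id,
        DATE_TRUNC('WEEK', usage_date) AS week_start,
        SUM(usage_quantity) AS dbus
      FROM system.billing.usage
      WHERE usage_metadata.job_id IS NOT NULL
      GROUP BY usage_metadata.job_id, DATE_TRUNC('WEEK', usage_date)
    ),
    with_change AS (
      SELECT
        job_id,
        week_start,
        dbus,
        dbus - LAG(dbus) OVER (PARTITION BY job_id ORDER BY week_start) AS wow_dbu_change
      FROM weekly
    )
    SELECT *
    FROM with_change
    WHERE week_start = DATE_TRUNC('WEEK', current_date())   -- current (partial) week; adjust as needed
    ORDER BY wow_dbu_change DESC
    LIMIT 20;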

What are the most expensive SQL warehouses for the last 30 days?


Pinpointing which warehouses are driving up costs over the last 30 days is a great way to attribute costs to users and teams. This information helps you to:

  • Tie expensive warehouses to owners
  • Simplify chargebacks and cost attribution
  • Identify anomalies and trends over time

This query should be your first step when looking to optimize warehouse costs. It will ensure you focus your efforts on high-impact targets and give you everything you need to plan your next steps.


Go here for the query

What are the most expensive SQL queries over the past 30 days?


Query costs can grow gradually with every run. Use this query to clear out the noise and focus on your most expensive queries by total cost for the last 30 days. This information enables you to:

  • Pinpoint top candidates for optimization
  • Detect anomalies and potential issues
  • Investigate cost increases before they become substantial

Go here for the query

This query helps you understand which queries to investigate further. The following query adds much-needed information to help you take your next steps.

Why are these queries so expensive, and who owns them?


Pinpointing the queries that are costing you the most is a great first step for optimization. But it is just a first step. This query adds granular information about each expensive query. This information helps you:

  • Find queries that require optimization
  • Understand per run and overall costs
  • Better allocate resources

Go here for the query

What are the most costly APC clusters over the past 30 days?


Understanding which APC clusters are costing you the most can help you cut costs by pinpointing candidates for optimization. This information lets you:

  • Focus optimization effort on high-value targets
  • Identify potential candidates for on-demand clusters
  • Better allocate resources

Use this query to pinpoint the clusters that are prime for further investigation.

Go here for the query

Which APC clusters are likely under-utilized?


Identifying your underutilized APC clusters in the past 30 days, along with their relative costs, is a great way to facilitate cost savings and resource optimization. These insights enable you to:

  • Create a short list of clusters for downsizing
  • Cross reference usage with cost to identify issues
  • Detect trends over time

Use this query to detect under-utilized APC clusters and connect the dots between those clusters and the users who created them.

Go here for the query
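
A sketch that connects utilization to spend might join average CPU from system.compute.node_timeline to all-purpose usage in system.billing.usage; the SKU filter and the 30% threshold below are assumptions to adjust for your environment.

    -- Under-utilized all-purpose clusters: low average CPU but non-trivial spend (illustrative sketch)
    WITH cpu AS (
      SELECT cluster_id,
             AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_percent
      FROM system.compute.node_timeline
      WHERE start_time >= current_timestamp() - INTERVAL 30 DAYS
      GROUP BY cluster_id
    ),
    apc_usage AS (
      SELECT usage_metadata.cluster_id AS cluster_id,
             SUM(usage_quantity) AS dbus
      FROM system.billing.usage
      WHERE sku_name LIKE '%ALL_PURPOSE%'          -- assumption: all-purpose SKUs
        AND usage_date >= date_sub(current_date(), 30)
      GROUP BY usage_metadata.cluster_id
    )
    SELECT
      a.cluster_id,
      ROUND(c.avg_cpu_percent, 1) AS avg_cpu_percent,
      a.dbus
    FROM apc_usage a
    JOIN cpu c ON a.cluster_id = c.cluster_id
    WHERE c.avg_cpu_percent < 30                   -- assumption: utilization threshold
    ORDER BY a.dbus DESC
    LIMIT 20;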

What are the most expensive DLT clusters over the past 30 days?


Knowing which Delta Live Tables (DLT) clusters are costing you the most is the first step to optimization. This information lets you:

  • Pinpoint DLT clusters for optimization
  • Identify tables for further analysis
  • Detect anomalies and trends over time

Go here for the query

Performance Management

Two queries to help optimize the performance of your Databricks ecosystem.

Which notebooks are consuming the most DBUs over time?


Determining which notebooks consume the most DBUs will help target your optimization efforts for maximum effect. This information helps you:

  • Identify notebooks for optimization
  • Find problems and misuse
  • Charge costs based on notebook owner

Go here for the query

What are my longest running queries?


Identifying performance bottlenecks is key to keeping your data infrastructure efficient and responsive. This information helps you:

  • Pinpoint queries that might be slowing down your system
  • Detect which queries need more resources vs code optimization
  • Identify patterns in slow queries that could indicate underlying issues

By optimizing these queries, you can make a big difference in the overall responsiveness and efficiency of your Databricks workspace.

Go here for the query
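
A minimal sketch against system.query.history ranks finished statements by total duration:

    -- Longest running queries over the last 30 days (illustrative sketch)
    SELECT
      statement_id,
      executed_by,
      LEFT(statement_text, 100) AS statement_preview,
      ROUND(total_duration_ms / 1000.0, 1) AS duration_seconds
    FROM system.query.history
    WHERE start_time >= current_timestamp() - INTERVAL 30 DAYS
      AND execution_status = 'FINISHED'
    ORDER BY total_duration_ms DESC
    LIMIT 20;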

Data and Workflow Management

A couple of queries to help you manage your data and workflows on Databricks.

What are my most commonly used datasets? Where are they stored?


Knowing which of your datasets is used the most is crucial for optimizing data access and storage. This insight helps you:

  • Prioritize optimization efforts by dataset to help lower storage costs
  • Apply caching strategies for frequently accessed data
  • Make informed decisions about data placement and replication

Optimizing your most-used datasets may help improve overall performance and reduce data transfer cost across your Databricks environment.

Go here for the query
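
As a starting point, read frequency can be approximated from lineage events in system.access.table_lineage by counting how often each table appears as a source; joining the results to system.information_schema.tables can then surface where each table lives.

    -- Most frequently read tables over the last 30 days, based on lineage events (illustrative sketch)
    SELECT
      source_table_full_name,
      COUNT(*) AS read_events,
      COUNT(DISTINCT created_by) AS distinct_users
    FROM system.access.table_lineage
    WHERE source_table_full_name IS NOT NULL
      AND event_date >= date_sub(current_date(), 30)
    GROUP BY source_table_full_name
    ORDER BY read_events DESC
    LIMIT 20;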

How do I track lineage for data transformations in the workspace?


Visualize all the data lineage in your workspace. By mapping the relationships between source and target tables, you gain clarity into how data flows through the workspace.

This information helps you understand:

  • Dependencies between datasets and transformations
  • Where the data used in critical workflows originates from
  • Upstream or downstream effects of changes

Go here for the query
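
A minimal sketch of a lineage map pulls distinct source-to-target pairs from system.access.table_lineage:

    -- Source -> target table relationships observed in the workspace (illustrative sketch)
    SELECT
      source_table_full_name,
      target_table_full_name,
      COUNT(*) AS lineage_events,
      MAX(event_time) AS last_seen
    FROM system.access.table_lineage
    WHERE source_table_full_name IS NOT NULL
      AND target_table_full_name IS NOT NULL
    GROUP BY source_table_full_name, target_table_full_name
    ORDER BY lineage_events DESC;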

Get the queries

For the SQL queries to answer these questions and to access a comprehensive 1-click dashboard visualizing all these insights, go directly to the Databricks SQL Query Toolkit page.

This toolkit provides you with ready-to-use queries and a powerful dashboard to help you gain deep insights into your Databricks environment quickly and easily. Paste these queries into your Databricks SQL Editor to see the results, or fill out the form to get a comprehensive dashboard with answers to all these questions.

Get the comprehensive dashboard


If you’d like all of the queries with corresponding plots in one convenient dashboard, request the comprehensive dashboard here.

Conclusion

By regularly asking and answering questions like these, you’ll be well-equipped to optimize your Databricks workspace, reduce costs, and improve overall performance. Remember, the key to effective data engineering is not just in processing data, but in understanding and optimizing the environment in which that processing occurs.

We’ll be adding additional queries to the Databricks SQL toolkit and 1-click dashboard, so stay tuned!

Do you think these queries are useful? Are we missing any? We’d love to hear what you think! Feel free to drop us a line here.