In this article, I will go into detail about Parameters in Databricks and how to pass them between Notebooks using widgets. I will also show some nuances of Databricks Workflows related to Parameters and ways to work around them.
Databricks is a cloud-based big data and machine learning platform built on Apache Spark. It was created by the original authors of Apache Spark and provides a convenient interface for working with large volumes of data, as well as tools for analytics and machine learning. It is widely used in clouds such as AWS and Microsoft Azure.
First, we need to write a script that will perform the necessary manipulations with the data. In Databricks, the file that contains the script is called a Notebook. Notebooks in Databricks are very similar to Jupyter Notebooks, which are the de facto standard for interactive development and data analysis in data science and machine learning.
Of course, you can sometimes find solutions where the entire flow of tasks is performed in one huge Notebook, and I usually start development that way, but it is inconvenient for debugging and scaling the process. Therefore, the best practice is to use separate Notebooks for subtasks: I later break the work down into several Notebooks depending on the process, which also simplifies further reuse.
When we write code, we use variables and parameters.
Variables in Databricks, as in most software frameworks, are named entities used to store data or references to data while code is running.
variable = 10

Parameters in Databricks typically refer to the values that are passed to Notebooks or jobs when they are started. Parameters allow you to customize the execution of a Notebook or task by passing certain values to it from outside, which can affect the logic of the Notebook or task. This makes Notebooks and tasks more flexible, since the same Notebook can be used for different scenarios depending on the parameters passed.
Parameters in a Notebook are typically obtained through Databricks widgets or through the parameters of Databricks tasks. An example of creating a parameter using a widget in a Notebook:
dbutils.widgets.text("my_parameter", "default_value", "Parameter Label")And getting the value of this Parameter to write to a Variable:
my_parameter_value = dbutils.widgets.get("my_parameter")

Widgets in Databricks
Widgets in Databricks are interactive controls that can be used in Notebooks to enter user data dynamically. They allow users to easily configure parameters and inputs for Notebooks, making them flexible and reusable for a variety of data analytics and machine learning scenarios.
When you first start working with widgets, it is not entirely clear where to check a box so that they start appearing at the top of the Notebook. It is simple: they appear after executing the code:
dbutils.widgets.text("text_widget", "default")The following line of code assigns the value from the widget to a variable:
input_value = dbutils.widgets.get("text_widget")

If you rename a widget and run the code again, you will end up with a second one. To update widgets, you must first delete the previous ones and then run the creation command again. Here is a command that deletes all widgets; you can always insert it before creating widgets:
dbutils.widgets.removeAll()

Widgets in Databricks support different types of input data:
- text: Input a value in a text box.
- dropdown: Select a value from a list of provided values.
- combobox: Combination of text and dropdown. Select a value from a provided list or input one in the text box.
- multiselect: Select one or more values from a list of provided values.
An example of creating all four kinds of widgets and retrieving their values in Notebook code in Databricks:
Sample code:
# Example of creating a text field
dbutils.widgets.text("text_widget", "default", "1.Enter value")
input_value = dbutils.widgets.get("text_widget")

# Example of creating a dropdown
dbutils.widgets.dropdown("dropdown_widget", "Option1", ["Option1", "Option2", "Option3"], "2.Enter value")
dropdown_value = dbutils.widgets.get("dropdown_widget")

# Example of creating a combobox
dbutils.widgets.combobox("combobox_widget", "Option 100", ["Option1", "Option2", "Option3"], "3.Enter value")
combobox_value = dbutils.widgets.get("combobox_widget")

# Example of creating a multiselect
dbutils.widgets.multiselect("checkbox_widget", "Option1", ["Option1", "Option2", "Option3"], "4.Enter value")
multiselect_value = dbutils.widgets.get("checkbox_widget")
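One nuance worth keeping in mind: dbutils.widgets.get always returns a string, and a multiselect widget returns the selected options as a single comma-separated string. A minimal sketch of turning that result into a Python list, based on the code above:

# multiselect_value contains something like "Option1,Option3";
# split it to get a Python list of the selected options.
selected_list = [v for v in multiselect_value.split(",") if v]
print("selected_list:", selected_list)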
Transferring Parameters in Widgets between Notebooks
Widgets are convenient because they let you quickly change Parameters in a Notebook, but their second and more important purpose is to accept parameters passed from outside. How does that work? For example, let's create two Notebooks.
Code of the Notebook named 0.py:
test_text = "test test test"
job_folder = '/Users/maksim.pachkovskiy@t1a.com' #change to your folder
path_notebook_1 = f'{job_folder}/1'

parameters = {
"text_widget": test_text
}
dbutils.notebook.run(path_notebook_1, 1200, arguments=parameters)
Code of the Notebook named 1.py:
dbutils.widgets.text("text_widget", "default")
input_value = dbutils.widgets.get("text_widget")

After you start executing Notebook 0.py, the execution of 1.py will begin, and at the very bottom, after completion, there will be a link to the job with the completed run.
By clicking on it, you can see on the right side the parameters it received, as well as any printed output.
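If the child Notebook also needs to return something to the caller, this can be done with dbutils.notebook.exit: the dbutils.notebook.run call then returns the passed value as a string. A minimal sketch under that assumption:

# In 1.py: return a value to the calling Notebook.
dbutils.notebook.exit(input_value)

# In 0.py: dbutils.notebook.run returns the exit value as a string.
result = dbutils.notebook.run(path_notebook_1, 1200, arguments=parameters)
print("returned from 1.py:", result)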
Through widgets, Notebooks can accept parameters not only from other Notebooks but also from Azure Data Factory. The picture below shows how parameters are passed:
You can run Notebooks manually, but sooner or later you will most likely need to put the script on a schedule with some frequency. If you click Schedule in the upper right corner of the Notebook and fill in the parameters, a job with a single Notebook will in fact be created in Workflows.
Workflows
Workflows in Databricks are a set of tools and functions for orchestrating and automating a sequence of data science and machine learning tasks. These workflows allow users to schedule, run, track, and manage tasks and Notebooks in Databricks, enabling efficient data processing and analytics on the platform.
We can also move our simple example with two Notebooks into Workflows, but there is a slight difference in how parameters are passed between Notebooks. What is the difference?
Previously, we launched or scheduled the Notebook 0.py, and it passed parameters to the Notebook 1.py.
Now the links to the Notebooks and the dependencies between them are configured through the UI. Therefore, we remove from Notebook 0.py the variables pointing to folders and the block that launches the next Notebook, and instead create a block that writes the parameter to taskValues:
dbutils.jobs.taskValues.set(key = "test_text_new", value = test_text)

As a result, the entire code of Notebook 0.py looks like this:
test_text = "NEW Workflow test"dbutils.jobs.taskValues.set(key = "test_text_new", value = test_text)
There is no need to make any changes to Notebook 1.py, but for the demonstration we will add another widget and print the variables to the screen:
dbutils.widgets.text("text_widget", "default")
input_value = dbutils.widgets.get("text_widget")
dbutils.widgets.text("global_text_widget", "default")
global_input_value = dbutils.widgets.get("global_text_widget")
print("input_value:", input_value)
print("global_input_value:", global_input_value)
Go to Databricks Workflows and click Create job. The creation form for the first Task opens, in which we specify the following settings:
- Task name — Job_0
- Type — Notebook
- Source — Workspace (The Path link can be pointed directly to the Notebook in GitHub.)
- Path — to our first Notebook — 0.py
- Cluster — our cluster for running the job
Next, click + Add task to create a second Task, in which we will need to specify a few more settings:
- Task name — Job_1
- Type — Notebook
- Source — Workspace
- Path — to our second Notebook — 1.py
- Cluster — our cluster for running the job
In the next field, we can set parameters. They can be local or global.
Global parameters are assigned for the entire job and applied in all its tasks (for example, global_text_widget). They are set on the right, in the Job parameters section.
Local parameters refer to an individual task (for example, text_widget).
The value can be set manually or using dynamic expressions. We need to pass the parameter from the first task.
- Parameters — {{tasks.Job_0.values.test_text_new}}
Please note that you need to specify the Task name of the task in which we set the parameter via taskValues, and then the name of the parameter itself. In general, the reference has the form {{tasks.<task_name>.values.<key>}}.
In Workflows, you can conveniently track the execution of Notebooks, and by clicking on the green square we will go to the completed Notebook to see the execution result.
As you can see below, both parameters were successfully passed:
The following picture clearly shows how parameters are passed:
Below is the YAML file for the job settings in Workflow — Test job by Maksim Pachkovskiy:
resources:
  jobs:
    Test_job_by_Maksim_Pachkovskiy:
      name: Test job by Maksim Pachkovskiy
      tasks:
        - task_key: Job_0
          notebook_task:
            notebook_path: /Users/maksim.pachkovskiy@t1a.com/Databricks parameters/02_Workflows/0
            source: WORKSPACE
          existing_cluster_id: 0128-120853-sj80lvox
        - task_key: Job_1
          depends_on:
            - task_key: Job_0
          notebook_task:
            notebook_path: /Users/maksim.pachkovskiy@t1a.com/Databricks parameters/02_Workflows/1
            base_parameters:
              text_widget: "{{tasks.Job_0.values.test_text_new}}"
            source: WORKSPACE
          existing_cluster_id: 0128-120853-sj80lvox
      parameters:
        - name: global_text_widget
          default: NEW GLOBAL test

Passing parameters in Workflow without Widgets
There is another way to pass parameters between Tasks in Workflows without using Widgets. Of course, it is not very convenient for debugging, because we do not see the values and have to print them to the screen, but this method is handy if there is a limit on the number of Widgets and you do not want to pass parameters via JSON, or if you simply want to pass parameters between Notebooks without exposing them as widgets.
Let’s create two Notebooks again.
In the code of the first Notebook — 0.py we will save the parameter using taskValues.set:
test_text = "NEW Workflow without WIDGETS"dbutils.jobs.taskValues.set(key = "test_text_wg", value = test_text)
In the second Notebook, 1.py, we read the value. Note that we again need the Task name of the task where the parameter was set, in this case job_0:
input_value = dbutils.jobs.taskValues.get(taskKey = "job_0", key = "test_text_wg", default = '', debugValue = 0)
print("input_value:", input_value)
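By the way, taskValues are not limited to strings: as far as I know, any JSON-serializable value can be set and read back in the same way. A minimal sketch under that assumption (the key and the values here are illustrative):

# In the upstream task (job_0): a dict is JSON-serializable,
# so it can also be stored as a task value.
stats = {"rows": 1000, "status": "ok"}
dbutils.jobs.taskValues.set(key = "load_stats", value = stats)

# In the downstream task (job_1): read it back; debugValue is what you get
# when running the Notebook interactively, outside of a job.
stats = dbutils.jobs.taskValues.get(taskKey = "job_0", key = "load_stats", debugValue = {"rows": 0, "status": "debug"})
print("rows loaded:", stats["rows"])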
When creating the job and its tasks, we only need to specify a minimal set of settings: the Task names, Type — Notebook, Source — Workspace, the Path to our Notebooks, and the Cluster for running the job:
The picture below shows how parameters are passed:
Below is the YAML file for the job settings in Workflow — Test job by Maksim Pachkovskiy 2:
resources:
  jobs:
    Test_job_by_Maksim_Pachkovskiy_2:
      name: Test job by Maksim Pachkovskiy 2
      tasks:
        - task_key: job_0
          notebook_task:
            notebook_path: /Users/maksim.pachkovskiy@t1a.com/Databricks parameters/03_Workflow_without_Widgets/0
            source: WORKSPACE
          existing_cluster_id: 0128-120853-sj80lvox
        - task_key: job_1
          depends_on:
            - task_key: job_0
          notebook_task:
            notebook_path: /Users/maksim.pachkovskiy@t1a.com/Databricks parameters/03_Workflow_without_Widgets/1
            source: WORKSPACE
          existing_cluster_id: 0128-120853-sj80lvox

It is also possible, of course, to combine both methods of passing parameters.
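For example, here is a minimal sketch (the names are illustrative) of a downstream Notebook that reads one value through a widget, filled from a task or job parameter, and another value through taskValues set by the upstream task job_0:

# Value passed through a widget (filled from a task or job parameter when run in a job).
dbutils.widgets.text("text_widget", "default")
widget_value = dbutils.widgets.get("text_widget")

# Value passed through taskValues by the upstream task (here assumed to be job_0).
task_value = dbutils.jobs.taskValues.get(taskKey = "job_0", key = "test_text_wg", debugValue = "debug")

print("widget_value:", widget_value)
print("task_value:", task_value)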
The article turned out to be quite large and detailed. All the files can be found on my GitHub. If you liked it, subscribe: in the next articles I will tell you about more interesting features in Databricks, and I am also planning an article about an ELT process in Databricks that loads air pollution data from open APIs.