Zero-Setup Federated Learning: Train Models Across Private Datasets Using Only Google Colab

Have you ever wanted to train a machine learning model on distributed private data without anyone sharing their raw data? In this tutorial, you’ll learn how to run a complete federated learning workflow directly from Google Colab—no local setup required.

We’ll use the PIMA Indians Diabetes dataset split across two data owners to train a diabetes prediction model collaboratively, all while keeping each party’s data private and secure.

⬩⬩⬩

Overview: The Parties

This federated learning flow involves three parties in two roles:

Data Owners (DO1 & DO2): Organizations that hold private data. Each runs their own Colab notebook to manage their data and approve training jobs.
Data Scientist (DS): The coordinator who proposes the ML project, submits jobs to data owners, and aggregates the results.

Each party runs in a separate Google Colab notebook. You can use three different Google accounts (emails)—or call two friends to join you for a real collaborative experience!

The magic? Raw data never leaves the data owner’s environment; only model updates are shared. And because everything runs in Colab, no local setup is required!
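
To make “only model updates are shared” concrete, here is a minimal sketch of federated averaging, one common way a coordinator combines updates. It is an illustration in plain Python, not the code this tutorial runs: the function name `federated_average`, the two-element weight vectors, and the row counts are all made up, and whether the project uses exactly this rule is an assumption.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    if not client_weights or len(client_weights) != len(num_examples):
        raise ValueError("expected one example count per client update")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n in zip(client_weights, num_examples):
        if len(weights) != dim:
            raise ValueError("all client updates must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Made-up updates: the coordinator sees these numbers, never the rows behind them.
do1_update = [1.0, 2.0]  # hypothetical weights after training on 100 rows
do2_update = [3.0, 6.0]  # hypothetical weights after training on 300 rows
global_weights = federated_average([do1_update, do2_update], [100, 300])
print(global_weights)  # [2.5, 5.0]
```

The owner with more rows pulls the average closer to its own update, which is why each client reports its example count alongside its weights.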

⬩⬩⬩

Prerequisites

Before starting, you’ll need:

Three Google accounts (one for each party), or two friends willing to join
Each party downloads their notebook from the link provided, then uploads and opens it in Google Colab

That’s it! No local Python installation, no complex setup.

⬩⬩⬩

Step 1: Set Up the Data Owners and the Data Scientist

Each data owner runs their own notebook. Let’s start with DO1.

DO1 Notebook

Open a new Colab notebook
Install the syft-flwr package, import it, and log in as the data owner:

!uv pip install -q "git+https://github.com/OpenMined/syft-flwr.git@main"

import syft_client as sc
import syft_flwr

print(f"{sc.__version__ = }")
print(f"{syft_flwr.__version__ = }")

do_email = input("Enter the Data Owner's email: ")
do_client = sc.login_do(email=do_email)

Google will now ask permission for the notebook to access your Google credentials.

Click “Allow” and follow the remaining pop-up windows to complete the process.

Switch to the DO2 and DS notebooks and log in the same way with their respective emails. For example, in the DS notebook:

ds_email = input("Enter the Data Scientist's email: ")
ds_client = sc.login_ds(email=ds_email)

⬩⬩⬩

Step 2: Data Scientist Adds Data Owners as Peers

In the DS notebook, add both data owners as peers:

do1_email = input("Enter the First Data Owner's email: ")
ds_client.add_peer(do1_email)

do2_email = input("Enter the Second Data Owner's email: ")
ds_client.add_peer(do2_email)

# Check that both DOs were added as peers
ds_client.peers

⬩⬩⬩

Step 3: Each Data Owner Creates a Diabetes Dataset

First, the DO downloads the PIMA Indians Diabetes dataset, already split into partitions, from Hugging Face:

from pathlib import Path
from huggingface_hub import snapshot_download

DATASET_DIR = Path("./dataset/").expanduser().absolute()

# Download only once; later runs reuse the local copy
if not DATASET_DIR.exists():
    snapshot_download(
        repo_id="khoaguin/pima-indians-diabetes-database-partitions",
        repo_type="dataset",
        local_dir=DATASET_DIR,
    )

Next, the DO creates a Syft dataset from one partition of the downloaded data, pointing to both its mock and private paths:

partition_number = 0

DATASET_PATH = DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

do_client.create_dataset(
    name="pima-indians-diabetes-database",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH / "private",
    summary="This is a partition of the pima-indians-diabetes-database",
    readme_path=DATASET_PATH / "README.md",
    sync=True,
)
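
A quick pre-flight check can catch a wrong `partition_number` before `create_dataset` is called. The sketch below only assumes the three entries the call above points at (`mock`, `private`, `README.md`); the helper name `missing_entries` is hypothetical, and the demo builds a throwaway directory rather than touching the real download.

```python
import tempfile
from pathlib import Path
from typing import List

# Entries that the create_dataset call in this step refers to
REQUIRED_ENTRIES = ("mock", "private", "README.md")


def missing_entries(partition_dir: Path) -> List[str]:
    """Return the required entries that are absent from a partition folder."""
    return [name for name in REQUIRED_ENTRIES if not (partition_dir / name).exists()]


# Demo on a temporary folder that deliberately lacks the "private" entry
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "pima-indians-diabetes-database-0"
    (demo_dir / "mock").mkdir(parents=True)
    (demo_dir / "README.md").write_text("demo readme")
    demo_missing = missing_entries(demo_dir)

print(demo_missing)  # ['private']
```

In the DO notebook, `missing_entries(DATASET_PATH)` returning an empty list would indicate that the chosen partition has everything the call expects.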

The DO then verifies that the dataset has been created:

do_client.datasets.get_all()

Key concept: The mock_path contains synthetic/sample data that data scientists can explore and write code against. The private_path contains the real data that never leaves this environment.
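
One way to picture the mock/private split is a generator that keeps the column names and rough value ranges of the real data while replacing every row with random values. This is only a sketch of the idea, not how the hosted mock files were produced: the helper `make_mock_rows`, the column names, and the sample values below are all invented for illustration.

```python
import random
from typing import Dict, List


def make_mock_rows(
    private_rows: List[Dict[str, float]], n_rows: int, seed: int = 0
) -> List[Dict[str, float]]:
    """Draw random rows that share the private data's columns and value ranges."""
    rng = random.Random(seed)
    columns = list(private_rows[0].keys())
    # Per-column (min, max) observed in the private rows
    bounds = {
        col: (min(r[col] for r in private_rows), max(r[col] for r in private_rows))
        for col in columns
    }
    return [
        {col: round(rng.uniform(*bounds[col]), 1) for col in columns}
        for _ in range(n_rows)
    ]


# Made-up "private" rows; column names are illustrative, not the dataset's schema.
private_rows = [
    {"glucose": 148.0, "bmi": 33.6},
    {"glucose": 85.0, "bmi": 26.6},
    {"glucose": 183.0, "bmi": 23.3},
]
mock_rows = make_mock_rows(private_rows, n_rows=5)
print(mock_rows[0])
```

Code written against `mock_rows` will see the same columns as the real data, which is what lets a data scientist develop without access to it. Note that even value ranges reveal something about the private rows, so a real mock may be built more conservatively.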

DO2 Notebook

Repeat the same steps in the DO2 notebook, but change the partition number:

partition_number = 1  # DO2 uses partition 1 (or any other partition)

Everything else stays the same. Now you have two data owners, each holding a different slice of the diabetes dataset.
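
For intuition about what “a different slice” means, the sketch below splits row indices into disjoint partitions. It is an assumption-laden illustration: 768 is the row count of the original PIMA dataset, two partitions simply matches the two data owners here, and the hosted dataset may have been split differently. The helper name `partition_indices` is hypothetical.

```python
import random
from typing import List


def partition_indices(n_rows: int, n_partitions: int, seed: int = 42) -> List[List[int]]:
    """Shuffle row indices and deal them into disjoint partitions."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Every n-th shuffled index goes to the same partition
    return [sorted(indices[i::n_partitions]) for i in range(n_partitions)]


partitions = partition_indices(n_rows=768, n_partitions=2)
print([len(p) for p in partitions])  # [384, 384]
```

No index appears in both partitions, so neither owner holds a row the other one has.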

⬩⬩⬩

Step 4: Data Scientist Explores the Data Owner’s Datasets

In the DS notebook, list each data owner’s datasets and inspect the first one:

# Check DO1's datasets

do1_datasets = ds_client.datasets.get_all(datasite=do1_email)

do1_datasets[0].describe()

# Check DO2's datasets

do2_datasets = ds_client.datasets.get_all(datasite=do2_email)

do2_datasets[0].describe()

⬩⬩⬩

Step 5: Data Scientist Proposes and Submits the FL Project

Clone the FL Project

The FL project is built using Flower, a popular open-source federated learning framework. It defines the model architecture, training logic, and client/server communication, all following Flower’s standard patterns. The syft-flwr integration handles secure job submission, data governance, and the communication layer on top.

We have already prepared the FL project, so you only need to clone it in the DS notebook:

from pathlib import Path

!mkdir -p /content/fl-diabetes-prediction

!curl -sL https://github.com/khoaguin/fl-diabetes-prediction/archive/refs/heads/main.tar.gz | tar -xz --strip-components=1 -C /content/fl-diabetes-prediction

SYFT_FLWR_PROJECT_PATH = Path("/content/fl-diabetes-prediction")

print(f"syft-flwr project at: {SYFT_FLWR_PROJECT_PATH}")

Bootstrap the Project

This configures the project with the aggregator (DS) and participating datasites (DOs), and generates the main.py entry point:

import syft_flwr

try:
    # Remove any stale entry point so bootstrap generates a fresh one
    !rm -rf {SYFT_FLWR_PROJECT_PATH / "main.py"}
    print(f"syft_flwr version = {syft_flwr.__version__}")
    # Every peer added in Step 2 participates as a datasite
    do_emails = [peer.email for peer in ds_client.peers]
    syft_flwr.bootstrap(
        SYFT_FLWR_PROJECT_PATH, aggregator=ds_email, datasites=do_emails
    )
    print("Bootstrapped project successfully ✅")
except Exception as e:
    print(f"Bootstrap failed: {e}")
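
Because the datasite list is taken from whatever peers the DS has added, a small sanity check before bootstrapping can save a failed run. The rules below mirror this tutorial’s setup (two data owners, a separate data scientist) and are not documented syft-flwr requirements; `check_participants` is a hypothetical helper and the emails are placeholders.

```python
from typing import List


def check_participants(aggregator: str, datasites: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if len(datasites) < 2:
        problems.append(f"expected at least 2 datasites, got {len(datasites)}")
    if len(set(datasites)) != len(datasites):
        problems.append("duplicate datasite emails")
    if aggregator in datasites:
        problems.append("aggregator email is also listed as a datasite")
    return problems


ok = check_participants("ds@example.com", ["do1@example.com", "do2@example.com"])
bad = check_participants("ds@example.com", ["do1@example.com"])
print(ok)   # []
print(bad)  # ['expected at least 2 datasites, got 1']
```

In the DS notebook this could be called as `check_participants(ds_email, do_emails)` just before bootstrapping.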

Submit Jobs to Data Owners

Now send the FL project to each data owner for review. The job contains the training code—data owners can inspect it before approving execution on their private data.

# Remove cached bytecode so that only source files are submitted
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}

job_name = "fl-diabetes-training"

# Submit to DO1
ds_client.submit_python_job(
    user=do1_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)

# Submit to DO2
ds_client.submit_python_job(
    user=do2_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)

The DS can check the submitted jobs with ds_client.jobs.

⬩⬩⬩

Step 6: Data Owners Approve and Run Jobs

Back in each Data Owner’s notebook, check for incoming jobs:

do_client.jobs
do_client.jobs[0]
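
Reviewing a job ultimately means reading its code, but a small script can point a reviewer at the lines worth a closer look. The sketch below uses Python’s standard `ast` module to list the top-level modules a source file imports and flag a few that can reach the network or spawn processes. The flag list and the sample source are illustrative choices, and since this tutorial does not show where a job’s files are stored on the DO side, the helper works on a source string.

```python
import ast
from typing import Set

# Illustrative list of modules a reviewer might want to look at more closely
FLAGGED = {"socket", "subprocess", "requests", "urllib"}


def imported_modules(source: str) -> Set[str]:
    """Collect top-level module names from absolute imports, without running the code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return modules


# Made-up source standing in for one file of a submitted job
sample_source = """
import torch
from flwr.client import NumPyClient
import subprocess
"""
found = imported_modules(sample_source)
print(sorted(found & FLAGGED))  # ['subprocess']
```

A flagged import is not proof of misuse; it is a prompt to read that part of the code before approving.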

Review and approve the job:

do_client.jobs[0].approve()
do_client.jobs

Process the approved jobs (this runs the actual client-side training on private data for each DO):

do_client.process_approved_jobs()

After this, each DO installs the required packages and runs the client side of the FL workflow.

Repeat these steps in both the DO1 and DO2 notebooks.
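
To picture what “client-side training” produces, here is a plain-Python stand-in: a client starts from the global weights, takes a few gradient steps of logistic regression on its own rows, and returns the updated weights plus its row count. The project’s actual model and training loop live in the cloned repository and are not reproduced here; `local_train`, the learning rate, and the toy rows are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_train(
    global_weights: Sequence[float],
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 5,
) -> Tuple[List[float], int]:
    """Run per-example gradient steps locally; return (new weights, number of rows)."""
    weights = list(global_weights)  # copy so the global weights are not modified
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = pred - y
            for i, xi in enumerate(x):
                weights[i] -= lr * error * xi
    return weights, len(labels)


# Toy private rows: [bias term, one scaled feature] with a binary label
toy_features = [[1.0, 0.2], [1.0, 0.9]]
toy_labels = [0, 1]
start = [0.0, 0.0]
new_weights, n_rows = local_train(start, toy_features, toy_labels)
print(new_weights, n_rows)
```

Only `new_weights` and `n_rows` would leave the data owner’s notebook; `toy_features` and `toy_labels` stay where they are.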

⬩⬩⬩

Step 7: Data Scientist Runs the Federated Learning Aggregator

Back in the Data Scientist notebook, the DS installs the required packages and runs the aggregator-side logic of the federated training:

!uv pip install \
    "flwr-datasets>=0.5.0" \
    "imblearn>=0.0" \
    "loguru>=0.7.3" \
    "pandas>=2.3.0" \
    "ipywidgets>=8.1.7" \
    "scikit-learn==1.7.1" \
    "torch>=2.8.0" \
    "ray==2.31.0"

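Because some of these packages are pinned to exact versions, it can be useful to confirm what the runtime ended up with after the install. The sketch below uses the standard library’s `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list is just the pinned entries from the command above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The two exact pins from the install command
report = installed_versions(["scikit-learn", "ray"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If a reported version differs from the pin, re-running the install cell or restarting the Colab runtime is a reasonable first thing to try.
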
Start the aggregation server:

ds_email = ds_client.email

syftbox_folder = f"/content/SyftBox_{ds_email}"

!SYFTBOX_EMAIL="{ds_email}" SYFTBOX_FOLDER="{syftbox_folder}" uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")}
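
The shell line above sets `SYFTBOX_EMAIL` and `SYFTBOX_FOLDER` for that one command only, then launches `main.py` through `uv run`. For readers who prefer plain Python over notebook shell syntax, a rough equivalent is sketched below. The helper `aggregator_command` and the placeholder email are hypothetical, and the actual launch is left commented out so the snippet runs anywhere.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple


def aggregator_command(email: str, project_path: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and a per-process environment for the aggregator."""
    env = dict(os.environ)  # copy, so the notebook's own environment is untouched
    env["SYFTBOX_EMAIL"] = email
    env["SYFTBOX_FOLDER"] = f"/content/SyftBox_{email}"
    cmd = ["uv", "run", str(Path(project_path) / "main.py")]
    return cmd, env


cmd, env = aggregator_command("ds@example.com", "/content/fl-diabetes-prediction")
print(cmd)

# To launch from the DS notebook instead of using the shell line:
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```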

You can now follow the FL training log as it streams. Once the FL flow is done, check the final job status:

ds_client.jobs

Fun Challenge: From the FL training logs and the jobs’ details, can you find where the aggregated models are saved?

⬩⬩⬩

Step 8: Clean Up

When you’re done, clean up the SyftBox resources in each notebook:

# In DS notebook
ds_client.delete_syftbox()

# In DO1 and DO2 notebooks
do_client.delete_syftbox()

⬩⬩⬩

What Just Happened?

Congratulations! You successfully trained a diabetes prediction model using federated learning:

Two data owners each held a private partition of the PIMA Indians Diabetes dataset
A data scientist coordinated the training without ever seeing the raw data
Model updates were aggregated using the Flower framework
Privacy was preserved: raw data never left each data owner’s Colab environment

This is the core promise of federated learning: collaborative machine learning without sharing sensitive data.

Enjoying this project? Help us grow the community by starring our repos on GitHub. Stars help others discover these tools and keep our contributors motivated!

⬩⬩⬩

Next Steps

Ready to build production federated learning solutions?

We invite data scientists, researchers, and engineers working on production federated learning use cases to apply to our Federated Learning Co-Design Program. You’ll get direct support from the OpenMined team.

Apply to the Co-Design Program Now

Have questions, found some bugs, or want to contribute?

Join the conversation in our Slack Community. Already in the OpenMined workspace? Join the #community-federated-learning channel.

⬩⬩⬩