Every data practitioner eventually runs into the same problem: you need data, but you don’t have it. Perhaps the production database is locked behind access controls. Perhaps the dataset you need doesn’t exist yet because the feature hasn’t shipped. Or maybe you’re writing tests, building a demo, or teaching a workshop and you need something that looks real but carries zero risk. Whatever the reason, the need for synthetic data comes up far more often than most of us would like to admit.
The good news is that fake data can be just as good. If your synthetic data has the right shape, the right types, the right distributions, and the right internal consistency, it can stand in for real data in many different situations.
Pointblank is a Python library for data validation, but over the last several releases (v0.20.0, v0.21.0, and v0.22.0), we’ve been building out a complementary capability: data generation. The idea is simple. You define a schema (the columns, their types, and their constraints), and Pointblank produces n rows of data that conform to it. The result is a Polars or Pandas DataFrame, ready to use.
In this post, I’ll walk through the generate_dataset() function in some depth, show how to build realistic datasets for common scenarios (including a customer data example you might actually use), and highlight the country-specific and coherence features that make the generated data feel surprisingly real.
All examples here use pb.preview() to display results, which renders a compact HTML table showing the head and tail of the dataset. If you want to follow along, install Pointblank with pip install pointblank and make sure you have Polars available.
Starting simple: A schema and a dataset
Everything begins with a Schema object. You declare columns as keyword arguments, using field specification functions to describe each one:
import pointblank as pb
schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    passed=pb.bool_field(p_true=0.7),
)
pb.preview(pb.generate_dataset(schema, n=10, seed=23))

Polars · Rows: 10 · Columns: 3

| | id (Int64) | score (Float64) | passed (Boolean) |
|---|---|---|---|
| 1 | 5749 | 92.48652516259452 | False |
| 2 | 2368 | 94.86057779931771 | False |
| 3 | 1279 | 89.24333440485793 | False |
| 4 | 6025 | 8.355067683068363 | True |
| 5 | 7942 | 59.20272268857353 | True |
| 6 | 7212 | 42.37474082349614 | True |
| 7 | 9684 | 53.00880101180064 | True |
| 8 | 6866 | 13.030294124748053 | True |
| 9 | 3134 | 19.19971575392927 | True |
| 10 | 4145 | 44.4573573873013 | True |
Three columns, three types, ten rows. The seed=23 parameter makes the output reproducible. The id column has unique integers in the range 1000–9999, score is a uniform float between 0 and 100, and passed is True about 70% of the time.
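To make the roles of seed, unique, and p_true concrete, here is a minimal standard-library sketch of the same three ideas. It is not how Pointblank implements generation: the function toy_rows and everything inside it are hypothetical and only mirror the parameters described above.

```python
import random


def toy_rows(n, seed):
    """Hypothetical stand-in for a seeded generator (not Pointblank's code)."""
    rng = random.Random(seed)  # private RNG: same seed -> same sequence
    # unique=True analogue: sample without replacement from the allowed range
    ids = rng.sample(range(1000, 10000), k=n)
    # float_field analogue: uniform floats between 0 and 100
    scores = [rng.uniform(0.0, 100.0) for _ in range(n)]
    # p_true=0.7 analogue: each value is True with probability 0.7
    passed = [rng.random() < 0.7 for _ in range(n)]
    return list(zip(ids, scores, passed))


first = toy_rows(10, seed=23)
second = toy_rows(10, seed=23)
other = toy_rows(10, seed=24)

print(first == second)  # identical seeds give identical rows
print(len({row[0] for row in first}) == 10)  # ids never repeat
```

With only ten rows, the share of True values can sit well away from 70%, which is also why the preview above shows three False values near the top.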
This is already useful for quick prototyping, but the real power shows up when you start using string presets.
String presets: Names, emails, cities, and more
The string_field() function accepts a preset parameter that taps into Pointblank’s built-in data generators. There are over 40 presets covering personal information, locations, business data, internet artifacts, and more. Here’s a small example:
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    company=pb.string_field(preset="company"),
)
pb.preview(pb.generate_dataset(schema, n=10, seed=23))

Polars · Rows: 10 · Columns: 4

| | name (String) | email (String) | city (String) | company (String) |
|---|---|---|---|---|
| 1 | Patricia Williams | patricia_williams@yandex.com | Lubbock | Innovative Systems Solutions |
| 2 | Andrea Mitchell | a_mitchell@gmail.com | Anaheim | Sterling Engineering |
| 3 | Maria Valentine | maria.valentine54@gmail.com | Phoenix | Goldman Sachs |
| 4 | Virginia Walker | virginia.walker@outlook.com | Denver | Evans Group |
| 5 | Brenda Lopez | b_lopez@yahoo.com | San Antonio | Goodwin and Garrett |
| 6 | Lauren Davis | l_davis@outlook.com | New York | Hayes and Kennedy |
| 7 | John West | j_west@zoho.com | Charlotte | UnitedHealth Group |
| 8 | Claire Jackson | claire202@outlook.com | Irvine | First Ventures Group |
| 9 | Ariana Wood | ariana_wood@zoho.com | Seattle | Cox Research |
| 10 | Michael Simmons | michaelsimmons@mail.com | Denver | Williams Industries |
Notice that the email addresses aren’t random gibberish. They’re derived from the person’s name. This is one of Pointblank’s coherence systems at work, and it activates automatically when certain presets appear together in the same schema.
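As a rough illustration of what "derived from the person’s name" can mean, here is a small standard-library sketch. The helper toy_email, its list of local-part styles, and the domain list are all assumptions made for this example (loosely modeled on the addresses in the table above), not Pointblank’s actual rules.

```python
import random


def toy_email(full_name, rng):
    """Hypothetical helper: build an email address from a person's name."""
    first, last = full_name.lower().split(" ", 1)
    last = last.replace(" ", "")
    # A few local-part styles resembling those visible in the table above
    styles = [
        f"{first}_{last}",
        f"{first}.{last}",
        f"{first[0]}_{last}",
        f"{first}{last}",
    ]
    domain = rng.choice(["gmail.com", "outlook.com", "yahoo.com"])
    return f"{rng.choice(styles)}@{domain}"


rng = random.Random(23)
names = ["Patricia Williams", "Andrea Mitchell", "John West"]
emails = [toy_email(name, rng) for name in names]

for name, email in zip(names, emails):
    print(f"{name:20} {email}")
```

The point of the sketch is the dependency direction: the name is generated first, and the email is computed from it rather than drawn independently.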
Building a realistic customer dataset
Let’s put these pieces together for a scenario that comes up constantly in practice: generating a table of customer records. This is the kind of dataset you might need for a dashboard prototype, a workshop exercise, or integration testing of a CRM pipeline.
from datetime import date
schema = pb.Schema(
    customer_id=pb.int_field(min_val=10000, max_val=99999, unique=True),
    first_name=pb.string_field(preset="first_name"),
    last_name=pb.string_field(preset="last_name"),
    email=pb.string_field(preset="email"),
    phone=pb.string_field(preset="phone_number"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    signup_date=pb.date_field(
        min_date=date(2022, 1, 1),
        max_date=date(2025, 12, 31),
    ),
    is_active=pb.bool_field(p_true=0.8),
    lifetime_spend=pb.float_field(min_val=0.0, max_val=5000.0),
)
customers = pb.generate_dataset(schema, n=50, seed=23)
pb.preview(customers)

Polars · Rows: 50 · Columns: 11

| | customer_id (Int64) | first_name (String) | last_name (String) | email (String) | phone (String) | city (String) | state (String) | postcode (String) | signup_date (Date) | is_active (Boolean) | lifetime_spend (Float64) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 47999 | Paul | Woods | paulwoods@hotmail.com | (512) 899-4802 | Lubbock | Texas | 79468 | 2023-08-17 | False | 4624.326258129726 |
| 2 | 20951 | Mark | Smith | mark684@icloud.com | (310) 986-0270 | Anaheim | California | 92873 | 2022-06-21 | False | 4743.028889965885 |
| 3 | 12238 | Willow | Fowler | willowfowler@gmail.com | (623) 938-2304 | Phoenix | Arizona | 85032 | 2022-02-04 | False | 4462.166720242896 |
| 4 | 87598 | Roger | Graham | roger.graham@zoho.com | (970) 514-7904 | Denver | Colorado | 80232 | 2025-04-27 | True | 417.7533841534181 |
| 5 | 50205 | Karen | Horn | karen.horn70@gmail.com | (210) 987-2966 | San Antonio | Texas | 78271 | 2023-09-21 | True | 2960.1361344286765 |
| 46 | 72136 | Hannah | Weaver | hannahweaver@yahoo.com | (419) 998-5523 | Columbus | Ohio | 43255 | 2022-06-25 | True | 1377.8223075007618 |
| 47 | 33282 | Martin | Ramos | martin_ramos@yahoo.com | (951) 234-6078 | San Jose | California | 95170 | 2024-08-28 | True | 2864.109474442189 |
| 48 | 73318 | Audrey | Jackson | audrey_jackson@aol.com | (252) 401-8878 | Charlotte | North Carolina | 28226 | 2022-12-30 | False | 4103.315904362622 |
| 49 | 87412 | Christina | Cannon | ccannon13@aol.com | (320) 486-6471 | St. Paul | Minnesota | 55195 | 2024-09-16 | True | 1654.024239966494 |
| 50 | 68648 | Melissa | Nelson | m_nelson@yandex.com | (260) 590-0851 | Bloomington | Indiana | 47493 | 2025-04-24 | True | 1848.269660030496 |
What we get here is 50 rows of plausible customer data. The city, state, and postcode are coherent within each row (a customer in "San Antonio" will have Texas as the state and a Texas ZIP code). The email is derived from the customer’s name. The phone number matches the region. None of this required any manual wiring. Pointblank detects the preset combinations and applies the appropriate coherence rules.
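If you want to spot-check that kind of coherence yourself, one simple approach is to confirm that no city shows up with more than one state. The sketch below uses only the standard library and a handful of (city, state) pairs copied from the preview above; the helper conflicting_cities is a hypothetical name of my own, not part of Pointblank.

```python
from collections import defaultdict

# (city, state) pairs copied from the preview above
rows = [
    ("Lubbock", "Texas"),
    ("Anaheim", "California"),
    ("Phoenix", "Arizona"),
    ("Denver", "Colorado"),
    ("San Antonio", "Texas"),
    ("Columbus", "Ohio"),
    ("San Jose", "California"),
    ("Charlotte", "North Carolina"),
]


def conflicting_cities(pairs):
    """Return cities that appear with more than one state."""
    states_by_city = defaultdict(set)
    for city, state in pairs:
        states_by_city[city].add(state)
    return {c: s for c, s in states_by_city.items() if len(s) > 1}


print(conflicting_cities(rows))  # → {}

# A deliberately incoherent row shows what the check would catch
broken = rows + [("Denver", "Ohio")]
print(conflicting_cities(broken))
```

Real city names can legitimately repeat across states, so a hit from this check is a prompt to look closer rather than proof of a bug.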
Extending with Polars
Since the default output is a Polars DataFrame, you can immediately layer on transformations. Let’s add a loyalty tier based on lifetime spend:
import polars as pl
customers_tiered = customers.with_columns(
    pl.when(pl.col("lifetime_spend") >= 3000)
    .then(pl.lit("Gold"))
    .when(pl.col("lifetime_spend") >= 1000)
    .then(pl.lit("Silver"))
    .otherwise(pl.lit("Bronze"))
    .alias("loyalty_tier")
)
pb.preview(customers_tiered)

Polars · Rows: 50 · Columns: 12

| | customer_id (Int64) | first_name (String) | last_name (String) | email (String) | phone (String) | city (String) | state (String) | postcode (String) | signup_date (Date) | is_active (Boolean) | lifetime_spend (Float64) | loyalty_tier (String) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 47999 | Paul | Woods | paulwoods@hotmail.com | (512) 899-4802 | Lubbock | Texas | 79468 | 2023-08-17 | False | 4624.326258129726 | Gold |
| 2 | 20951 | Mark | Smith | mark684@icloud.com | (310) 986-0270 | Anaheim | California | 92873 | 2022-06-21 | False | 4743.028889965885 | Gold |
| 3 | 12238 | Willow | Fowler | willowfowler@gmail.com | (623) 938-2304 | Phoenix | Arizona | 85032 | 2022-02-04 | False | 4462.166720242896 | Gold |
| 4 | 87598 | Roger | Graham | roger.graham@zoho.com | (970) 514-7904 | Denver | Colorado | 80232 | 2025-04-27 | True | 417.7533841534181 | Bronze |
| 5 | 50205 | Karen | Horn | karen.horn70@gmail.com | (210) 987-2966 | San Antonio | Texas | 78271 | 2023-09-21 | True | 2960.1361344286765 | Silver |
| 46 | 72136 | Hannah | Weaver | hannahweaver@yahoo.com | (419) 998-5523 | Columbus | Ohio | 43255 | 2022-06-25 | True | 1377.8223075007618 | Silver |
| 47 | 33282 | Martin | Ramos | martin_ramos@yahoo.com | (951) 234-6078 | San Jose | California | 95170 | 2024-08-28 | True | 2864.109474442189 | Silver |
| 48 | 73318 | Audrey | Jackson | audrey_jackson@aol.com | (252) 401-8878 | Charlotte | North Carolina | 28226 | 2022-12-30 | False | 4103.315904362622 | Gold |
| 49 | 87412 | Christina | Cannon | ccannon13@aol.com | (320) 486-6471 | St. Paul | Minnesota | 55195 | 2024-09-16 | True | 1654.024239966494 | Silver |
| 50 | 68648 | Melissa | Nelson | m_nelson@yandex.com | (260) 590-0851 | Bloomington | Indiana | 47493 | 2025-04-24 | True | 1848.269660030496 | Silver |
Or compute a summary by state:
pb.preview(
    customers_tiered
    .group_by("state", "loyalty_tier")
    .agg(
        pl.col("customer_id").count().alias("count"),
        pl.col("lifetime_spend").mean().alias("avg_spend"),
    )
    .sort("state", "loyalty_tier")
)

Polars · Rows: 35 · Columns: 4

| | state (String) | loyalty_tier (String) | count (UInt32) | avg_spend (Float64) |
|---|---|---|---|---|
| 1 | Arizona | Gold | 2 | 3882.413633247243 |
| 2 | Arizona | Silver | 1 | 2860.2339059589044 |
| 3 | California | Bronze | 3 | 561.8318745352304 |
| 4 | California | Gold | 2 | 4798.352513930336 |
| 5 | California | Silver | 3 | 2503.2021274153226 |
| 31 | Texas | Bronze | 1 | 978.0392640195001 |
| 32 | Texas | Gold | 3 | 3904.2742899489526 |
| 33 | Texas | Silver | 3 | 2140.2056508843366 |
| 34 | Washington | Bronze | 2 | 623.879217971278 |
| 35 | Washington | Gold | 1 | 3671.0453939174777 |
This is the workflow I keep coming back to: use Pointblank to generate the raw material, then use Polars to shape it into whatever you actually need.
Country-specific data
One of the features I’m most excited about is country-specific data generation. Pointblank ships with locale data for 100 countries, covering names, cities, states/provinces, postcodes, phone number formats, and much more. Switching locales is a single parameter (country=); here’s an example that gets person data for Germany ("DE"):
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    phone=pb.string_field(preset="phone_number"),
)
pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="DE"))

Polars · Rows: 8 · Columns: 5

| | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Ignaz Schulze | ignazschulze@freenet.de | Potsdam | Brandenburg | (0335) 150-6730 |
| 2 | Sandra Schneider | sandra922@mail.de | Halle (Saale) | Sachsen-Anhalt | (0391) 478-3743 |
| 3 | Antje Jung | antje_jung@yahoo.de | Frankfurt am Main | Hessen | (069) 188-2883 |
| 4 | Jennifer Opitz | j_opitz@gmx.de | Leipzig | Sachsen | (0371) 162-0756 |
| 5 | Eva Lehmann | evalehmann@outlook.de | Cologne | Nordrhein-Westfalen | (0231) 961-3846 |
| 6 | Alexandra Koch | alexandra.koch@outlook.de | Berlin | Berlin | (030) 489-8041 |
| 7 | Christiane Becker | cbecker@gmail.com | Stuttgart | Baden-Württemberg | (0711) 258-6321 |
| 8 | Thomas Mertens | thomas.mertens@posteo.de | Magdeburg | Sachsen-Anhalt | (0345) 881-3877 |
The result contains German names, cities, and phone numbers whose area codes belong to the same state as the city. Switch to "AU" and you get Australian data:
pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="AU"))

Polars · Rows: 8 · Columns: 5

| | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Ethan Ryan | ethanryan@bigpond.com | Toowoomba | Queensland | (07) 0308 7150 |
| 2 | Olivia Jones | olivia922@dodo.com.au | Hobart | Tasmania | (03) 7301 4783 |
| 3 | Thea Roberts | troberts@icloud.com | Melbourne | Victoria | (03) 4311 8828 |
| 4 | Frankie Rowe | frankierowe@mail.com | Brisbane | Queensland | (07) 4162 0756 |
| 5 | Freya Lee | flee64@internode.on.net | Brisbane | Queensland | (07) 9613 8466 |
| 6 | Audrey Taylor | audreytaylor@optusnet.com.au | Melbourne | Victoria | (03) 8980 4102 |
| 7 | Sadie Brown | sadie.brown@protonmail.com | Brisbane | Queensland | (07) 8632 1588 |
| 8 | John Dawson | john_dawson@fastmail.com.au | Perth | Western Australia | (08) 3877 4056 |
Or Brazilian data:
pb.preview(pb.generate_dataset(schema, n=8, seed=23, country="BR"))

Polars · Rows: 8 · Columns: 5

| | name (String) | email (String) | city (String) | state (String) | phone (String) |
|---|---|---|---|---|---|
| 1 | Bruno Soares | brunosoares@terra.com.br | Campinas | São Paulo | (14) 0308-7150 |
| 2 | Ana Santos | ana922@zipmail.com.br | Porto Alegre | Rio Grande do Sul | (55) 7301-4783 |
| 3 | Regina Andrade | randrade@bol.com.br | Rio de Janeiro | Rio de Janeiro | (22) 4311-8828 |
| 4 | Lorena Nóvoa | lorenanovoa@icloud.com | Belo Horizonte | Minas Gerais | (35) 3416-2075 |
| 5 | Alícia Lopes | alopes64@yahoo.com.br | Belo Horizonte | Minas Gerais | (37) 6296-1384 |
| 6 | Vitória Ferreira | vitoriaferreira@globo.com | Rio de Janeiro | Rio de Janeiro | (22) 6489-8041 |
| 7 | Stella Souza | stella.souza@live.com | Belo Horizonte | Minas Gerais | (31) 2586-3215 |
| 8 | José Brito | jose_brito@protonmail.com | Brasilia | Distrito Federal | (61) 3877-4056 |
The country parameter accepts ISO alpha-2 codes ("US", "DE", "JP") and alpha-3 codes ("USA", "DEU", "JPN").
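Accepting both code lengths usually comes down to normalizing everything to one form before a lookup. The sketch below shows that idea for just the three countries named above. The helper to_alpha2 and its case-insensitive handling are assumptions for illustration; I am not claiming this is how Pointblank resolves codes internally.

```python
# Only the three countries named above; a real table would cover many more
ALPHA3_TO_ALPHA2 = {"USA": "US", "DEU": "DE", "JPN": "JP"}


def to_alpha2(code):
    """Hypothetical helper: map an ISO 3166-1 code to its alpha-2 form."""
    code = code.strip().upper()
    if len(code) == 2 and code in ALPHA3_TO_ALPHA2.values():
        return code  # already alpha-2
    if code in ALPHA3_TO_ALPHA2:
        return ALPHA3_TO_ALPHA2[code]  # alpha-3 -> alpha-2
    raise ValueError(f"unknown country code: {code!r}")


print(to_alpha2("DE"))   # → DE
print(to_alpha2("DEU"))  # → DE
```

Normalizing early means the rest of a pipeline only ever has to deal with one spelling of each country.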
Mixing multiple countries
For datasets that need to represent a multinational user base, pass a list for an equal distribution or a dictionary for weighted proportions:
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    city=pb.string_field(preset="city"),
    country=pb.string_field(preset="country"),
)
# Weighted: 60% US, 25% Germany, 15% Japan
mixed = pb.generate_dataset(
    schema, n=20, seed=23,
    country={"US": 0.60, "DE": 0.25, "JP": 0.15},
)
pb.preview(mixed)

Polars · Rows: 20 · Columns: 4

| | name (String) | email (String) | city (String) | country (String) |
|---|---|---|---|---|
| 1 | Jens Hartmann | j_hartmann@gmail.com | Augsburg | Germany |
| 2 | Cooper Richards | c_richards@aol.com | Akron | United States |
| 3 | Martina Koch | m_koch@gmx.de | Heilbronn | Germany |
| 4 | Lars Herbst | lherbst@outlook.de | Oldenburg | Germany |
| 5 | Debra Patterson | debra.patterson@yahoo.com | Pittsburgh | United States |
| 16 | Adrian Peters | adrianpeters@outlook.de | Essen | Germany |
| 17 | Yuji Yamamoto | yuji.yamamoto51@docomo.ne.jp | Chiba | Japan |
| 18 | Matteo Bishop | matteo.bishop18@mail.com | Brooklyn | United States |
| 19 | Robert Martin | robert636@gmail.com | Philadelphia | United States |
| 20 | Barbara Simpson | bsimpson56@outlook.com | Rochester | United States |
By default, rows from different countries are shuffled (set shuffle=False to keep them grouped by country instead).
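To get a feel for what weighted proportions imply at a given n, the sketch below splits 20 rows across the three countries using a largest-remainder rule, then shuffles the labels. This is one plausible way to turn weights into row counts, offered as an assumption: Pointblank may sample per row or round differently, and allocate_rows is a hypothetical helper.

```python
import math
import random


def allocate_rows(n, weights):
    """Split n rows across countries in proportion to their weights."""
    total = sum(weights.values())
    exact = {c: n * w / total for c, w in weights.items()}
    # Floor each share (the epsilon guards against float noise like 2.9999999)
    counts = {c: math.floor(x + 1e-9) for c, x in exact.items()}
    # Hand any leftover rows to the largest fractional remainders
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


counts = allocate_rows(20, {"US": 0.60, "DE": 0.25, "JP": 0.15})
print(counts)  # → {'US': 12, 'DE': 5, 'JP': 3}

# Shuffle analogue: interleave the per-country blocks reproducibly
labels = [c for c, k in counts.items() for _ in range(k)]
random.Random(23).shuffle(labels)
print(labels[:5])
```

Skipping the final shuffle leaves the labels grouped by country, which is the grouped layout the shuffle=False option describes.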
This kind of multinational dataset is really valuable in practice. If you’re building a global e-commerce platform, you need test data that reflects customers in multiple regions. Other uses include: fintech applications processing cross-border transactions, logistics companies tracking shipments through different postal systems, and SaaS products localizing their onboarding flows. All of these use cases can benefit from synthetic data that accurately represents the countries involved, rather than defaulting to US-only placeholders.
The three coherence systems
I touched a bit on coherence earlier, but it’s worth spelling out explicitly because it’s one of the things that separates Pointblank’s generator from a bag of random values.
The package applies three coherence systems automatically based on which presets you include.
Person coherence
When name, first_name, last_name, email, or user_name presets appear together, emails and usernames are derived from the person’s actual name.
Address coherence
When city, state, postcode, phone_number, latitude, longitude, or license_plate presets appear together, all values are consistent for the same geographic location within each row.
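A common way to get this kind of row-level consistency is to draw one location record per row and read every location column from that record, instead of drawing each column on its own. The sketch below contrasts the two approaches with the standard library. The lookup table holds a few city/state pairs taken from the tables in this post, and both functions are hypothetical illustrations rather than Pointblank’s internals.

```python
import random

# Tiny illustrative lookup; city/state pairs are taken from the tables above
LOCATIONS = [
    {"city": "Lubbock", "state": "Texas"},
    {"city": "Denver", "state": "Colorado"},
    {"city": "Seattle", "state": "Washington"},
    {"city": "Charlotte", "state": "North Carolina"},
]


def coherent_rows(n, seed):
    """Pick one location record per row, then read every field from it."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        place = rng.choice(LOCATIONS)  # one draw per row, shared by all columns
        rows.append({"city": place["city"], "state": place["state"]})
    return rows


def independent_rows(n, seed):
    """Counter-example: drawing each column separately breaks coherence."""
    rng = random.Random(seed)
    cities = [p["city"] for p in LOCATIONS]
    states = [p["state"] for p in LOCATIONS]
    return [{"city": rng.choice(cities), "state": rng.choice(states)} for _ in range(n)]


valid_pairs = {(p["city"], p["state"]) for p in LOCATIONS}
good = coherent_rows(200, seed=23)
bad = independent_rows(200, seed=23)

print(all((r["city"], r["state"]) in valid_pairs for r in good))  # → True
print(sum((r["city"], r["state"]) not in valid_pairs for r in bad), "mismatched rows")
```

Extending the record with postcode ranges, area codes, or coordinates keeps those columns in step for the same reason: they all come from a single draw.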
Business coherence
When both job and company appear, they’re drawn from the same industry. If name_full is also present, people in certain professions get appropriate titles (Dr., Prof., etc.), and any integer field for age is automatically constrained to a realistic working range of 22–65.
An example that makes use of all three types
Here’s a more comprehensive example with many uses of string_field(preset=):
schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    postcode=pb.string_field(preset="postcode"),
    age=pb.int_field(),
)
pb.preview(pb.generate_dataset(schema, n=12, seed=23))

Polars · Rows: 12 · Columns: 8

| | name (String) | email (String) | company (String) | job (String) | city (String) | state (String) | postcode (String) | age (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Mr. Leo Stevens | leo_stevens@gmail.com | Creative Software Digital | System Administrator | Lubbock | Texas | 79456 | 40 |
| 2 | Rev. Archer Ross | archer.ross@hotmail.com | Anaheim Freight Services | Buyer | Anaheim | California | 92860 | 27 |
| 3 | Mrs. Carolyn Gonzales | carolyn626@protonmail.com | Premier Technologies Solutions | System Administrator | Phoenix | Arizona | 85005 | 23 |
| 4 | Mr. Walter Peters | walter.peters@gmail.com | Costa Legal Services | Attorney | Denver | Colorado | 80267 | 59 |
| 5 | Mr. Everett King | everettking@aol.com | San Antonio School District | Teacher | San Antonio | Texas | 78229 | 41 |
| 8 | Dr. Christopher Crawford | christopher.crawford29@aol.com | Harris Medical Group | Nurse | Irvine | California | 92604 | 55 |
| 9 | Mrs. Katherine Flores | katherine545@protonmail.com | CVS Health | Nurse | Seattle | Washington | 98172 | 44 |
| 10 | Mr. Zachary Wright | zachary_wright@aol.com | Wood & Woods | Electrical Engineer | Denver | Colorado | 80265 | 30 |
| 11 | Mr. Russell Hawkins | r_hawkins@mail.com | Baltimore Grand Hotel | Event Coordinator | Baltimore | Maryland | 21297 | 34 |
| 12 | Mrs. Julia Powell | julia_powell@outlook.com | Los Angeles Academy | Librarian | Los Angeles | California | 90008 | 39 |
Notice the professional titles on some names, the consistent city/state/postcode combinations, and the age values falling within a plausible working range.
Profile fields: The fast path
For the very common case of generating person-centric data, profile_fields() provides a shortcut. It returns a dictionary of pre-configured StringField objects that you unpack into a schema:
schema = pb.Schema(
    **pb.profile_fields(set="standard"),
    account_id=pb.int_field(min_val=1, unique=True),
)
pb.preview(pb.generate_dataset(schema, n=10, seed=23))

Polars · Rows: 10 · Columns: 8

| | first_name (String) | last_name (String) | email (String) | city (String) | state (String) | postcode (String) | phone_number (String) | account_id (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Patricia | Williams | patricia_williams@yandex.com | Lubbock | Texas | 79420 | (713) 225-8632 | 7188536481533917197 |
| 2 | Andrea | Mitchell | a_mitchell@gmail.com | Anaheim | California | 92875 | (323) 788-1387 | 2674009078779859984 |
| 3 | Maria | Valentine | maria.valentine54@gmail.com | Phoenix | Arizona | 85062 | (928) 605-6026 | 7652102777077138151 |
| 4 | Virginia | Walker | virginia.walker@outlook.com | Denver | Colorado | 80296 | (720) 227-6164 | 157503859921753049 |
| 5 | Brenda | Lopez | b_lopez@yahoo.com | San Antonio | Texas | 78213 | (972) 488-4413 | 2829213282471975080 |
| 6 | Lauren | Davis | l_davis@outlook.com | New York | New York | 10084 | (212) 960-7964 | 3497364383162086858 |
| 7 | John | West | j_west@zoho.com | Charlotte | North Carolina | 28266 | (910) 854-4526 | 3302703640991750415 |
| 8 | Claire | Jackson | claire202@outlook.com | Irvine | California | 92648 | (310) 878-4841 | 6695746877064448147 |
| 9 | Ariana | Wood | ariana_wood@zoho.com | Seattle | Washington | 98198 | (360) 542-8519 | 2466163118311913924 |
| 10 | Michael | Simmons | michaelsimmons@mail.com | Denver | Colorado | 80204 | (970) 349-7004 | 129827878195925732 |
The "standard" set includes first_name, last_name, email, city, state, postcode, and phone_number. There’s also "minimal" (just name, email, and phone) and "full" (adds address, company, and job). You can further customize with include= and exclude= parameters to add or remove specific fields.
Regex patterns for structured strings
When none of the built-in presets fit, string_field() also accepts a pattern= parameter for regex-based generation. Pointblank’s regex engine supports character classes, quantifiers, alternation, and groups:
schema = pb.Schema(
    sku=pb.string_field(pattern=r"SKU-[A-Z]{2}-[0-9]{5}"),
    tracking=pb.string_field(pattern=r"1Z[0-9]{4}[A-Z]{2}[0-9]{8}"),
    code=pb.string_field(pattern=r"(ALPHA|BETA|GAMMA)-[0-9]{3}"),
)
pb.preview(pb.generate_dataset(schema, n=8, seed=23))

Polars · Rows: 8 · Columns: 3

| | sku (String) | tracking (String) | code (String) |
|---|---|---|---|
| 1 | SKU-CA-66852 | 1Z1094MQ23470397 | BETA-094 |
| 2 | SKU-IO-39701 | 1Z1176QU50309529 | BETA-852 |
| 3 | SKU-WP-08650 | 1Z5959VK72797222 | GAMMA-470 |
| 4 | SKU-ZB-29359 | 1Z8391DN94949515 | ALPHA-011 |
| 5 | SKU-SJ-91727 | 1Z8478IE91735829 | GAMMA-608 |
| 6 | SKU-VU-22858 | 1Z6270UN02303087 | GAMMA-503 |
| 7 | SKU-SD-16094 | 1Z5067BC78374311 | ALPHA-293 |
| 8 | SKU-SK-54847 | 1Z8834NF75629613 | GAMMA-959 |
This is useful for generating product codes, tracking numbers, internal identifiers, or any string that follows a predictable format.
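A quick way to convince yourself that generated strings really follow their patterns is to run the same regexes back over the output with Python’s re.fullmatch. The sketch below does that for the first three rows of the preview above; the helper name mismatches is my own and not part of Pointblank.

```python
import re

patterns = {
    "sku": r"SKU-[A-Z]{2}-[0-9]{5}",
    "tracking": r"1Z[0-9]{4}[A-Z]{2}[0-9]{8}",
    "code": r"(ALPHA|BETA|GAMMA)-[0-9]{3}",
}

# First three rows of the preview above
sample = [
    {"sku": "SKU-CA-66852", "tracking": "1Z1094MQ23470397", "code": "BETA-094"},
    {"sku": "SKU-IO-39701", "tracking": "1Z1176QU50309529", "code": "BETA-852"},
    {"sku": "SKU-WP-08650", "tracking": "1Z5959VK72797222", "code": "GAMMA-470"},
]


def mismatches(rows, patterns):
    """List (row_index, column, value) for values that fail their pattern."""
    bad = []
    for i, row in enumerate(rows):
        for col, pattern in patterns.items():
            # fullmatch requires the whole string to match, not just a prefix
            if re.fullmatch(pattern, row[col]) is None:
                bad.append((i, col, row[col]))
    return bad


print(mismatches(sample, patterns))  # → []
```

Using fullmatch rather than match matters here: match would accept a value with trailing junk after a valid prefix.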
Categorical columns and nullable fields
For columns drawn from a fixed set of values, use the allowed= parameter:
schema = pb.Schema(
    plan=pb.string_field(allowed=["Free", "Pro", "Enterprise"]),
    region=pb.string_field(allowed=["AMER", "EMEA", "APAC"]),
    satisfaction=pb.int_field(allowed=[1, 2, 3, 4, 5]),
    notes=pb.string_field(preset="user_agent", nullable=True, null_probability=0.3),
)
pb.preview(pb.generate_dataset(schema, n=12, seed=23))

Polars · Rows: 12 · Columns: 4

| | plan (String) | region (String) | satisfaction (Int64) | notes (String) |
|---|---|---|---|---|
| 1 | Pro | EMEA | 3 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/120.0.0.0 |
| 2 | Free | AMER | 1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 14_6_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15 |
| 3 | Free | AMER | 1 | None |
| 4 | Enterprise | APAC | 5 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 |
| 5 | Pro | EMEA | 3 | None |
| 8 | Enterprise | APAC | 5 | Mozilla/5.0 (Macintosh; Intel Mac OS X 15_0_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15 |
| 9 | Pro | EMEA | 3 | None |
| 10 | Free | AMER | 2 | Mozilla/5.0 (Linux; Android 15; SM-S911B) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/28.0 Chrome/122.0.0.0 Mobile Safari/537.36 |
| 11 | Enterprise | APAC | 2 | Mozilla/5.0 (Macintosh; Intel Mac OS X 15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15 |
| 12 | Free | AMER | 3 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 |
The nullable=True and null_probability= parameters let you introduce realistic missing data. About 30% of the notes values will be null.
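The "about 30%" is a probability, not a quota, so small samples will drift around it. The standard-library sketch below shows the idea behind a per-value null probability; with_nulls is a hypothetical helper, and I am assuming (not asserting) that each value is nulled independently.

```python
import random


def with_nulls(values, null_probability, seed):
    """Replace each value with None independently with the given probability."""
    rng = random.Random(seed)
    return [None if rng.random() < null_probability else v for v in values]


notes = with_nulls([f"note {i}" for i in range(10_000)], 0.3, seed=23)
null_share = sum(v is None for v in notes) / len(notes)
print(round(null_share, 2))

# With only 12 values, the observed share can land noticeably off 0.3
small = with_nulls([f"note {i}" for i in range(12)], 0.3, seed=23)
print(sum(v is None for v in small), "of 12 are null")
```

If a test depends on a minimum number of nulls being present, it is safer to assert on a large sample or to check for "at least one" rather than an exact share.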
Frequency-weighted sampling
By default, Pointblank uses frequency-weighted sampling for names and cities (weighted=True). This means you’ll see common names like "James" or "Maria" appearing more often than rare ones, following a four-tier distribution: very common (45%), common (30%), uncommon (20%), and rare (5%).
This produces datasets that feel more realistic than a uniform random draw. If you want every name to have an equal chance of appearing, set weighted=False.
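Here is a small standard-library sketch of what a four-tier draw could look like: pick a tier using the 45/30/20/5 weights, then pick a name uniformly within that tier. The tier weights come from the description above, but the assignment of names to tiers is made up for illustration and sample_names is a hypothetical function.

```python
import random
from collections import Counter

# Tier weights from the text; the names in each tier are made up for illustration
TIERS = {
    "very_common": (0.45, ["James", "Maria"]),
    "common": (0.30, ["Claire", "Walter"]),
    "uncommon": (0.20, ["Everett", "Willow"]),
    "rare": (0.05, ["Ignaz", "Thea"]),
}


def sample_names(n, seed, weighted=True):
    """Draw (tier, name) pairs; weighted=False gives every tier equal weight."""
    rng = random.Random(seed)
    tier_names = list(TIERS)
    if weighted:
        weights = [TIERS[t][0] for t in tier_names]
    else:
        weights = [1.0] * len(tier_names)
    tiers = rng.choices(tier_names, weights=weights, k=n)
    return [(t, rng.choice(TIERS[t][1])) for t in tiers]


draws = sample_names(20_000, seed=23)
tier_share = {t: c / len(draws) for t, c in Counter(t for t, _ in draws).items()}
for tier in TIERS:
    print(f"{tier:12} {tier_share[tier]:.3f}")
```

Because every tier in this toy table holds the same number of names, switching to weighted=False makes each name equally likely, which mirrors the uniform behavior described above.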
A larger example: Event log data
So far, we’ve focused on person and business data, but generate_dataset() handles temporal and numeric types just as well. Let’s build a simulated event log, the kind of table you’d see behind a product analytics dashboard. This schema brings together several field types we haven’t combined yet: datetime_field() for timestamps, duration_field() for session lengths, bool_field() for success/failure flags, and the "ipv4" string preset for IP addresses.
The allowed= parameter on string_field() is doing the work of defining the event vocabulary. Rather than generating random strings, it draws uniformly from the list of actions we provide, giving us a clean categorical column.
from datetime import datetime, timedelta
schema = pb.Schema(
    event_id=pb.int_field(min_val=1, unique=True),
    user_id=pb.int_field(min_val=1000, max_val=1050),
    action=pb.string_field(
        allowed=["page_view", "click", "purchase", "signup", "logout"]
    ),
    timestamp=pb.datetime_field(
        min_date=datetime(2025, 11, 1),
        max_date=datetime(2025, 11, 30, 23, 59, 59),
    ),
    duration=pb.duration_field(
        min_duration=timedelta(seconds=1),
        max_duration=timedelta(minutes=10),
    ),
    success=pb.bool_field(p_true=0.92),
    ip_address=pb.string_field(preset="ipv4"),
)
events = pb.generate_dataset(schema, n=40, seed=23)
pb.preview(events)

Polars · Rows: 40 · Columns: 7

| | event_id (Int64) | user_id (Int64) | action (String) | timestamp (Datetime) | duration (Duration) | success (Boolean) | ip_address (String) |
|---|---|---|---|---|---|---|---|
| 1 | 7188536481533917197 | 1049 | purchase | 2025-11-15 01:46:38 | 0:04:57 | False | 148.42.8.157 |
| 2 | 2674009078779859984 | 1018 | page_view | 2025-11-05 01:20:36 | 0:01:26 | False | 216.194.183.66 |
| 3 | 7652102777077138151 | 1005 | page_view | 2025-11-01 19:53:44 | 0:00:18 | True | 98.136.227.7 |
| 4 | 157503859921753049 | 1001 | logout | 2025-11-29 17:45:42 | 0:05:15 | True | 113.232.12.54 |
| 5 | 2829213282471975080 | 1037 | purchase | 2025-11-15 21:22:57 | 0:07:14 | True | 43.255.215.10 |
| 36 | 6232456323939446652 | 1002 | logout | 2025-11-28 18:28:58 | 0:05:22 | True | 41.215.141.245 |
| 37 | 1508803708693178976 | 1037 | purchase | 2025-11-14 15:42:28 | 0:09:47 | True | 90.152.135.44 |
| 38 | 7369527199060817792 | 1023 | logout | 2025-11-16 06:17:31 | 0:01:28 | True | 115.31.254.193 |
| 39 | 4921468493992610632 | 1042 | purchase | 2025-11-28 19:34:35 | 0:08:06 | True | 9.233.210.149 |
| 40 | 6210729776073352921 | 1011 | purchase | 2025-11-05 03:53:23 | 0:03:02 | True | 163.208.178.154 |
What we get is 40 rows of event data spread across November 2025. Each row has a unique event ID, a user ID drawn from a small pool (simulating repeat visitors), a random action, a timestamp within our date window, a session duration between 1 second and 10 minutes, a success flag that’s True about 92% of the time, and a plausible IPv4 address. All from a single generate_dataset() call.
Because the output is a Polars DataFrame, we can immediately run aggregations on it. Here’s a quick summary grouped by action type, showing the count of events, the average success rate, and the mean duration:
pb.preview(
    events
    .group_by("action")
    .agg(
        pl.col("event_id").count().alias("count"),
        pl.col("success").mean().round(2).alias("success_rate"),
        pl.col("duration").mean().alias("avg_duration"),
    )
    .sort("count", descending=True)
)

Polars · Rows: 5 · Columns: 4

| | action (String) | count (UInt32) | success_rate (Float64) | avg_duration (Duration) |
|---|---|---|---|---|
| 1 | purchase | 10 | 0.9 | 0:05:44.800000 |
| 2 | page_view | 9 | 0.89 | 0:03:44.222222 |
| 3 | logout | 8 | 1.0 | 0:04:20.750000 |
| 4 | signup | 7 | 0.86 | 0:03:57.285714 |
| 5 | click | 6 | 1.0 | 0:06:53.500000 |
This is the sort of exploratory analysis you might do while building a reporting pipeline or testing a dashboard query. The synthetic data gives you something to run your code against before the real event stream is available.
Validating what you generate
Pointblank started as a data validation library, and data generation turns out to be a natural extension of that core mission. The two capabilities complement each other quite well: the same Schema object that describes what your data should look like can also produce data that does look like that. This means you can build validation logic and test it against controlled synthetic inputs, all within one consistent API.
There’s a satisfying loop to this workflow. You define a schema, generate data from it, and then validate that the data meets your expectations. Here we generate 100 rows with a Field-based schema, then verify the structure with col_schema_match() using a dtype-based schema, and add a few value-level checks on top:
gen_schema = pb.Schema(
    id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    active=pb.bool_field(),
)
test_data = pb.generate_dataset(gen_schema, n=100, seed=23)

# A dtype-based schema for structural validation
val_schema = pb.Schema(
    id="Int64",
    name="String",
    score="Float64",
    active="Boolean",
)

validation = (
    pb.Validate(data=test_data)
    .col_schema_match(schema=val_schema)
    .col_vals_between(columns="score", left=0.0, right=100.0)
    .col_vals_not_null(columns="name")
    .col_vals_gt(columns="id", value=0)
    .rows_distinct(columns_subset="id")
    .interrogate()
)
validation

Pointblank Validation
2026-04-13|17:29:09 · Polars

| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 col_schema_match() | — | SCHEMA | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | — |
| 2 col_vals_between() | score | [0.0, 100.0] | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 3 col_vals_not_null() | name | — | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 4 col_vals_gt() | id | 0 | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 5 rows_distinct() | id | — | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |

Start: 2026-04-13 17:29:09 UTC · Duration: < 1 s · End: 2026-04-13 17:29:09 UTC

Notes: Step 1 (schema_check) ✓ Schema validation passed.
The generated data should pass all checks, giving you a clean baseline for your validation logic. In practice, this is how you’d develop and refine validation rules before pointing them at real data: generate a known-good dataset, confirm your checks pass, then swap in the production table and see what fails. Having generation and validation in the same package makes that iteration cycle very tight.
Wrapping up
Synthetic data generation sits at the intersection of several real needs: testing, prototyping, teaching, and privacy. Pointblank’s generate_dataset() tries to make it practical by handling the tedious parts automatically (type-appropriate random values, coherent cross-column relationships, country-specific formatting) so you can focus on the shape of the data you actually need.
Define a schema, call generate_dataset(), and you have a DataFrame ready to go. That sort of simplicity matters when you need data but can’t use the real thing. If you’d like to explore further, the Pointblank website has extensive documentation on data generation, including a dedicated User Guide section and full API documentation for every field type and function covered here.