
People cannot spell for s***. And datetimes are very difficult to format consistently for various reasons. Date inference is genuinely hard—there’s no universal standard, and everyone has their own idea of what “01/02/2023” means (January 2nd or February 1st?). This is my personal battle.

There’s a reason this meme is one of the most upvoted posts on r/dataengineering. We’ve all been there.

Data engineer trying to normalize all possible formats of a datetime field

The AWS Glue Migration That Wasn’t

Here’s something I dealt with recently. We were migrating 50+ tables from MySQL to Postgres using AWS Glue. Should be straightforward—Glue’s built for this. You set up the connection, map the tables, hit run, grab coffee.

Table #37 fails. Then #41. Then #44.

The error message is useless: ERROR: invalid input syntax for type timestamp. Cool. Which timestamp? Which row? Which of the 47 datetime columns in this table?

You’re dealing with 10 million rows. And ONE row—ONE!!!—is causing the entire migration to fail.

You dig into the source data. The created_at column has entries like:

2023-01-15 14:30:00   -- Standard MySQL datetime
01/15/2023 2:30 PM    -- Someone's Excel export
January 15, 2023      -- Marketing team entry
2023-1-15             -- Missing zero padding
15-Jan-23             -- European contractor
NULL                  -- Actually fine
""                    -- Empty string (NOT fine)
Q1 2023               -- Why? Just... why?

MySQL doesn’t care. It’s got loose type coercion. Postgres? Postgres will not accept this nonsense.

PySpark Won’t Save You

“Fine,” you think. “I’ll just use PySpark to clean this first.”

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()
df = spark.read.jdbc(url="jdbc:mysql://...", table="orders")

# Try to cast it
df = df.withColumn("created_at", F.col("created_at").cast(TimestampType()))

PySpark can’t infer it. Which is bull****, by the way. It just returns NULL for anything it doesn’t understand. No error. No warning. Just silent data loss.

So now you’re writing custom parsing logic.

What I Actually Tried

Attempt 1: “Spark’s built-in functions will handle this”

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

df = df.withColumn("created_at", F.col("created_at").cast(TimestampType()))

Result: 30% of values become NULL. No error. No warning. Just gone.
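
Before going any further, it’s worth measuring the damage, because the cast won’t tell you. A quick sanity check, sketched against this table’s columns:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

# Count rows the cast silently nulled out: non-NULL before, NULL after.
casted = df.withColumn("created_at_ts", F.col("created_at").cast(TimestampType()))
silently_lost = casted.filter(
    F.col("created_at").isNotNull() & F.col("created_at_ts").isNull()
).count()
print(f"Rows silently nulled by the cast: {silently_lost}")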

“Fine, I’ll try to_timestamp() with format strings:”

df = df.withColumn("created_at",
    F.to_timestamp(F.col("created_at"), "yyyy-MM-dd HH:mm:ss"))

Result: 70% become NULL because they’re not in that exact format.

“I’ll try multiple formats!”

df = df.withColumn("created_at",
    F.coalesce(
        F.to_timestamp(F.col("created_at"), "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp(F.col("created_at"), "MM/dd/yyyy"),
        F.to_timestamp(F.col("created_at"), "dd/MM/yyyy"),
        F.to_timestamp(F.col("created_at"), "yyyy-MM-dd")
    ))

Now I have 8% NULL. The other 92% parsed! But which format matched which row? No idea. And “January 15, 2023” is still NULL.

Added 12 more formats to the coalesce(). The query plan is now unreadable. Execution time: 45 minutes for 10 million rows.
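
If you want to know which pattern matched which row (I eventually did), one workaround is to tag each row with the first format that parses. A sketch in the same brute-force spirit:

from pyspark.sql import functions as F

formats = ["yyyy-MM-dd HH:mm:ss", "MM/dd/yyyy", "dd/MM/yyyy", "yyyy-MM-dd"]

# Build a nested when/otherwise so the first matching format wins.
matched = F.lit(None).cast("string")
for fmt in reversed(formats):
    matched = (F.when(F.to_timestamp(F.col("created_at"), fmt).isNotNull(), F.lit(fmt))
                .otherwise(matched))

df.withColumn("matched_format", matched).groupBy("matched_format").count().show()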

Attempt 2: “I’ll just write a UDF”

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from datetime import datetime
import dateutil.parser

@udf(returnType=StringType())
def parse_datetime(value):
    if not value:
        return None
    try:
        return dateutil.parser.parse(value).strftime('%Y-%m-%d %H:%M:%S')
    except:
        return None

df = df.withColumn("created_at", parse_datetime(F.col("created_at")))

Execution time: 6 hours.

Why? Because UDFs serialize every row out to Python, parse it, and serialize it back. For 10 million rows. On distributed data. The Spark UI shows I’m not even using the cluster; everything is bottlenecked on Python serialization.
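
You don’t even need the Spark UI to see it. The physical plan shows the round-trip directly: regular Python UDFs appear as a BatchEvalPython node, pandas UDFs as ArrowEvalPython.

# The Python round-trip is visible in the physical plan.
df.withColumn("created_at", parse_datetime(F.col("created_at"))).explain()
# Look for BatchEvalPython (plain UDF) or ArrowEvalPython (pandas UDF) nodes.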

“Fine, I’ll use pandas UDFs, those are faster:”

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(StringType())
def parse_datetime_pandas(s: pd.Series) -> pd.Series:
    return pd.to_datetime(s, errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')

Execution time: 2 hours. Better! But still 8% NULL. Still no idea which rows failed or why.
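
The least you can do at this point is pull out the distinct values that refused to parse, so the failures are at least inspectable. A sketch, reusing the pandas UDF above:

# Keep the raw value next to the parsed one and look at what failed.
parsed = df.withColumn("created_at_parsed", parse_datetime_pandas(F.col("created_at")))
failures = parsed.filter(
    F.col("created_at").isNotNull() & F.col("created_at_parsed").isNull()
)
failures.select("created_at").distinct().show(50, truncate=False)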

Attempt 3: “I’ll do multiple passes with different UDFs”

# Pass 1: Standard formats
df = df.withColumn("created_at_clean", parse_udf_standard(F.col("created_at")))

# Pass 2: Fix nulls with named months
df = df.withColumn("created_at_clean",
    F.when(F.col("created_at_clean").isNull(),
        parse_udf_named_months(F.col("created_at")))
    .otherwise(F.col("created_at_clean")))

# Pass 3: Fix nulls with European formats
df = df.withColumn("created_at_clean",
    F.when(F.col("created_at_clean").isNull(),
        parse_udf_european(F.col("created_at")))
    .otherwise(F.col("created_at_clean")))

# Pass 4: ...you get the idea

  • Execution time: 4 hours (multiple full scans of 10 million rows)
  • Code length: 600 lines across 8 different UDFs
  • Success rate: 99.1%
  • My mental state: Broken

Attempt 4: “I’ll just regex the dates out”

# Extract ISO format dates
df = df.withColumn("extracted",
    F.regexp_extract(F.col("created_at"), r'(\d{4}-\d{2}-\d{2})', 1))

# Extract MM/DD/YYYY
df = df.withColumn("extracted",
    F.when(F.col("extracted") == "",
        F.regexp_extract(F.col("created_at"), r'(\d{2})/(\d{2})/(\d{4})', 0))
    .otherwise(F.col("extracted")))

This extracted dates but didn’t parse them into the right format. 01/15/2023 stayed as 01/15/2023. I need it as 2023-01-15.

Now I need to parse what I extracted. Back to F.to_timestamp(). Back to NULLs.
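
In hindsight, the reshuffling itself can stay in native Spark: regexp_replace with capture groups reorders MM/DD/YYYY into ISO. It still only covers one format at a time, which is the whole problem, but here’s the sketch:

# Reorder MM/DD/YYYY capture groups into YYYY-MM-DD ($3 = year, $1 = month, $2 = day).
df = df.withColumn("extracted",
    F.regexp_replace(F.col("extracted"), r'^(\d{2})/(\d{2})/(\d{4})$', '$3-$1-$2'))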

Attempt 5: “I’ll just dump to pandas and fix it there”

# Collect to pandas
pdf = df.toPandas()

# Fix in pandas
pdf['created_at'] = pd.to_datetime(pdf['created_at'], errors='coerce')

# Back to Spark
df = spark.createDataFrame(pdf)

Memory error. 10 million rows don’t fit in memory on the driver node. Typical.

“Fine, I’ll do it in partitions:”

def fix_partition(iterator):
    for pdf in iterator:
        pdf['created_at'] = pd.to_datetime(pdf['created_at'], errors='coerce')
        yield pdf

df = df.mapInPandas(fix_partition, schema=df.schema)

This works, but it takes 3 hours and still leaves 8% of the values NULL.

Attempt 6: The 2AM Abomination

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd
from dateutil import parser
import re

@pandas_udf(StringType())
def parse_datetime_nuclear_option(s: pd.Series) -> pd.Series:
    def parse_single(value):
        if pd.isna(value) or value == "":
            return None

        # Try pandas first (fast)
        try:
            return pd.to_datetime(value).strftime('%Y-%m-%d %H:%M:%S')
        except:
            pass

        # Try dateutil (slow but flexible)
        try:
            return parser.parse(str(value), fuzzy=True).strftime('%Y-%m-%d %H:%M:%S')
        except:
            pass

        # Try regex extraction for ISO
        match = re.search(r'\d{4}-\d{2}-\d{2}', str(value))
        if match:
            return match.group(0) + ' 00:00:00'

        # Try named months
        months = {'january': '01', 'jan': '01', 'february': '02', ...}
        # ... 40 more lines of string manipulation

        # Try quarter notation
        if 'Q' in str(value):
            # ... 20 more lines

        return None

    return s.apply(parse_single)

df = df.withColumn("created_at", parse_datetime_nuclear_option(F.col("created_at")))

This is so frustrating. Just kill me.

  • Execution time: 8 hours
  • Code length: 200 lines in a single UDF
  • Success rate: 99.4%
  • The remaining 0.6%: Truly cursed data like “FY Q3 2023”, “sometime in january”, and my personal favorite: “2023-13-45” (month 13, day 45—someone just mashing numbers)
  • Cost: $47 in AWS Glue DPU hours

At this point it’s 4am. The migration is still failing. I have a 200-line UDF that takes 8 hours to run and still doesn’t work for all rows. And I still have 12 other tables with datetime columns to fix.

This is when I built DataCompose.

The fix is stupidly simple: stop using UDFs.

All those attempts failed because UDFs serialize data to Python and back. That’s where the 8 hours went. If you stay in native PySpark, Spark actually does its job.

pip install datacompose
datacompose add datetimes

from transformers.pyspark.datetimes import datetimes

df = df.withColumn("created_at", datetimes.standardize_iso(F.col("created_at")))

3 minutes. Same 10 million rows. Same 50+ format variations. 3 minutes.

It’s just a big F.coalesce() with a ton of F.to_timestamp() calls and regex extractions, all native PySpark. No Python serialization, no row-by-row nonsense. Spark’s Catalyst optimizer actually gets to do its thing.

That’s the whole idea behind primitives—small, single-purpose functions that stay in native PySpark. You can chain them together, but each one just does one thing. standardize_iso parses dates. is_valid_date validates them. extract_year pulls out the year. Boring, but that’s the point.
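
Chained together it looks something like this (a sketch, assuming the primitives behave the way their names suggest):

df = (
    df
    .withColumn("created_at_iso", datetimes.standardize_iso(F.col("created_at")))
    .withColumn("created_at_ok", datetimes.is_valid_date(F.col("created_at")))   # flag garbage rows
    .withColumn("created_year", datetimes.extract_year(F.col("created_at_iso")))
)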

What’s Actually Happening

When you call datetimes.standardize_iso(col), you’re not calling a UDF. You’re getting back a PySpark Column expression. Something like:

F.coalesce(
    F.to_timestamp(col, "yyyy-MM-dd HH:mm:ss"),
    F.to_timestamp(col, "yyyy-MM-dd'T'HH:mm:ss"),
    F.to_timestamp(col, "MM/dd/yyyy HH:mm:ss"),
    F.to_timestamp(col, "MM/dd/yyyy"),
    # ... 50 more patterns
    parse_named_month(col),  # "January 15, 2023" → regex + reconstruction
    parse_european(col),      # "15/01/2023" → regex + reconstruction
)

It’s the same brute-force approach I was doing manually, but:

  1. It’s already written and tested
  2. It runs in the JVM, not Python
  3. Spark sees the whole thing as one expression and optimizes it

No magic. Just code I don’t have to write again.
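
Which also means the usual sanity check works: run explain() on the result and there are no BatchEvalPython or ArrowEvalPython nodes waiting to ruin your evening, just expressions.

# No Python eval nodes in the plan, just expressions Catalyst can optimize.
df.withColumn("created_at", datetimes.standardize_iso(F.col("created_at"))).explain()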

If you’re curious what a primitive looks like, here’s a simple one:

from transformers.pyspark.utils import PrimitiveRegistry
from pyspark.sql import functions as F

datetimes = PrimitiveRegistry("datetimes")

@datetimes.register()
def extract_year(col):
    """Pull the year out of a date string."""
    return F.year(F.to_date(col))

That’s it. The @register() decorator wraps it so you can call datetimes.extract_year(F.col("date")) and get back a Column expression.

You can write your own primitives the same way. If your company has weird date formats that only exist in your legacy systems, you add a primitive for it. It takes a column input and returns a column output.
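
For example, here’s a hypothetical primitive for a legacy system that stores dates as bare YYYYMMDD strings:

@datetimes.register()
def parse_legacy_yyyymmdd(col):
    """Parse bare '20230115'-style strings into proper dates."""
    # Hypothetical example: swap the format string for whatever your legacy system emits.
    return F.to_date(col, "yyyyMMdd")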

And yeah, “Q1 2023” and “FY2024” work too:

df = df.withColumn("parsed", datetimes.parse_flexible(F.col("created_at")))
# "Q3 2024" → "2024-07-01"
# "FY2024"  → "2023-10-01"

The 0.6% of truly cursed data like “2023-13-45”? Flagged by is_valid_date, dealt with manually. Which is what should happen with garbage data.
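
In practice that means setting the leftovers aside for a human instead of letting one row kill the migration. A sketch, assuming is_valid_date returns a boolean column:

parsed = df.withColumn("created_at_iso", datetimes.standardize_iso(F.col("created_at")))

# Anything non-empty that still won't validate gets set aside for manual review.
cursed = parsed.filter(
    F.col("created_at").isNotNull() & ~datetimes.is_valid_date(F.col("created_at"))
)
cursed.select("created_at").distinct().show(truncate=False)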

If you want to try it yourself:

pip install datacompose
datacompose add datetimes

The datetimes docs have the full API.