The Texas Railroad Commission publishes oil and gas permit data, but it is not easy to use.
The data comes as large fixed-width text files, buried behind a website that requires clicking through downloads. There is no clean API. There is no database. If you want to analyze it, you first have to build your own pipeline.
This project does exactly that.
At a high level, the system does four things:
Automatically download the raw files
Parse fixed-width data into tables
Store everything in SQLite
Expose the data through a FastAPI service
I’ll walk through each step.
The Railroad Commission distributes permit data as ZIP files that must be downloaded through a browser flow. A simple requests.get() will not work.
To solve this, the project uses Selenium to automate Chrome.
The logic lives in backend/src/parsers/pulldata.py.
The key idea is:
Launch a headless browser
Navigate to the RRC download page
Click the download links
Wait for the ZIP files to finish downloading
Here is a simplified version of what that looks like:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get(BASE_URL)  # the RRC download page

download_link = driver.find_element(By.PARTIAL_LINK_TEXT, "Permits")
download_link.click()
The script then waits until Chrome finishes downloading all files before moving on.
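One common way to implement that wait (a sketch, not necessarily the exact logic in pulldata.py) is to poll the download directory until Chrome's temporary .crdownload files disappear:

import time
from pathlib import Path

def wait_for_downloads(download_dir: str, timeout: int = 300) -> None:
    """Block until no in-progress Chrome downloads (.crdownload files) remain."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not list(Path(download_dir).glob("*.crdownload")):
            return
        time.sleep(1)
    raise TimeoutError("Downloads did not finish within the timeout")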
Once complete, all ZIP files are saved locally.
After download, the ZIP files are automatically unpacked.
This step ensures that the pipeline always works with clean, extracted .DAT files and avoids manual cleanup.
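A minimal version of that unpack step (a sketch; the function name and directory arguments are illustrative, not taken from the repo) looks like this:

import zipfile
from pathlib import Path

def extract_all(zip_dir: str, out_dir: str) -> None:
    """Unpack every downloaded ZIP so the parser only ever sees extracted .DAT files."""
    for zip_path in Path(zip_dir).glob("*.zip"):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(out_dir)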
This makes the process fully repeatable:
Delete old files
Download fresh data
Unzip everything
Process from scratch
That repeatability is important later when refreshing the database.
This is the core of the project.
The Railroad Commission uses fixed-width files, meaning:
Columns are defined by character positions
There are no delimiters like commas
Every line must be sliced at exact character offsets
This logic lives in backend/src/parsers/process.py.
Here is an example of how a record layout is defined:
colspecs_01 = [
    (0, 2),     # record_type
    (2, 9),     # status_number
    (9, 11),    # status_sequence_number
    (11, 14),   # county_code
    (14, 46),   # lease_name
]
Each tuple defines the start and end character positions for a column.
The matching column names are defined separately:
names_01 = [
    "record_type",
    "status_number",
    "status_sequence_number",
    "county_code",
    "lease_name",
]
Pandas handles the parsing cleanly:
import pandas as pd

df = pd.read_fwf(
    file_path,
    colspecs=colspecs_01,
    names=names_01,
    encoding=ENCODING,
)
This converts thousands of raw text lines into a structured DataFrame.
Multiple record types are parsed this way, then joined together into a unified permit table.
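For illustration, combining two parsed record types could look like this (the DataFrame names and join key are hypothetical; the actual layouts and keys live in process.py):

# Hypothetical example: attach detail records to the "01" master records
# using the shared permit status number as the join key.
permits = df_01.merge(df_02, on="status_number", how="left")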
Instead of keeping CSVs or DataFrames in memory, the project loads everything into a local SQLite database.
This provides:
Persistence
Fast filtering
Simple querying
Easy API access
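Loading a parsed DataFrame into the staging table is a single call with pandas (a sketch; the database filename is illustrative):

import sqlite3

conn = sqlite3.connect("permits.db")  # illustrative filename
# Stage the freshly parsed rows; the swap into permit_data happens next.
df.to_sql("permit_data_tmp", conn, if_exists="replace", index=False)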
The pipeline uses temporary tables to avoid partial updates.
The flow looks like this:
cur.execute("DELETE FROM permit_data;")
cur.execute("""
INSERT INTO permit_data
SELECT * FROM permit_data_tmp;
""")
Because both statements are committed together, the main table is always fully refreshed and never left half-updated.
Once complete, all temporary tables are dropped.
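The cleanup itself is a single statement per staging table (a sketch; the table name matches the example above):

cur.execute("DROP TABLE IF EXISTS permit_data_tmp;")
conn.commit()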
With the data cleaned and stored, the final step is exposing it through an API.
The FastAPI app is defined in backend/src/main.py:
from fastapi import FastAPI

app = FastAPI(
    title="Texas RRC Oil & Gas Permits API",
    version="0.1.0",
)
Routes are split cleanly by responsibility.
The permit endpoints live in backend/src/api/routes/permits.py.
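Each route module defines its own APIRouter, which main.py mounts with include_router (a sketch; the import path and URL prefix shown here are assumptions):

# backend/src/api/routes/permits.py (sketch)
from fastapi import APIRouter

router = APIRouter()

# backend/src/main.py (sketch; module path and prefix are assumptions)
from api.routes import permits

app.include_router(permits.router, prefix="/permits", tags=["permits"])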
For example, listing permits:
@router.get("/")
def list_permits(limit: int = 50):
return get_all_permits(limit=limit)
Fetching a single permit by number:
@router.get("/{permit_number}")
def fetch_permit(permit_number: str):
permit = get_permit_by_number(permit_number)
if not permit:
raise HTTPException(status_code=404)
return permit
And searching by well name:
@router.get("/search")
def search_permits(well_name: str, limit: int = 50):
return search_permits_by_well_name(well_name, limit)
The service layer handles SQL queries and returns clean Python objects.
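A representative service function might look like this (a sketch; the column name and database path are assumptions, not the repo's code):

import sqlite3

def get_permit_by_number(permit_number: str, db_path: str = "permits.db"):
    """Return a single permit as a plain dict, or None if it does not exist."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become dict-like
    row = conn.execute(
        "SELECT * FROM permit_data WHERE status_number = ?;",  # assumed column
        (permit_number,),
    ).fetchone()
    conn.close()
    return dict(row) if row else None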
This project follows a simple but effective pattern:
Browser automation only where necessary
Deterministic parsing logic
SQLite for durability and speed
FastAPI for clean interfaces
You can:
Rebuild the database at any time
Extend the API easily
Swap SQLite for DuckDB or Postgres later
Run everything locally
This is a solid foundation for analytics, dashboards, or public data services.
Public data is often messy, poorly documented, and hard to access. The real work is not the analysis; it is getting the data into a usable shape.
This project shows a clean, realistic approach to that problem:
automation → parsing → storage → API.
If you are building analytics on top of legacy or government data, this pattern will serve you well.
GitHub repo here.