GitHub - ralliesai/openfactor: Multi-factor equity risk model with classic factors, semantic LLM exposures, and residual factor discovery.

Open-source equity factor risk model

Portfolio exposures, factor risk attribution, and idiosyncratic risk

OpenFactor is a deterministic equity risk model for portfolio analytics, risk attribution, and manager research workflows. It is designed to be an open alternative to institutional multi-factor risk models.

Public Model

The first public model is:

It covers the top 1000 active US common stocks by market cap.

What the Model Provides

The model package loads:

Object	Use
`universe`	Model constituents
`exposures`	Ticker-level factor exposures
`factor_returns`	Recent realized factor returns
`residual_returns`	Recent per-stock residual returns after common factors
`exposures_panel`	Lagged exposure rows for 1-day return attribution (loaded on demand)
`factor_covariance`	Annualized factor covariance matrix
`idiosyncratic_risk`	Annualized idiosyncratic residual risk
`metadata`	Universe name, model version, and model metadata

These files are enough to report portfolio exposures, common-factor risk, idiosyncratic risk, and total risk without direct access to vendor data.

Python Usage

pip install git+https://github.com/ralliesai/openfactor.git

import pandas as pd
import openfactor as of

portfolio = pd.DataFrame(
    {
        "ticker": ["AAPL", "MSFT", "NVDA"],
        "value": [400000, 300000, 300000],  # dollars held; negative for a short
    }
)

snapshot = of.load_snapshot("openfactor-us1000")
report = of.portfolio_report(portfolio, snapshot)

portfolio_report accepts a value column (dollar holdings, gross-normalized to signed weights) or an allocation column of model weights; both produce the same tables.

Factor Model Data

OpenFactor also exposes aligned model data for downstream analytics:

data = of.factor_model_data(snapshot)

data.tickers                    # asset index
data.exposures                  # assets x factors
data.factor_covariance          # factors x factors
data.idiosyncratic_variance     # asset idiosyncratic variance
data.benchmark_weights          # cap-weighted model risk proxy
data.factor_groups              # factor group labels

data.factor_exposure(weights)
data.active_factor_exposure(weights)
data.risk(weights)
data.tracking_error(weights)
data.beta(weights)
data.portfolio_frame(weights)

OpenFactor does not solve portfolios. It supplies exposures, covariance, benchmark weights, idiosyncratic variance, and risk helpers that another library can use.

Factor Coverage

Family	Factor	Internal Name	Construction
Market	Market	`market`	Benchmark market leg; SPY/S&P 500 in the public default snapshot
Market	Beta	`beta`	Sensitivity to broad market returns
Size	Size	`size`	Log market capitalization
Size	Mid-Cap	`mid_cap`	Nonlinear size exposure
Momentum	Momentum	`momentum`	12-month return skipping the most recent month
	Industry Momentum	`industry_momentum`	Recent momentum of industry peers
	Seasonality	`seasonality`	Same-month historical return tendency
	Long-Term Reversal	`long_term_reversal`	Negative return from the prior long-horizon window
	Short-Term Reversal	`short_term_reversal`	Negative recent one-month return
Volatility	Residual Volatility	`residual_volatility`	Volatility after removing market beta
	Downside Risk	`downside_risk`	Volatility of negative daily returns
	Prospect	`prospect`	Upside skew and drawdown profile
Liquidity and Positioning	Liquidity	`liquidity`	Log average dollar volume
Liquidity and Positioning	Short Interest	`short_interest`	Short interest scaled by shares
Value and Yield	Value	`value`	Book equity divided by market value
	Earnings Yield	`earnings_yield`	Net income divided by market value
	Forward Earnings Yield	`forward_earnings_yield`	Forward net-income estimate divided by market value
	Dividend Yield	`dividend_yield`	Trailing dividends divided by price
Growth	Growth	`growth`	Revenue and earnings growth
Growth	Forward Growth	`forward_growth`	Forward revenue and earnings growth
Quality	Profitability	`profitability`	Net income divided by assets
	Gross Profitability	`gross_profitability`	Gross profit divided by assets
	Earnings Quality	`earnings_quality`	Cash-flow quality of earnings
	Earnings Variability	`earnings_variability`	Variability of recent quarterly earnings
	Capital Discipline	`investment_quality`	Low asset growth, low capex intensity, buybacks, and low issuance
Balance Sheet	Leverage	`leverage`	Liabilities divided by assets
Balance Sheet	Asset Growth	`investment`	Asset growth from latest filing data
Classification	Sector	`sector:*`	Sector membership
Classification	Industry	`industry:*`	Industry membership
Analyst	Analyst Sentiment	`sentiment`	Time-decayed analyst recommendation score

market is estimated inside the factor-return model. The remaining scalar, sector, and industry factors are ticker-level exposures.

CLI Usage

openfactor --universe openfactor-us1000 --portfolio portfolio.csv

portfolio.csv lists the dollar value held in each name (negative for a short):

ticker,value
AAPL,400000
MSFT,300000
NVDA,300000

OpenFactor normalizes by gross exposure into signed weights, so the absolute book size does not change percentage risk. The Python portfolio_report API takes the same dollar value column (or an allocation weights column) directly.

The CLI opens an interactive Textual terminal that uses SPY as the default return benchmark when public index files are present, keeps ex-ante tracking error in model-risk space, and leads with the decision numbers:

Headline cards: total risk, tracking error, one-day VaR (95%), ex-ante beta to the model risk proxy, and the idiosyncratic share of tracking error.
Portfolio risk: the current absolute risk decomposition: common factor, market, style, sector, industry, idiosyncratic, and total risk.
Active risk: every factor's active exposure and its % of the tracking-error budget, sorted, with annualized contribution-to-tracking-error shown next to the share. Diversifying factors (those that reduce tracking error through covariance) are shown in green.
Idiosyncratic risk by name: which holdings drive idiosyncratic risk, with top-name concentration and the effective number of names.
Active return attribution: benchmark return + active return = portfolio return for the latest trading day. The panel has two separate tables: Active return reconciliation for style, sector, industry, idiosyncratic return, and total active return; and Top active return contributors for ranked factor details with contribution, % Active, and TE Share side by side. When enough --track history exists, the same panel adds multi-day attribution buttons for the stored holding path.
Idiosyncratic return by name: the holdings that drove the name-level return line, adjusted so the name rows reconcile to the benchmark-relative idiosyncratic return shown in the active-return table.
Parametric loss & beta: normal one-day VaR (95% / 99%, total and active), ex-ante beta, realized beta when a --track history exists, and realized information ratio. Historical and macro scenarios are omitted until the snapshot ships a real scenario library.

Building a track record

A single run shows where you stand today. To turn daily snapshots into a real track record, pass --track <folder>:

openfactor --portfolio portfolio.csv --track ./openfactor-track

The track folder is local to your machine. It does not write to OpenFactor's public buckets. Each run stores one dated report under days/<date>/ and rebuilds aggregate CSVs at the folder root for analysis. Re-running the same snapshot date overwrites that date's files.

The folder stores enough detail to answer real multi-day questions later:

File	Contents
`track.csv`	Daily portfolio, benchmark, active return, risk, beta, and summary fields
`holdings.csv`	Daily portfolio weights
`factor_contrib.csv`	Daily factor return contributions
`idiosyncratic_returns.csv`	Daily idiosyncratic return by holding
`idiosyncratic_risk.csv`	Daily idiosyncratic risk by holding
`active_risk.csv`	Daily active-risk driver rows
`risk_rows.csv`	Daily total-risk decomposition rows
`days/<date>/report.json`	Complete report snapshot for that date

Run it daily and the stored daily returns accumulate into realized beta, information ratio, hit rate, and cumulative active return. To backfill honestly, run past dates (--snapshot <date>) with the holdings you actually held then, not today's weights.

Because each day's factor and idiosyncratic return breakdown is stored, the Active return attribution panel adds multi-day buttons as history builds:

Stored days	Button
7+	`1W`
22+	`1M`
63+	`1Q`
252+	`1Y`
More than 252	`All`

Each button sums the real stored daily holdings over that window. It is not a backtest and it does not run today's weights backward. The idiosyncratic return name-driver table switches with the selected window too, using average weight over the selected stored days.

Semantic residual discovery

Pass --semantic to run LLM semantic residual discovery before the terminal opens; any accepted factors then appear in a Semantic residual discovery panel in the report:

export OPENAI_API_KEY=sk-...
openfactor --portfolio portfolio.csv --semantic

It needs OPENAI_API_KEY (LLM + web search). Without the key, OpenFactor asks whether to continue with the normal report or exit, rather than running discovery. See Semantic Residual Discovery below for what it finds and the equivalent Python API.

The terminal lives in tui/; the underlying analytics are in portfolio/active_risk.py. By default OpenFactor loads the latest published model; pass --snapshot <date> for a reproducible historical run.

Report Output

portfolio_report() returns a dictionary of pandas tables.

Key	Table
`missing_holdings`	Holdings not found in the model universe
`style`	Portfolio exposure to scalar factors
`sector`	Portfolio sector allocation
`idiosyncratic_risk`	Holding-level idiosyncratic risk
`factor_risk`	Factor exposure, factor volatility, risk contribution, and variance contribution
`active_risk`	Benchmark-relative factor exposure and tracking-error contribution
`risk_share`	Factor vs idiosyncratic variance share
`total_risk`	Factor, idiosyncratic, and total annualized risk
`tracking_error`	Active factor, idiosyncratic, and total tracking error vs the benchmark

Example report access:

report["style"]
report["factor_risk"]
report["active_risk"]
report["total_risk"]
report["tracking_error"]

Typical table shapes:

style
                      exposure
Beta                    ...
Momentum                ...
Size                    ...
Value                   ...

factor_risk
                      exposure  factor_volatility  risk_contribution
Beta                       ...                ...                ...
Sector: Technology         ...                ...                ...
Momentum                   ...                ...                ...

total_risk
                    risk
factor               ...
idiosyncratic        ...
total                ...

Semantic Residual Discovery

Semantic discovery is on-demand. The base model stays deterministic; the LLM is only called when a portfolio still has enough unexplained idiosyncratic risk to justify looking for a missing common risk.

The bundled client uses web search. Normal OpenFactor reports do not construct the LLM client or require OPENAI_API_KEY.

pip install git+https://github.com/ralliesai/openfactor.git

result = of.discover_semantic_factors(
    portfolio,
    snapshot,
    threshold=0.10,  # 10% residual variance share; pass 0.20 for 20%
    semantic_cache="r2://openfactor-public/semantic_factors.csv",
)

result.candidates
result.accepted
result.skipped

Semantic discovery is primarily a Python API (discover_semantic_factors(), above). The terminal also runs it on demand: openfactor --portfolio portfolio.csv --semantic (needs OPENAI_API_KEY) runs discovery first and adds a Semantic residual discovery panel to the report.

Environment:

User runtime:

Variable	Required For
`OPENAI_API_KEY`	LLM discovery and membership classification
`OPENFACTOR_SEMANTIC_MODEL`	Optional model override
`OPENFACTOR_SEMANTIC_TIMEOUT`	Optional per-request timeout override

Maintainer publishing only:

Variable	Required For
`OPENFACTOR_R2_ACCOUNT_ID`	Writing shared public artifacts
`OPENFACTOR_R2_ACCESS_KEY_ID`	Writing shared public artifacts
`OPENFACTOR_R2_SECRET_ACCESS_KEY`	Writing shared public artifacts

Normal users do not need R2 credentials. The default shared semantic cache is read through the public URL. If discovery finds new labels on a machine without R2 write credentials, the result is still returned; OpenFactor just skips the shared cache write-back.

How it works:

Step	Behavior
Trigger	Runs only when `discover_semantic_factors()` is called
Residual window	Uses recent residual-return history, default `63` trading days
Discovery	Uses residual PCA, deterministic exposures, and web search to propose missing common risks
Guardrail	Rejects candidates already explained by market, sector, industry, or existing style factors
Membership	Classifies each universe stock as binary `0/1`, not a fragile LLM score
Refit	Keeps candidates when idiosyncratic return variance is lower after adding them
Cache	Reuses old binary labels and only asks the LLM for missing ticker/factor cells; write-back is optional

The shared semantic cache lives in the Cloudflare public bucket:

semantic_factors.csv

Shape:

ticker,ai_infrastructure,retail_flow
NVDA,1,0
GME,0,1
AAPL,0,0

Load members for one semantic factor:

import openfactor as of

stocks = of.semantic_factor_members("Retail Speculation")
# ["GME", "HOOD", "RDDT", ...]

The function accepts either the readable factor name or the cache column id:

stocks = of.semantic_factor_members("retail_speculation")

If yesterday's cache covered 1000 stocks and today's universe has 1001, the existing 1000 labels are reused and only the new ticker is classified. Rows for tickers that leave the universe can stay in the cache; they are harmless and useful if the ticker re-enters later.

Institutions can also pass their own client with a complete_json(instructions, payload) method. Pass a local semantic_cache path for private experiments if they want write-back without OpenFactor maintainer credentials; the default shared cache is a public read-only object for normal users.

Model Methodology

OpenFactor separates exposures (how much each stock loads on a factor) from factor returns (what each factor earned), and estimates both with no look-ahead.

Exposures

Exposures are built from price history, market data, point-in-time fundamentals, forward estimates, analyst data, and sector/industry classification. Each scalar exposure is winsorized around the cross-sectional median (MAD-based, so a handful of outliers can't dominate) and then standardized to a z-score. The cap-weighted mean is removed so the market sits near zero on every style factor (each exposure reads as a tilt relative to the market), and the score is divided by the equal-weighted standard deviation so a few mega-caps don't set the scale. It falls back to equal weighting when caps are missing. Sector and industry exposures stay categorical.

Exposures for a given day use only information known before that day's return: prices through the prior close, and the fundamentals and estimates effective as of that date. Nothing from the future leaks in.

Factor returns

Each day, factor returns come from a single Barra-style cross-sectional regression of stock returns on exposures:

stock return = S&P 500 benchmark market + sector + industry + style factors + residual

The fit is built to be robust:

Root-cap weighted (WLS): regression weights are √(market cap), so large, liquid names anchor the fit without a handful of mega-caps dominating it.
Sector returns constrained to a cap-weighted sum of zero, so sector returns read as clean tilts relative to the benchmark market leg.
Winsorized stock returns: a single name's blow-up day can't distort the estimates.
Explicit market, sector, broad-industry, and style factors, with thinly-populated industries folded out of the cross-section.
Rolling and point-in-time: re-run each day on that day's as-of exposures, producing a clean daily history of factor returns and per-stock residuals.

The residuals are what remains after every common factor, and they drive idiosyncratic risk.

Risk

Factor covariance is the annualized sample covariance of recent daily factor returns. Idiosyncratic risk is each stock's annualized residual volatility, treated as uncorrelated across names.

Risk attribution then combines portfolio factor exposures with the factor covariance matrix for common-factor risk, and adds idiosyncratic risk at the portfolio level to give factor, idiosyncratic, and total risk.

Benchmark and active risk

The report carries public index and ETF benchmark files outside the stock factor universe: broad-market and size proxies (SPY, QQQ, IWM, IJH, IJR), the eleven sector SPDRs, and a set of style/factor ETFs (momentum, value, quality, volatility, dividend, and growth). Return attribution uses S&P 500 via SPY as the default benchmark return when index_returns.csv is present, so the headline is SPY benchmark return plus active return equals portfolio return.

openfactor-us1000 is the stock universe and public dataset namespace. It is not used as the return benchmark. When SPY returns are available, the model pins the market factor to SPY and estimates the remaining style, sector, industry, and idiosyncratic returns around that benchmark leg.

The ex-ante risk model still needs a holdings-style risk proxy. Until OpenFactor publishes index look-through or index factor exposures, tracking error and model beta use the cap-weighted model universe (every model constituent weighted by market cap) because that risk proxy ships with the model and needs no index license.

Active exposures are the portfolio's exposures minus the risk proxy's (active = portfolio - risk proxy), and the same factor covariance and idiosyncratic risk produce active factor risk, active idiosyncratic risk, and total tracking error. Because style exposures are standardized around the cap-weighted mean, the risk proxy sits near zero on every style factor: active style exposures read as the portfolio's tilts, the market factor nets to zero, and sector and industry carry the real risk-proxy-relative bets.

Return attribution uses lagged exposures times realized factor returns, plus idiosyncratic returns. The active-return table reconciles active return versus SPY with style, sector, industry, idiosyncratic, and total active rows; it does not show a separate universe-return leg. % Active is contribution divided by active return, so it can exceed 100% when positive and negative drivers offset. TE Share is the same factor's contribution to tracking error from the ex-ante risk model.

Model Quality

Evidence that the model explains returns, measured on the published openfactor-us1000 model. These are in-sample, explanatory statistics: they describe how well the factors fit realized returns, not a forward risk-forecast calibration (bias statistics are future work).

Cross-sectional fit

Statistic	Value
Daily cross-sectional R², mean	63.57%
Daily cross-sectional R², median	63.35%
Trading days in window	252
Average stocks per regression	861

On an average day the model explains roughly 64% of the cross-sectional dispersion of stock returns across market, sector, industry, and style factors. The R² is weighted consistently with the WLS fit and measured around the cap-weighted mean return, so it reflects dispersion explained relative to the market and is not inflated by large index moves. It is a raw, in-sample fit over the latest 252 trading days (~1 year, a single market regime), and the near-identical mean and median indicate a stable day-to-day distribution. The roughly 861 of 1000 names per day reflect stocks dropped when required inputs are missing or their industry group is too thin to estimate.

Factor sanity check

OpenFactor's momentum factor return tracks recognized public momentum factors:

Benchmark	Correlation	Sample
Ken French U.S. Mom	0.77	daily, ~1 year overlap
AQR VME U.S. Momentum	0.59	monthly, ~12 observations

The daily correlation with Ken French is the stronger signal; the monthly AQR figure rests on only ~12 points and should be read as directional. OpenFactor's factor is purified, a cross-sectional regression return orthogonal to the model's other factors (size, beta, sector, and the rest), while the benchmarks are raw sorted portfolios, so a correlation in this range is what we expect and confirms the factor captures momentum rather than replicating any single index.

Roadmap

OpenFactor ships a clean, transparent baseline today. Planned enhancements to the covariance and idiosyncratic-risk estimation include:

Eigenfactor covariance adjustment: debias the factor covariance for use in optimized portfolios.
Volatility-regime scaling: align forecast risk with the current market volatility level.
Newey-West adjustment: account for serial correlation in daily factor returns.
EWMA / half-life weighting: give recent observations more weight.
Bayesian shrinkage of idiosyncratic risk: stabilize idiosyncratic estimates using observation counts.
Bias-statistic calibration: measure the model's forecast accuracy over time.

Files

The public model is stored as inspectable CSV and JSON files:

exposures.csv
details/exposures_long.csv
details/exposures_panel.csv.gz
factor_returns.csv
residual_returns.csv
factor_covariance.csv
idiosyncratic_risk.csv
universe.csv
indexes.csv
index_prices.csv
index_returns.csv
metadata.json

Current public files:

File	URL
Latest pointer	latest.json
Metadata	metadata.json
Exposures	exposures.csv
Long exposures	details/exposures_long.csv
Exposure panel (gzip)	details/exposures_panel.csv.gz
Factor returns	factor_returns.csv
Residual returns	residual_returns.csv
Factor covariance	factor_covariance.csv
Idiosyncratic risk	idiosyncratic_risk.csv
Universe	universe.csv
Index metadata	indexes.csv
Index prices	index_prices.csv
Index returns	index_returns.csv
Semantic cache	semantic_factors.csv

The runtime loader reads the public model files and returns:

snapshot.universe
snapshot.exposures
snapshot.factor_returns
snapshot.residual_returns
snapshot.factor_covariance
snapshot.idiosyncratic_risk
snapshot.indexes
snapshot.index_prices
snapshot.index_returns
snapshot.metadata

Scope

OpenFactor is the risk model layer.

It does not optimize portfolios, run strategy backtests, or simulate execution costs. Those workflows should consume OpenFactor as the risk-model layer from separate portfolio construction or backtesting packages.