GitHub - sovai-research/open-investment-datasets

Building the industry's first open-source datasets for investment research. Unlike traditional vendors, who limit trials to institutions, we believe in open access. All datasets are free to access with a brief delay; subscribers receive real-time data.

👋 About Me

I'm Derek Snow, founder and researcher at sov.ai. My focus lies in AI & ML in Quantitative Finance, where I develop and curate datasets to advance research and applications in this field.

🌐 Connect with Me

LinkedIn | GitHub | Hugging Face

📚 Datasets

The price is for commercial usage of lagged data. If you are an academic, ignore it and simply download your data from Hugging Face!

| Emoji | Dataset | Description | Documentation | Price p/m |
|---|---|---|---|---|
| 📰 | sovai/news_sentiment | Two types of news datasets: one ticker-matched, the other theme-matched. | Documentation | $200 |
| 📈 | sovai/price_breakout | Daily updated predictions of upward price breakouts for US equities. | Documentation | $220 |
| 🔍 | sovai/insider_flow_prediction | More than 60 insider trading features useful for machine learning, including a flow prediction value. | Documentation | $465 |
| 💼 | sovai/institutional_trading | A comprehensive analysis of institutional investment behaviors, strategies, and portfolio dynamics. | Documentation | $580 |
| 📢 | sovai/lobbying_data | Ticker-matched lobbying data showing fine-grained corporate lobbying behavior. | Documentation | $645 |
| 🔽 | sovai/short_selling | Various short-selling datasets for risk analysis. | Documentation | $780 |
| 📖 | sovai/wikipedia_views | Daily Wikipedia page views and trends for some of the largest firms. | Documentation | $200 |
| 💊 | sovai/pharma_clinical_trials | A unique dataset that tags clinical trials with their predicted outcome success. | Documentation | $850 |
| 📊 | sovai/factor_signals | Traditional accounting factors, alternative financial metrics, and advanced statistical analyses for sophisticated financial modeling. | Documentation | $270 |
| 📉 | sovai/financial_ratios | More than 80 financial ratios calculated from financial statement and market data. | Documentation | $270 |
| 📜 | sovai/government_contracts | Government contracts data for publicly traded companies. | Documentation | $580 |
| ⚠️ | sovai/corp_risks | Chapter 7 and Chapter 11 bankruptcy predictions made easy for over 13,000 US publicly traded stocks. | Documentation | $270 |
| 🛡️ | sovai/risks | Daily updates on global risk perceptions, using leading indicators and advanced models to forecast various types of risk. | Documentation | $270 |
| 💬 | sovai/cfpb_complaints | The ticker-mapped Consumer Financial Complaint dataset. | Documentation | $480 |
| 🧮 | sovai/risk_indicators | A comprehensive corporate risk score for US stocks, constructed by analyzing company events. | Documentation | $270 |
| 🚦 | sovai/traffic_agencies | Data on website traffic for government agencies. | Documentation | $250 |
| 👥 | sovai/earnings_surprise | Earnings announcements obtained from external sources, together with estimate information leading up to the actual announcement. | Documentation | $680 |
| ❗ | sovai/bankruptcy | Chapter 7 and Chapter 11 bankruptcy predictions made easy for over 5,000 US publicly traded stocks. | Documentation | $270 |

Cost Tip: For commercial access to all 30 real-time datasets on docs.sov.ai, I recommend subscribing to the $285 p/m package, which can save as much as 90% of the cost.
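
As a rough way to check the saving for your own selection, the sketch below compares the per-dataset prices listed in the table above with the $285 package. It is only indicative: the table prices are for lagged data while the package covers real-time access, and `saving_vs_package` is a hypothetical helper name introduced here, not part of any sov.ai tooling.

```python
# Monthly list prices (USD), copied from the dataset table above.
PRICES = {
    "news_sentiment": 200,
    "price_breakout": 220,
    "insider_flow_prediction": 465,
    "institutional_trading": 580,
    "lobbying_data": 645,
    "short_selling": 780,
    "wikipedia_views": 200,
    "pharma_clinical_trials": 850,
    "factor_signals": 270,
    "financial_ratios": 270,
    "government_contracts": 580,
    "corp_risks": 270,
    "risks": 270,
    "cfpb_complaints": 480,
    "risk_indicators": 270,
    "traffic_agencies": 250,
    "earnings_surprise": 680,
    "bankruptcy": 270,
}
PACKAGE_PRICE = 285  # USD per month, from the cost tip above


def saving_vs_package(selected):
    """Fractional saving of the package over buying `selected` individually.

    A negative value means the individual datasets are cheaper than the package.
    """
    individual_total = sum(PRICES[name] for name in selected)
    if individual_total == 0:
        raise ValueError("select at least one dataset")
    return 1 - PACKAGE_PRICE / individual_total


if __name__ == "__main__":
    wanted = ["short_selling", "pharma_clinical_trials", "earnings_surprise"]
    print(f"Saving for {wanted}: {saving_vs_package(wanted):.1%}")
```

On these list prices a single $200 dataset is cheaper on its own, while any two datasets together already cost more than the package.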

All our datasets are in beta, so please be part of our development process: submit suggestions or error reports through the issues portal.

🧪 Example Use Cases

Below are example code snippets demonstrating how to load each dataset using the Hugging Face datasets library.

  • 📰 sovai/news_sentiment

    from datasets import load_dataset
    df_news_sentiment = load_dataset("sovai/news_sentiment", split="train").to_pandas()
  • 📈 sovai/price_breakout

    from datasets import load_dataset
    df_price_breakout = load_dataset("sovai/price_breakout", split="train").to_pandas()
  • 🔍 sovai/insider_flow_prediction

    from datasets import load_dataset
    df_insider_flow = load_dataset("sovai/insider_flow_prediction", split="train").to_pandas()
  • 💼 sovai/institutional_trading

    from datasets import load_dataset
    df_institutional_trading = load_dataset("sovai/institutional_trading", split="train").to_pandas()
  • 📢 sovai/lobbying_data

    from datasets import load_dataset
    df_lobbying_data = load_dataset("sovai/lobbying_data", split="train").to_pandas()
  • 🔽 sovai/short_selling

    from datasets import load_dataset
    df_short_selling = load_dataset("sovai/short_selling", split="train").to_pandas()
  • 📖 sovai/wikipedia_views

    from datasets import load_dataset
    df_wikipedia_views = load_dataset("sovai/wikipedia_views", split="train").to_pandas()
  • 💊 sovai/pharma_clinical_trials

    from datasets import load_dataset
    df_pharma_trials = load_dataset("sovai/pharma_clinical_trials", split="train").to_pandas()
  • 📊 sovai/factor_signals

    from datasets import load_dataset
    df_factor_signals = load_dataset("sovai/factor_signals", split="train").to_pandas()
  • 📉 sovai/financial_ratios

    from datasets import load_dataset
    df_financial_ratios = load_dataset("sovai/financial_ratios", split="train").to_pandas()
  • 📜 sovai/government_contracts

    from datasets import load_dataset
    df_government_contracts = load_dataset("sovai/government_contracts", split="train").to_pandas()
  • ⚠️ sovai/corp_risks

    from datasets import load_dataset
    df_corp_risks = load_dataset("sovai/corp_risks", split="train").to_pandas()
  • 🛡️ sovai/risks

    from datasets import load_dataset
    df_risks = load_dataset("sovai/risks", split="train").to_pandas()
  • 💬 sovai/cfpb_complaints

    from datasets import load_dataset
    df_cfpb_complaints = load_dataset("sovai/cfpb_complaints", split="train").to_pandas()
  • 🧮 sovai/risk_indicators

    from datasets import load_dataset
    df_risk_indicators = load_dataset("sovai/risk_indicators", split="train").to_pandas()
  • 🚦 sovai/traffic_agencies

    from datasets import load_dataset
    df_traffic_agencies = load_dataset("sovai/traffic_agencies", split="train").to_pandas()
  • 👥 sovai/earnings_surprise

    from datasets import load_dataset
    df_earnings_surprise = load_dataset("sovai/earnings_surprise", split="train").to_pandas()
  • ❗ sovai/bankruptcy

    from datasets import load_dataset
    df_bankruptcy = load_dataset("sovai/bankruptcy", split="train").to_pandas()
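
Since every snippet above follows the same pattern, several datasets can be loaded in one pass. The following is a minimal sketch rather than part of the sov.ai tooling: `load_frames` and its `loader` and `org` parameters are hypothetical names introduced here, and it assumes each repository exposes a `train` split, as the snippets above do.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def load_frames(
    names: Iterable[str],
    loader: Optional[Callable[[str], Any]] = None,
    org: str = "sovai",
) -> Dict[str, Any]:
    """Load several datasets and return them keyed by their short name.

    `loader` receives a repository id such as "sovai/risks" and returns the
    loaded object. When omitted, it mirrors the snippets above.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised with a stub loader
        # on machines where the `datasets` library is not installed.
        from datasets import load_dataset

        def default_loader(repo_id: str) -> Any:
            return load_dataset(repo_id, split="train").to_pandas()

        loader = default_loader

    return {name: loader(f"{org}/{name}") for name in names}
```

Calling `load_frames(["risks", "bankruptcy"])` would then return a dictionary with one pandas DataFrame per name. Passing a custom `loader` makes it possible to swap in caching or a different split without changing the loop.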