phrase-ticker Dataset Generator
The phrase-ticker Dataset Generator is a Python project that links S&P 500 company tickers with relevant natural language phrases. It uses GPT-4 to generate those phrases, with the aim of improving stock ticker extraction from text and supporting financial NLP tasks such as sentiment analysis and entity recognition. The resulting dataset is formatted for use with the Hugging Face datasets library, so it can be used directly in machine learning applications in finance.
Contents
- Data Collection: Utilizes web scraping to gather names and ticker symbols of S&P 500 companies from Wikipedia.
- Phrase Generation: Employs GPT-4 to generate natural language phrases related to each company.
- Dataset Construction: Assembles the generated phrases and tickers into a structured dataset.
- Hugging Face Integration: Formats the dataset for seamless use with the Hugging Face datasets library.
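The dataset construction step can be pictured with a minimal sketch. It assumes the phrase generation step yields a mapping from ticker to a list of phrases, and it flattens that mapping into one record per phrase. The column names `phrase` and `ticker`, the helper names, and the sample data are all hypothetical and are not taken from the notebook.

```python
import json
from pathlib import Path


def build_records(phrases_by_ticker):
    """Flatten {ticker: [phrase, ...]} into one record per phrase."""
    records = []
    for ticker, phrases in phrases_by_ticker.items():
        for phrase in phrases:
            cleaned = phrase.strip()
            if cleaned:  # skip empty generations
                records.append({"phrase": cleaned, "ticker": ticker})
    return records


def write_jsonl(records, path):
    """Write records as JSON Lines: one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical sample data; real phrases would come from the GPT-4 step.
sample = {
    "AAPL": ["iPhone maker", "the Cupertino tech giant "],
    "MSFT": ["the company behind Windows", ""],
}
records = build_records(sample)
```

Calling `write_jsonl(records, "phrase_ticker_sample.jsonl")` would then produce a file that the Hugging Face datasets library can read with `load_dataset("json", data_files="phrase_ticker_sample.jsonl")`. The notebook may build the dataset differently; this only illustrates the shape of the data.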
Getting Started
Prerequisites
- Python 3.8+
- Jupyter Notebook or JupyterLab
- Pip for package installation
Installation
- Clone the repository:
git clone git@github.com:rohanmahen/phrase-ticker.git
- Navigate to the project directory:
cd phrase-ticker
- Install required Python packages:
pip install -r requirements.txt
Configuration
Ensure your environment is correctly set up:
- Sign up at OpenAI to obtain an API key.
- In the project root, create a .env file and insert your OpenAI API key:
OPENAI_API_KEY='your_api_key_here'
Important: To secure your API key, do not share or commit the .env file.
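For reference, here is a minimal, standard-library-only sketch of how a key stored this way could be read at runtime. The notebook may well rely on a package such as python-dotenv instead; the function names `load_env_file` and `get_openai_api_key` are hypothetical and serve only to show the idea.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict, stripping optional quotes."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Ignore blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_api_key(path=".env"):
    """Prefer a key already exported in the environment, then fall back to .env."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it.")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication error raised later by the API client.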
Usage
Open and run the src/main.ipynb notebook to generate the dataset:
jupyter notebook src/main.ipynb
or
jupyter lab src/main.ipynb
Follow the notebook instructions for detailed steps on data collection, phrase generation, dataset construction, and export.
Contributing & Usage
We welcome contributions! Feel free to fork the repository and submit pull requests with enhancements, bug fixes, or documentation improvements. You are also welcome to use the dataset in your own projects via Hugging Face.
License
This project is licensed under the MIT License - see the LICENSE file for details.