What if the most time-consuming aspects of your data science workflow could be automated, freeing you up to focus on complex problem-solving and critical thinking?
Automating aspects of machine learning like model selection, training, and scoring has been very successful. However, according to the “State of Data Science” report by Anaconda (2020), these tasks account for only around 23% of a data scientist’s time. What about the rest? Data cleaning and engineering, often deemed the more laborious and time-consuming tasks, are largely left to human practitioners.
Until now, automation tools have not been able to understand the context in which a dataset is used: automation succeeded at optimizing technical parameters but failed on domain-dependent challenges. This changes with Context-Aware Automated Feature Engineering (CAAFE). CAAFE leverages Large Language Models (LLMs) like GPT-4 to automate the creation of features in tabular datasets based on a description of the dataset, taking the context of the data science problem into account.
Feature engineering is a crucial step in the data science pipeline where the raw dataset is transformed or enriched to improve the performance of machine learning models. This often involves domain knowledge, creativity, and plenty of manual effort. It requires an understanding of the data, followed by mathematical, statistical, and domain-specific transformations that create new features which better represent the problem to the model.
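To make this concrete, here is a small, hypothetical example of manual feature engineering on an invented patient dataset (the column names and transformations are illustrative, not from the CAAFE paper):

import pandas as pd

# Hypothetical patient data; columns and values are invented for illustration
df = pd.DataFrame({
    "height_m": [1.70, 1.82, 1.60],
    "weight_kg": [65, 90, 70],
    "birth_year": [1990, 1975, 2001],
})

# Domain knowledge: body mass index is often more predictive than raw height/weight
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Simple derived feature: age from birth year (reference year is illustrative)
df["age"] = 2023 - df["birth_year"]

A model sees "bmi" and "age" as single, informative columns instead of having to rediscover these relationships from the raw inputs.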
Automation of this process can be a significant boost for data scientists, speeding up model development, reducing human error, and potentially even surfacing new insights by generating features a human engineer might not have considered. CAAFE takes us a step closer to this reality.
CAAFE uses large language models to interpret the description of a dataset and iteratively generate Python code that creates new, semantically meaningful features. But it doesn’t stop there; CAAFE also explains why these generated features could be useful. This approach not only enhances the machine learning model’s performance, but it also provides interpretability and transparency.
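For illustration, a generated snippet might look like this (an invented example in the spirit of CAAFE's output, assuming a Titanic-style DataFrame df; the comment mirrors the kind of explanation CAAFE attaches to each feature):

# Feature: FarePerPerson
# Usefulness: Fare divided by family size captures per-person spending,
# which relates to passenger class and thus to survival.
df["FarePerPerson"] = df["Fare"] / (df["SibSp"] + df["Parch"] + 1)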
Why not let GPT generate your features directly, or simply use OpenAI's Code Interpreter? While GPT-4 is a powerful model, it is not specifically designed for ML. CAAFE steps in with a systematic verification process: it checks that the generated features actually help the task at hand, provides feedback to the LLM, and safeguards code execution.
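Conceptually, the loop evaluates each generated feature on held-out data and keeps it only if performance improves. Below is a minimal, self-contained sketch of that idea, where hand-written candidate features stand in for LLM-generated code (this is not CAAFE's actual implementation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
best_score = cross_val_score(clf, X, y, cv=5).mean()  # baseline performance

# Candidate features standing in for LLM-generated feature code
candidates = [X[:, 0] * X[:, 1], X[:, 0] + X[:, 2], X[:, 2] ** 2]

for feat in candidates:
    X_new = np.column_stack([X, feat])
    score = cross_val_score(clf, X_new, y, cv=5).mean()
    if score > best_score:  # keep only features that improve validation performance
        X, best_score = X_new, score
    # a rejected candidate is discarded; CAAFE additionally feeds the
    # outcome back to the LLM as feedback for the next iteration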
How to apply CAAFE?
First, we initialize a CAAFEClassifier object, specifying the base classifier and the language model.
from sklearn.ensemble import RandomForestClassifier
from caafe import CAAFEClassifier

# Initialize your sklearn base classifier
clf_no_feat_eng = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Initialize CAAFEClassifier
caafe_clf = CAAFEClassifier(base_classifier=clf_no_feat_eng,
                            llm_model="gpt-4",
                            iterations=2)
Next, we fit the CAAFE-enhanced classifier to our training data.
# df_train is your training data
# target_column_name is the column you're predicting
# dataset_description is a textual description of your dataset
caafe_clf.fit_pandas(df_train,
                     target_column_name=target_column_name,
                     dataset_description=dataset_description)

After fitting the model, we use it to make predictions on our test data.
# df_test is your test data
pred = caafe_clf.predict(df_test)

Finally, to view the Python code that CAAFE generated for creating new features, we use the code attribute.
print(caafe_clf.code)

Remember, the effectiveness of CAAFE depends significantly on the accuracy and detail of the provided dataset description. Providing a detailed and correct description of your data allows CAAFE to generate more valuable features.
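For example, a dataset description might look like the following (an invented example for a hypothetical churn dataset; the wording is purely illustrative):

dataset_description = """Customer churn data from a telecom provider.
Each row is one customer. Columns include contract type, monthly charges,
tenure in months, and the services the customer subscribes to.
The target column 'churn' indicates whether the customer cancelled
their contract within the last quarter."""

The more of this domain context the LLM sees, the more semantically meaningful the features it can propose.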
A full demo notebook can be found at: https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a
Why is CAAFE a Game-Changer for Data Scientists?
CAAFE is a tool that empowers data scientists to focus on more complex tasks. By generating semantically meaningful features, CAAFE not only speeds up the process but also potentially introduces novel perspectives that human data scientists might have missed.
The automation of feature engineering brings us closer to the goal of complete AutoML pipelines. The ability to semi-automate data science tasks is not just a time-saver. It’s a step towards making data science more accessible, by lowering the barrier to entry and enabling a wider range of people to build effective machine learning models. In a broader sense, tools like CAAFE emphasize the significance of context-aware solutions and hold the potential to extend the scope of AutoML systems to semantic AutoML.
But why does this matter? CAAFE is more than just a tool. It shows the potential of LLMs to automate a broader range of data science tasks. The question is no longer about whether AutoML can evolve to cover more of data science, but how we can harness this potential to make data science more accessible and efficient. Welcome to the exciting future of automated data science.
Automating the integration of domain knowledge into the AutoML process has clear advantages: i) it reduces the latency from data to trained models; ii) it reduces the cost of creating ML models; iii) it evaluates a space of solutions that is more informed than what was previously possible with AutoML, yet larger than what manual approaches to integrating domain knowledge could cover; and iv) it enhances the robustness and reproducibility of solutions, as computer-generated solutions are more easily reproduced.
Executing AI-generated code requires careful consideration. We’ve implemented a whitelist of safe Python commands, but risks remain. Also, AI can replicate or even exacerbate biases present in the training data. Much more work is needed to avoid this. Please use CAAFE cautiously and examine its generated features critically, especially with an eye on principles from algorithmic fairness.
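As an illustration of the kind of safeguard involved, the sketch below rejects code that imports modules outside an allowed set. This is a simplified, hypothetical check, not CAAFE's actual implementation, and a real safeguard would also need to restrict dangerous builtins such as eval or open:

import ast

ALLOWED_MODULES = {"pandas", "numpy"}  # hypothetical whitelist

def check_code(code: str) -> bool:
    """Return True only if the code imports nothing outside ALLOWED_MODULES."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] not in ALLOWED_MODULES
                   for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] not in ALLOWED_MODULES:
                return False
    return True

print(check_code("import pandas as pd"))  # True
print(check_code("import os"))            # False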
We’d love to hear your feedback on this work. There are bound to be issues, but we’d be interested in hearing about which ones people actually encounter in practice. Please reach out to us with questions and improvements, and visit us at priorlabs.ai.
Read the full paper on arxiv: https://arxiv.org/abs/2305.03403