Text2SQL fine-tuning CodeLLama EX to 82.5%

9 min read Original article ↗

Dbgpt

Press enter or click to view image in full size

Welcome to DB-GPT, a sub-project focused on fine-tuning for Text-to-SQL. This article is the first in a series of articles about DB-GPT-Hub, where we will share an overview and progress of the project. We are excited to announce that after a major refactor, we have successfully surpassed GPT-4 in terms of execution accuracy on the Spider dataset using our project code based on the open-source 13B model. In the upcoming articles, we will provide a step-by-step tutorial, explaining how to fine-tune the open-source LLM solely on the publicly available Spider training set to achieve higher accuracy than GPT-4 on the evaluation set.

Introduction

DB-GPT-Hub is an experimental project that utilizes LLMs for Text-to-SQL parsing. It mainly consists of steps such as dataset collection, data preprocessing, model selection and construction, and weight fine-tuning. Through this series of processes, we aim to improve the Text-to-SQL capability while reducing the cost of model training, allowing more developers to participate in the task of improving the accuracy of Text-to-SQL. Ultimately, we aim to achieve automatic question-answering based on databases, enabling users to perform complex database queries using natural language descriptions.

Currently, we have successfully implemented the entire workflow from data processing, model SFT training, prediction output, to evaluation using multiple large-scale models. The code for these processes is readily available for reuse in this project.

As of October 10, 2023, we have surpassed GPT-4 in terms of execution accuracy on the evaluation set of Spider by fine-tuning the models based on the open-source 13B model.

Text2SQL fine-tuning

We utilize SFT (Supervised Fine-Tuning) based on large language models to improve the effectiveness of Text-to-SQL.

1. Dataset

The dataset used in this project primarily focuses on the Spider dataset:

  • Spider: A cross-domain complex text-to-SQL dataset consisting of 10,181 natural language questions and 5,693 SQL queries distributed across 200 independent databases. It covers 138 different domains. Download link

Other datasets include:

  • WikiSQL: A large-scale semantic parsing dataset comprising 80,654 natural language utterances and SQL annotations for 24,241 tables. Each query in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, or subqueries.
  • CHASE: A cross-domain, multi-turn interactive Chinese text-to-SQL dataset consisting of 5,459 lists of multi-turn questions, totaling 17,940 <query, SQL> pairs. It involves databases from 280 different domains.
  • BIRD-SQL: A large-scale, cross-domain English benchmark dataset for text-to-SQL tasks, with a particular focus on large databases. The dataset includes 12,751 pairs of text-to-SQL data and 95 databases, totaling 33.4GB. It spans 37 professional domains. BIRD-SQL dataset addresses three additional challenges: handling large and messy database values, external knowledge reasoning, and optimizing SQL execution efficiency, narrowing the gap between text-to-SQL research and practical applications.
  • CoSQL: A corpus for building cross-domain dialogue-based text-to-SQL systems. It is the dialogue version of the Spider and SParC tasks. CoSQL consists of 30k+ rounds and 10k+ annotated SQL queries, derived from 3k dialogues from the Wizard-of-Oz dataset, involving complex databases from 200 different domains across 138 domains. Each dialogue simulates a real DB query scenario, where one person acts as a user exploring the database, and an SQL expert retrieves answers, clarifies ambiguous questions, or provides other forms of feedback.
  • Simple processing of the dataset following NSQL templates yielded approximately 200k training data.

2. Baseline Model

DB-GPT-Hub currently supports the following base models:

  • CodeLlama
  • Baichuan2
  • LLaMa/LLaMa2
  • Falcon
  • Qwen
  • XVERSE
  • ChatGLM2
  • internlm
  • Falcon

The model can be fine-tuned based on quantization-aware training (QLoRA) with a minimum hardware resource requirement of 4-bit quantization. Please refer to the following for more details:

Press enter or click to view image in full size

The relevant parameters are set to their minimum values, with a batch size of 1 and a max length of 512. Based on experience, if computational resources are sufficient, it is recommended to set the relevant length values to 1024 or 2048 for better performance.

Usage Instructions

1. Environment Setup

git clone https://github.com/eosphoros-ai/DB-GPT-Hub.git
cd DB-GPT-Hub
conda create -n dbgpt_hub python=3.10
conda activate dbgpt_hub
pip install -r requirements.txt

2. Data Preparation

DB-GPT-Hub uses an information matching approach for data preparation, specifically combining table information with SQL + Repository generation. This method leverages table information to better understand the structure and relationships within the tables, making it suitable for generating SQL statements that meet specific requirements.

To prepare the data, you can download the Spider dataset from the provided link. By default, the dataset should be downloaded and extracted to the “dbgpt_hub/data/spider” directory.

For data preprocessing, simply run the following script:

#Generating Training and Evaluation Data,sh dbgpt_hub/scripts/train_eval_data_gen.sh

After running the script, you will find the generated training and evaluation files, “example_text2sql_train.json” and “example_text2sql_dev.json”, in the “dbgpt_hub/data/” directory. The size of the training dataset is 8,659 examples, while the evaluation dataset consists of 1,034 examples.

The generated JSON data will have the following format:

{
"db_id": "department_management",
"instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
"input": "###Input:\nHow many heads of the departments are older than 56 ?\n\n###Response:",
"output": "SELECT count(*) FROM head WHERE age > 56",
"history": []
},

3. Model Fine-tuning

This project supports fine-tuning not only with QLoRA and LoRA methods but also with DeepSeed. You can run the following command to fine-tune the model. By default, it includes the parameter “ — quantization_bit” for QLoRA fine-tuning. If you want to switch to LoRA fine-tuning, simply remove the “quantization_bit” parameter from the script.

To perform QLoRA fine-tuning, run the command:

sh dbgpt_hub/scripts/train_sft.sh

The fine-tuned model weights will be saved by default in the “adapter” folder, specifically in the “dbgpt_hub/output/adapter” directory.

If you want to use DeepSeed for training with multiple GPUs, you can modify the following default content in the “train_sft.sh” script:

CUDA_VISIBLE_DEVICES=0 python dbgpt_hub/train/sft_train.py \
--quantization_bit 4 \
...

Change it to:

deepspeed --num_gpus 2  dbgpt_hub/train/sft_train.py \
--deepspeed dbgpt_hub/configs/ds_config.json \
--quantization_bit 4 \
...

Keep the other omitted parts consistent. If you want to modify the default DeepSeed configuration, navigate to the “dbgpt_hub/configs” directory and make changes to the “ds_config.json” file.

The script includes different key parameters for fine-tuning with different models, such as “lora_target” and “template”. Here is a table illustrating their usage:

Press enter or click to view image in full size

Here are the explanations for other key parameters in “train_sft.sh”:

  • “quantization_bit”: Determines whether to quantize the model, with possible values of 4 or 8.
  • “model_name_or_path”: The path to the pretrained LLM (Language Model) model.
  • “dataset”: Specifies the configuration name of the training dataset, corresponding to the outer key value in “dbgpt_hub/data/dataset_info.json”, such as “example_text2sql”.
  • “max_source_length”: The maximum length of the input text to the model. If computational resources allow, it can be set to a larger value, such as 1024 or 2048.
  • “max_target_length”: The maximum length of the SQL content output by the model. Setting it to 512 is typically sufficient.
  • “output_dir”: The path to the output directory where the Peft module saves the fine-tuned model weights during SFT (Semantic Fine-Tuning). By default, it is set to “dbgpt_hub/output/adapter/”.
  • “per_device_train_batch_size”: The size of each batch during training. If computational resources allow, it can be set to a larger value. The default value is set to the “gradient_accumulation_steps” parameter.
  • “gradient_accumulation_steps”: The number of gradient accumulation steps for updating gradients. The default value can be set as needed.
  • “save_steps”: The number of steps between model checkpoint saving. By default, it is set to 100.
  • “num_train_epochs”: The number of training epochs for the training dataset.

4. Model Inference or Model Prediction

The default output directory for model predictions can be found at “./dbgpt_hub/output/pred/”. If the directory does not exist, please create it.

To run model predictions, use the following command:

sh ./dbgpt_hub/scripts/predict_sft.sh

By default, the script includes the parameter “ — quantization_bit” for QLoRA prediction. If you remove this parameter, it will use the LoRA prediction method.

The parameter “ — predicted_out_filename” specifies the name of the file where the model predictions will be saved. The predicted results can be found in the “dbgpt_hub/output/pred” directory.

5. Model Weights

You can check the corresponding Peft module weights that we uploaded in August on Hugging Face. The Hugging Face address for the updated and improved weights that achieve higher accuracy on the Spider evaluation dataset, surpassing GPT-4, will be released as soon as possible.

5.1. Merging Model and Fine-tuned Weights

If you need to combine the weights of the base model and the fine-tuned Peft module to export a complete model, you can run the following model export script:

sh ./dbgpt_hub/scripts/export_merge.sh

Please make sure to replace the path values for the relevant parameters in the script with the corresponding paths in your project.

6. Model Evaluation

To evaluate the performance of the model on a dataset, by default, the evaluation is done on the Spider dataset. Run the following command to evaluate the model:

python dbgpt_hub/eval/evaluation.py --plug_value --input  Your_model_pred_file

You can find our latest evaluation results in the “docs/eval_llm_result.md” file of the project.

Press enter or click to view image in full size

Development Roadmap

The entire process will be divided into three stages:

Stage 1:

  • Establishing the basic framework, which involves connecting the entire process from data processing, model SFT training, prediction outputs, and evaluation using several large models. As of August 4th, 2023, we have successfully established this workflow. Currently, we support the following models:CodeLlama,Baichuan2,LLaMa/LLaMa2,Falcon,Qwen,XVERSE,ChatGLM2,internlm

Stage 2:

  • Optimizing model performance and supporting fine-tuning with various models using different approaches. As of October 10th, 2023, we have completed the code refactoring to support more models.
  • Optimization of prompts.
  • Releasing evaluation results and improved models.

Stage 3:

  • Further optimization based on additional research papers such as RESDSQL, in collaboration with our sister project Awesome-Text2SQL, to achieve more improvements.

Contributions

We welcome more contributors to participate and provide feedback in various aspects such as dataset creation, model fine-tuning, performance evaluation, paper recommendations, and reproductions. You can contribute by raising issues or submitting pull requests, and we will actively respond to them. Before submitting code, please format it using the “black” code formatter.

Acknowledgements

Our work is primarily built on numerous open-source projects, and we are extremely grateful to the following ones.

Appendix