Show HN: Implementation of the "Self-Rewarding Language Models" Paper by Meta AI
Hey all,
After we read the Self-Rewarding Language Models paper by the team at Meta, the method seemed approachable and reproducible, so we spent some time implementing it.
The scripts provided take any base model and put it through the following loop (a minimal Python sketch follows the list):
1) Supervised fine-tuning on an initial dataset
2) Generating new prompts with the SFT model
3) Generating N responses per prompt
4) Scoring each generated response from 1 to 5, with the model acting as its own judge
5) Running DPO on preference pairs built from the model's own scores
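Roughly, one iteration looks like the sketch below. The helper names (sft_train, generate_prompts, generate_responses, score_response, dpo_train) are hypothetical stand-ins for the repo's scripts, not its actual API; the stubs exist only to make the sketch runnable.

    import random

    # Placeholder stubs for the five stages; swap in real implementations.
    def sft_train(model, dataset):                # 1) supervised fine-tuning
        return model

    def generate_prompts(model, n=8):             # 2) prompt self-generation
        return [f"prompt-{i}" for i in range(n)]

    def generate_responses(model, prompt, n):     # 3) sample N candidate responses
        return [f"{prompt}/response-{i}" for i in range(n)]

    def score_response(model, prompt, response):  # 4) LLM-as-a-Judge score, 1-5
        return random.uniform(1, 5)

    def dpo_train(model, preference_pairs):       # 5) one DPO training step
        return model

    def self_reward_iteration(model, n_responses=4, min_gap=1.0):
        pairs = []
        for prompt in generate_prompts(model):
            responses = generate_responses(model, prompt, n_responses)
            scored = sorted((score_response(model, prompt, r), r) for r in responses)
            (low, worst), (high, best) = scored[0], scored[-1]
            # Keep only pairs where the judge clearly prefers one response.
            if high - low >= min_gap:
                pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
        return dpo_train(model, pairs)

    model = sft_train("base-model", dataset=[])   # seed with the initial SFT data
    for _ in range(2):                            # run the loop twice
        model = self_reward_iteration(model)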
We've run it through one loop starting with a Mistral-7B base model, and the results are pretty encouraging so far.
Feel free to check it out or run it for yourself and let us know what you think:
What specs did you use? Is 32GB of memory enough?
We used an A10 with 24GB of VRAM, which was enough for PEFT on Mistral-7B.
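For anyone sizing hardware: one way 24GB fits Mistral-7B is to quantize the base weights and train LoRA adapters. A minimal setup of that kind using Hugging Face transformers and peft; the hyperparameters are illustrative, not necessarily what the repo uses:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load the 7B base model with 4-bit quantized weights to fit in 24GB.
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        quantization_config=bnb,
        device_map="auto",
    )

    # Train small LoRA adapters on the attention projections instead of
    # all 7B parameters; r and alpha here are illustrative defaults.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% trainable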
What's the goal of a self-rewarding LLM?
The goal is for the model to iteratively create training data and add it to its own training set. The LLM acts as its own judge and scores its own responses to decide whether to keep the data. It's expensive to have a human in the loop labeling preferences, so the folks at Meta showed you can use a clever judging prompt and fine-tune the model to score its own responses; a condensed sketch of that kind of prompt follows.
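For flavor, the paper's LLM-as-a-Judge prompt uses an additive 5-point scoring scheme. The template below is a condensed paraphrase, not the paper's verbatim wording, and parse_score is a hypothetical helper:

    import re

    # Condensed paraphrase of an additive 5-point judging prompt; the
    # exact wording in the paper differs.
    JUDGE_TEMPLATE = """Review the user's question and the response below
    using an additive 5-point scoring system. Award one point for each
    criterion the response meets: it is relevant; it addresses a
    substantial portion of the question; it usefully answers the basic
    elements; it is clearly written from an AI assistant's perspective;
    it is an impeccably tailored, expert-quality answer.

    Question: {prompt}
    Response: {response}

    Conclude with the line: Score: <total points>"""

    def parse_score(judge_output: str):
        """Pull the integer score out of the judge's reply, or None."""
        match = re.search(r"Score:\s*([0-5])", judge_output)
        return int(match.group(1)) if match else None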