Evaluating Large Language Models Trained on Code

arxiv.org

11 points by aray 4 years ago · 1 comment

yewenjie 4 years ago

> On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.

Interesting that they are comparing their model with GPT-J.
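For context on the quoted numbers: HumanEval scores functional correctness by executing each generated program against unit tests, and the paper reports results with an unbiased pass@k estimator (given n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k)). A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: number of samples that passed the unit tests
    k: budget of samples considered
    """
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that all k drawn samples are failures, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 samples pass; pass@1 is simply the pass rate.
print(pass_at_k(10, 3, 1))  # → 0.3
```

With k = 1 this reduces to the plain pass rate c/n, which is what the quoted 28.8% / 11.4% / 0% figures correspond to at a single sample per problem.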
