ProgramBench: Can Language Models Rebuild Programs from Scratch?
github.com
I didn't manage to find the tests. How can we know that the tests are actually reasonable in this case?