> The model was able to solve competition-style coding problems above the average human score.
I am not sure if I am thinking of the right study, but as far as I remember the pipeline included a human wading through and filtering solutions, and while a compiler may have been attached, they also scored themselves. The marketing blurb, of course, tried to make it sound as if they had actually competed.
The model generates a large number of solutions, then they filter out those that fail to compile or produce the wrong output when executed, then they cluster the survivors and select a few (fewer than 10) to submit, since they are not allowed to present too many attempts.
Ah, the paper describes a fixed method for the final selection step, and also AI-generated tests that narrow down the candidates further before that. Quite a bit better, even if the participation is still only simulated.
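For what it's worth, the generate → filter → cluster → submit pipeline described above can be sketched in a few lines. Everything below is a toy stand-in made up for illustration (the fake candidate generator, the probe inputs), not the paper's actual implementation:

```python
import random
from collections import defaultdict

def generate_candidates(n):
    # Stand-in for the model's sampling step: each "solution" is just
    # a function from input to output, and some of them are wrong.
    def make(bias):
        return lambda x: x * 2 + bias
    return [make(random.choice([0, 0, 0, 1])) for _ in range(n)]

def passes_example_tests(solution, examples):
    # Keep only candidates that compile/run and produce the right
    # output on the problem's published example tests.
    try:
        return all(solution(x) == y for x, y in examples)
    except Exception:
        return False

def cluster_by_behavior(solutions, probe_inputs):
    # Group solutions that behave identically on extra probe inputs
    # (the paper uses model-generated test inputs for this step).
    clusters = defaultdict(list)
    for s in solutions:
        key = tuple(s(x) for x in probe_inputs)
        clusters[key].append(s)
    return clusters

def select_submissions(clusters, k):
    # Submit one representative from each of the k largest clusters,
    # respecting the cap on the number of attempts.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [c[0] for c in ranked[:k]]

examples = [(1, 2), (3, 6)]  # (input, expected output) pairs
candidates = generate_candidates(1000)
survivors = [s for s in candidates if passes_example_tests(s, examples)]
submissions = select_submissions(cluster_by_behavior(survivors, [5, 7]), 10)
```

The clustering step is the interesting part: behaviorally identical solutions collapse into one cluster, so the handful of submissions covers distinct behaviors rather than near-duplicates.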