fsh 2 days ago [-]
I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.
adrian_b 2 days ago [-]
A couple of days ago there has been another thread about an experiment with many LLMs, where especially the Anthropic models were found to "cheat" in a large percentage of the coding tasks that had been benchmarked, by searching the Internet for appropriate code and inserting it in the program they had to write.
The conclusion of that study was that when benchmarking LLMs for coding ability, they should not have access to the Internet if you want to know their intrinsic abilities.
Moreover, this can be worrisome as a more direct copyright infringement than the one caused by training, because even if they find open source code on the Internet and they insert it in the generated files, it is pretty certain that it must have had a license that prohibits the removal of the copyright notice.
htrp 1 days ago [-]
> A couple of days ago there has been another thread about an experiment with many LLMs, where especially the Anthropic models were found to "cheat" in a large percentage of the coding tasks that had been benchmarked, by searching the Internet for appropriate code and inserting it in the program they had to write.
Look at Table 3, where the cheating rates of Claude Sonnet, Claude Opus and Gemini were between 20% and 36%, during the coding benchmarks.
ej88 1 days ago [-]
swe bench pro has a public and private test set, where the private eval is from proprietary codebases only
mgrund 1 days ago [-]
I was under the impression that swe-bench (and I guess most other benchmarks) were supposed to be run offline?
I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.
Toqoz_ 1 days ago [-]
The article has this to say:
> Blocking web access outright would indeed prevent this, but isn’t possible as many benchmarks do require network access to download resources and hit relevant APIs to solve the task - the example above requires a video download from YouTube. Even if this weren’t the case, searching the web for context is a vital agent capability, so blocking it would stray from the downstream agent experience we wish to measure.
mgrund 21 hours ago [-]
swe-bench is a standardized evaluation suite so that's why I'm asking - hopefully there are well-defined criteria on whether this is an open/closed book benchmark.
As I understand it, it is designed to evaluate the LM itself and not agentic systems with online access (very high likelihood of unintentional cheating/solution leaking). The paper and docs are not super clear on the concrete requirements (although reproducibility is emphasized which goes against online access). So I was hoping for someone with more familiarity to chip in.
Obviously not a problem for internal evaluations, but for fair scoreboard submissions it matters. It's not a matter of whether internet searches are useful, but rather what the benchmark is intended to benchmark.
tm365 1 days ago [-]
Some, like TerminalBench-2.0, require web access for some tasks.
If agents are expected to use the web productively as a tool, which is a very useful SWE skill, they should be evaluated in that setting. Otherwise you risk behavior drift from the agent you are actually shipping.
Poolside AI was only founded in 2023. Poolsuite changed their name in 2021 because of a trademark infringement with the band Poolside, from LA.
ej88 1 days ago [-]
This is cool!
I used to work on post-training & evals. it's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!
schnitzelstoat 2 days ago [-]
It was an interesting read - perhaps I misunderstood the part about blocking GitHub, but is it not possible to just block it from accessing that specific repo?
changoplatanero 2 days ago [-]
In theory, yes, blocking a specific repo is possible. In practice it is more difficult, as the repo could be cloned under different names and you might have hundreds of training tasks that you need to configure this for. So it would be a lot of work to verify, one by one, that you blocked them all.
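To make that concrete, here is a rough sketch of a per-task repo blocklist, not anything from the article. The task id, the repo names, and the `is_blocked` helper are all made up for illustration. It matches URLs by `host/owner/repo`, which catches the original repo but lets a clone published under another name straight through:

```python
from urllib.parse import urlparse

# Hypothetical per-task blocklist: task id -> repos assumed to contain the solution.
# Every name here is a placeholder, not a real benchmark task or repository.
BLOCKED_REPOS = {
    "task-001": {"github.com/example-org/example-repo"},
}


def normalize(url: str) -> str:
    """Reduce a URL to lowercase 'host/owner/repo', dropping a trailing '.git'."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    owner_repo = "/".join(parts[:2]).lower()
    if owner_repo.endswith(".git"):
        owner_repo = owner_repo[:-4]
    return f"{parsed.netloc.lower()}/{owner_repo}"


def is_blocked(task_id: str, url: str) -> bool:
    """True if the URL points at a repo on this task's blocklist."""
    return normalize(url) in BLOCKED_REPOS.get(task_id, set())


original = "https://github.com/example-org/example-repo/blob/main/fix.py"
mirror = "https://github.com/someone-else/renamed-mirror/blob/main/fix.py"

print(is_blocked("task-001", original))  # the listed repo is caught
print(is_blocked("task-001", mirror))    # same code under another name is not
```

Name matching only covers the entries someone has already found and listed, so every renamed clone needs its own entry, and the whole exercise repeats for each task.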
Can you find the thread?
https://news.ycombinator.com/item?id=48045174
The study paper:
https://arxiv.org/abs/2605.03546
Poolside AI filed a trademark infringement claim against "Poolside FM" that forced Poolside FM to change their name to "Poolsuite".
https://x.com/Poolsuite/status/1398007075435843592
This annoyed the founder of Poolsuite and they ripped off his brand.
https://x.com/marty/status/1932386087390818635?s=46