LLMs can't justify their answers–this CLI forces them to

wheat.grainulation.com

11 points by volatilityfund 4 days ago · 21 comments

Ancalagon 4 days ago

how is everyone reviewing ai-generated tools/ai-generated websites like these? I cannot keep up with all of these

bayarearefugee 4 days ago

I just ignore it all; it's just a bunch of useless cruft that nobody needs, which people are throwing out there to seem AI-relevant.
If you need to know "Should we migrate to GraphQL?" (the example on the site) and your brain is already AI-mushed to the point where you can't deduce this yourself, just ask the model directly. It doesn't need 9 layers of slop-built bullshit stacked on top to answer this question.
- Ancalagon 3 days ago
  
  Another one on the front page today: https://news.ycombinator.com/item?id=47660262
- aid-ninja 4 days ago
  
  I broadly agree with you, but if you use Claude Code you should give this a try. The website doesn't really do it justice, but wheat solves a lot of pain points when using Claude for longer sessions.
throwaway290 4 days ago

You don't have to review them, just ask your LLM to make the same thing if you really want it ;)
- aid-ninja 4 days ago
  
  I literally just tell Claude, "here is the deepwiki, please use this tool for the next task" lol
insin 4 days ago

I only stay on this particular variant of one of the 5 site designs every LLM spits out for long enough to check that - yes - it has the annoying thing where content fades in as you scroll, before closing the tab.
- goodmythical 4 days ago
  
  They got that from humans who made websites.
  Ergo, the existence of the effect is not an effective filter for determining use of an LLM in the creation of the website.
- aid-ninja 4 days ago
  
  if you like wheat, check out farmer & orchard (both are really cool)

lmeyerov 4 days ago

Evals or GTFO

volatilityfund (OP) 3 days ago

5x productivity boost in merged PRs (lots of open PRs, and merge rate goes down, but net positive)
Starting to build custom tooling around new "friction" points in dev cycle
(eng IC perspective)
- lmeyerov 2 days ago
  
  Evals let us agree on the baseline, measurement, etc., and check whether the simple things others do perform just as well. For the same reason, instead of 'works on my box' and 'my coding style', use one of the many community evals rather than making up your own benchmark.
  That helps head off many of the unfalsifiable discussions & claims happening and moves everyone forward.
  - aid-ninja a day ago
    
    A Rust version of the compiler (that the project runs on) ran at 480k claims/sec and was able to deterministically resolve 83% of conflicts across 1 million concurrent agents (also a 393,275x compression ratio at 1M agents, input vs. output, though the compression varies by topic)
    Natively, Claude (and other LLMs) will resolve conflicting claims at about a 51% rate (based on internal research)
    The built-in Byzantine fault tolerance (again, in the compiler) is also pretty remarkable: it can correctly find the right answer even if 93% of the agents/data are malicious (with only 7% of agents/data telling us the correct information)
    Basically, the idea here is that if you want to build autonomous agents at scale, you need to be able to resolve disagreement at scale, and this project does a pretty nice job of that
    
    lmeyerov 21 hours ago
    
    My question was about claims like "5x productivity boost in merged PRs (lots of open PR & merge rate goes down, but net positive)", e.g., does this change anything on SWE-bench or any other standard coding eval?
    
    volatilityfund (OP) 19 hours ago
    
    The ecosystem is 8 tools plus a Claude Code plugin; the unlock was composing those tools (I don't regularly use all 9). The 5x claim was from /insights (Claude Code)
    Not for everyone, but it radically changed how I build. Senior engineer, 10+ years
    Now it's trivial to run multiple projects in parallel across Claude sessions (this was not really manageable before using wheat)
    Genuinely don't remember the last time I opened a file locally
    
    lmeyerov 5 hours ago
    
    It sounds like the answer is "No, there is no repeatable eval of the core AI coding productivity claim, definitely not on one of the many AI coding benchmarks in the community used for understanding & comparison, and there will not be"
