We tasked 9 browser agents to shop on Amazon, only 2 picked the right product
flowtester.ai

We ran the exact same Amazon shopping task with 9 leading AI models in the browser. Same site, same steps, same environment. Only the model changed. A few things stood out:
1. Fastest model: 70 seconds
2. Slowest model: 340 seconds
3. Cost range: $0.03 to $1.04
4. Only 2 of 9 models picked the right product!
Hm... They all got the right product, the "cheapest result". You didn't specify the cheapest laptop.
Arguably, the ones that got the laptop assumed you wanted a laptop and went against your instructions.
I see where you're coming from, but humans do tend to phrase themselves that way, and the intentions are understood. More importantly, the last step is:
"6. Navigate to the cart page and validate the laptop you chose is in the cart."
So one could argue inferring this is trivial.
Why would you need powerful models if you give them such mechanical, stifling instructions?
I think the results would be much better if you told them exactly what you want in plain text.
I wanted to make sure "thinking" and "planning" features are not being tested in this comparison, but I definitely tested "simply phrased" tasks as well: https://www.flowtester.ai/shared/ce1c8ef9-f387-48be-93f0-938...