We tasked 9 browser agents to shop on Amazon, only 2 picked the right product
flowtester.ai

We ran the exact same Amazon shopping task with 9 leading AI models in the browser. Same site, same steps, same environment. Only the model changed. A few things stood out:
1. Fastest model: 70 seconds
2. Slowest model: 340 seconds
3. Cost range: $0.03 to $1.04
4. Only 2 of 9 models picked the right product!
Hm... They all got the right product, the "cheapest result". You didn't specify the cheapest laptop.
Arguably, the ones that got the laptop assumed you wanted a laptop and went against your instructions.
I see where you're coming from, but humans do tend to phrase themselves that way, and the intentions are understood. More importantly, the last step is:
"6. Navigate to the cart page and validate the laptop you chose is in the cart."
So one could argue inferring this is trivial.
Why would you need powerful models if you give them such mechanical, stifling instructions?
I think the results would be much better if you told them exactly what you want in plain text.
I wanted to make sure "thinking" and "planning" features are not being tested in this comparison, but I definitely tested "simply phrased" tasks as well: https://www.flowtester.ai/shared/ce1c8ef9-f387-48be-93f0-938...