Using GPT-4 Vision with Vimium to browse the web

437 points by wvoch235 2 years ago · 132 comments

Reader

e12e 2 years ago

It's insane that this is now possible:

https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...

> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."

Maxion 2 years ago

The speed at which this is moving at is mind boggling. This may become crazier than the dot.com boom.
- pms 2 years ago
  
  Until you realize that it doesn't work well with less popular videos (any items really), because "Large Language Models Struggle to Learn Long-Tail Knowledge" [1].
  [1] https://proceedings.mlr.press/v202/kandpal23a.html
  - heroprotagonist 2 years ago
    
    Except in this case, the knowledge is 'how to search the web for X" instead of 'an understanding or familiarity with X'.

transistorfan 2 years ago

At my work there are a large contingent of people who essentially do manual data copying between legacy programs (govt), because the tech debt is so large that we can't figure out a way to plug these things together. Excited for tools like this to eventually act as a layer that can run over these sort of problems, as bizarre a solution as it is from a compute perspective

yreg 2 years ago

A long, long time ago I worked on a small project for a major multinational grocery chain.
I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.
I was curious, so I asked how they are doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one of them said MS EXCEL and the other said INTERNET EXPLORER. Then the person opened these apps, carefully positioned both windows exactly into those rectangles and ran some auto-clicker - the kind cheaters would use in RuneScape – which moved the cursor and copied and pasted the values from the Excel into the various forms on the website.
Amazing.
- Valgrim 2 years ago
  
  I worked with a client who used a multi-millon dollar system for moving goods automatically into packaging stations. The system was built and maintained by a major european company. All the data was transfered automatically between systems normally, but one day, for some reason, there was an internal communication error inside the machine which caused a lot of packages to be sent without being recorded as such.
  Now normally we would just have contacted the company and asked them for a data extraction so we could cross-reference the data. But since it wasn't clear who was at fault, and we knew it would take weeks for that extraction, we looked for an internal solution first.
  Now there was a subsystem in the machine that worked only in Internet Explorer, with an old authentication scheme, that we could use to see the information we needed, so I, being the only person in the team without formal analysis training but having made my way there from a clerk job, knew exactly what to do.
  I fired up the old IE, Excel, wrote in 5 minutes a VBA script that did exactly what you described, click there copy that etc, and 30 minutes later we had our extraction, and resolved the issue completely before the packages were even shipped.
  All hail Excel.
  - mst 2 years ago
    
    For all its flaws as a programming language, VBA made an excellent bodging language and I salute your expedient field hack.
- kspacewalk2 2 years ago
  
  I wonder if it used something like AutoIt[0]. I remember using it at one of my more boring co-op jobs about 20 years ago to automate moving data between a spreadsheet and some obscure database product.
  [0] https://en.wikipedia.org/wiki/AutoIt
bboygravity 2 years ago

Funny that you and others on here don't seem to realize that literally everybody who uses the internet has the exact same data entry problem all the time. Blame it on "old software", but how about the entire internet?
copying (or in most cases even worse: re-typing) form data from one location on the screen into yet another webform.
Username, password, email address, physical address, credit card info etc etc.
Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling just username and pw is too much to ask.
It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.
I would pay a montly fee for any software that solves this once and for all and it sounds like it's coming (and I'm already paying their monthly fee).
- TeMPOraL 2 years ago
  
  > It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.
  Simple: it's because not solving this problem is how our godawful industry makes most of its money. Empowering the user means relinquishing control over their "journey"[0]. Ergonomics means fewer opportunities to upsell or show ads.
  I don't have the link handy, but I'm reminded of one of the earliest Windows user interface guidelines documents, back from Windows 95/98 era, which, in a section about theming/visual style, already recognized that they have to allow for full flexibility, because vendors will insist on fucking the experience up for the sake of branding anyway, and resisting it is futile[1].
  --
  [0] - I'm trying really hard to hold back my contempt towards terms like this, and the whole salesy way of viewing human-computer interactions.
  [1] - They put it in much more polite terms, but the feeling of helplessness was already there.
  - itronitron 2 years ago
    
    >> because vendors will insist on fucking the experience up for the sake of branding anyway
    I see that you too have at some point installed printer driver software.
  - musha68k 2 years ago
    
    Ted Nelson’s “intertwingularity” isn’t far off from the data entry problem described. He argues for universal data access where duplication is obsolete. Imagine form data as a single, linkable object across the web, editable in one place, reflected everywhere—no re-typing, just seamless auto-fill. That’s the unrealized potential of hypertext.
- anonzzzies 2 years ago
  
  Yeah, my dream would be using this to scrape pages, pop the content into my provide db, serving it up in my own format (which is going to be a white page with letters with inline images and videos that are not ads. And my interactions fed back to the vision model to post in the original. So I never have to see a ‘design’ (heavy js riddled unreadable crap) again in my life. And so I can, with my own tooling, browse and reuse my history including content instead rely on all the broken stuff bolted on the web.
- williamcotton 2 years ago
  
  Bash pipes? The free flow of information through composable tools.
  The commercial web? Not the above.
  This is just a baseline. I’m sure that an LLM can help the issue but the biggest problem is that these varied HTTP-with-datastores are islands passing messages in bottles back and forth while a bash pipeline is akin to fiber optics.
- fragmede 2 years ago
  
  consistently filling out username and password is all I wanted from my password manager, but it turns out it handles credit card number and other bits of information for me as well.
  - arkitaip 2 years ago
    
    I've used Bitwarden to faster fill out job applications.
  - mewpmewp2 2 years ago
    
    Doesn't chrome out of the box handle all of that?
- pseudosaid 2 years ago
  
  use a password manager. i havent copy pasted form data twice on a site in a long time
- loud_cloud 2 years ago
  
  FTL. See NiagraFiles.
haswell 2 years ago

The industry buzzword is "Robotic Process Automation", which as a category of products has been focused on using various forms of ML/AI to glue these things together in a common/structured way (in addition to good old fashioned screen scraping).
Up this this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.
- keepamovin 2 years ago
  
  I totally agree on all points, especially around what AI means for this.
  I'm kind of in a happy accident situation because I was working on something for RPA, which then became a layer that was factored as its own product, but now might be able to come full circle as a result of AI.
  Essentially this layer can function as a "delivery medium" for RPA agent creation, that you can use on any device without download. However, as it has many others uses I've been working on those, but I've been seeking a great reason to get back into RPA.
  I have a cool idea to leverage human-guided AI creation of data maps and action tours for RPA, but similar to what you say, unless great care is taken you can end up with a brittle approach. Also, as the market has been quite saturated many reasonable approaches, I just haven't felt compelled.
  Yet now I think the possible merging of GPT level AIs with browser instrumentation to deliver an augmented way to browse the web makes that incredibly compelling.
  So I'm incredibly thrilled that I have this happy accident of BrowserBox^0 (the factored out layer originally from RPA work above) which provides a pluggable/iframe-emebeddable interface for remotely controlling a headless browser. So now I want to look at unifying BrowserBox with this kind of GPT driven exploration.
  It's even cooler, because, as BB enables co-browsing by default (multiplayer browsing) and turns the browser into a "client-server" architecture, I can see plugging in GPT-4V as a connecting client with some kind of minimal API affordance for it to use would, like the very cool vimium keyboard-enabled browsing in the OP, would be such interesting project to try!
  We're open source so if you want to check us out or get involved in this quest, come say hi, maybe get involved if you're game!
  0: https://github.com/BrowserBox/BrowserBox
  - jimmySixDOF 2 years ago
    
    I have watched your project for a while as a possible option for embedded browsers for XR applications like WebXR but the high licensing cost was a factor and solutions like Hyperbeam or Vueplex in Unity have been possible. Defiantly agree that multimodal LLM integration is a huge opportunity and multiplayer browsing with AI in realtime is a super cool idea if you package it right.
    
    keepamovin 2 years ago
    
    Hi jimmySixDOF thank you for the kind words and the attention on our project! :)
    Regarding pricing we have heard that feedback over time and gradually adjusted our licensing costs. It should now be much more affordable as it is targeted towards large deployments, with decreasing cost and increasing value at scale.
    If you'd like to send an email with any thoughts on our current prices on https://dosyago.com to cris@dosyago.com I'd highly value it!
    Your idea of WebXR and embedding within Unity is very interesting, and I think it could be a fit.
- leovander 2 years ago
  
  In the OP's specific instance when would you reach out for a traditional ETL tool vs an RPA solution?
  - teaearlgraycold 2 years ago
    
    RPA is for data sources and destinations that are meant for human consumption and entry. So you’d use RPA to take an image of a table and enter every row into a web form.
  - transistorfan 2 years ago
    
    How much does the involvement of a bank of fax machines complicate things?
    
    Roark66 2 years ago
    
    A little perhaps, but not much. One can replace a bank of physical fax machines with modems.
    It would be an interesting job for sure. Why wasn't it done before? I can imagine only two reasons. One, there isn't that much data to move and it makes no sense to build software for what few people spend 30min per day on. Two, the data in the legacy system is images and people are not just moving it between systems, but they also do categorisation, verification etc. In which case an AI model may be useful, but almost always hard coded rules will be faster.
Roark66 2 years ago

Whenever I hear about such a thing (people doing legacy system data extraction manually) I wonder if in every case someone got the estimate for the "proper" solution and just decided a bunch of people typing is cheaper?
Integrating things like Chatgpt will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use chatgpt for it".
- spaceman_2020 2 years ago
  
  If the market forces work as they’re supposed to (not a given anymore), then corporations that adopt better tech will see better profits through lower expenses. And then the laggards will have to adapt or die.
  Also remember that this is essentially v1 of the software- the Windows 95 of this adoption cycle
aikinai 2 years ago

I remember years ago thinking it was weird in Ghost in the Shell when a robot had fingers on its fingers to type really fast. Maybe that really won’t happen since they can plug into USB at least, but they will probably use the screen and keyboard input sometimes at least.
- nomel 2 years ago
  
  Why would a keyboard be required? I think the intent to hit a letter would more easily be sent over a bluetooth HID "device". ;)
- yjftsjthsd-h 2 years ago
  
  USB is an attack vector; if it's not exploiting your USB driver it's connecting your data pins to mains power. Keyboards are an air gap.
  - simbolit 2 years ago
    
    Isn't the keyboard connected to the computer via USB?
    If I have access to the keyboard, I have access to a USB cable plugged into the computer, right?
    Perhaps I misunderstand something....
    
    yjftsjthsd-h 2 years ago
    
    I meant the reverse; the computer attacking the robot using it
    
    simbolit 2 years ago
    
    Uhhhhh, thanks. That makes a lot of sense!
- pixl97 2 years ago
  
  The issue with USB is you have to have power protection circuits. Analog interface at least in the show appeared much harder to hack.
hubraumhugo 2 years ago

I believe that LLMs will automate most of our data entry/copy/transformation work. 80% of the world's data is unstructured and scattered across formats like HTML, PDFs, or images that are hard to access and analyze. Multimodal models can now tap into that data without having to rely on complex OCR technologies or expensive tooling.
If you go to platforms like Upwork, there are thousands of VAs in low-cost labor countries that do nothing else than manual data entry work. IMO that's a complete waste of human capital and I've made it my personal mission to automate such tedious and un-creative data work with https://kadoa.com.
- kristopolous 2 years ago
  
  I was thinking what the payoff would be to pose as human for these terrible pay click jobs and then assign them to an LLM en masse. There's an arbitrage there ... it may be a good strategy.
  I heard recently "click-work" works out to about $4/hr* If you could do that x50, passively, it's a fine income.
  * - see https://journals.sagepub.com/doi/full/10.1177/14614448231183... or listen to https://kpfa.org/episode/against-the-grain-october-30-2023/ ... it's a fascinating study. Terrible pay (way below minimum wage) but surprisingly high worker satisfaction. The users seem to view it as entertainment essentially categorizing it as casual gaming.
  The "asshole innovator" in me wonders if one could simply make it more entertaining and forego paying the user entirely.
  - hubraumhugo 2 years ago
    
    Interesting. Instead of doing the click work manually, microworkers will just instruct and guide multiple GPTs.
    
    kristopolous 2 years ago
    
    maybe. A lot of modern clickwork is actually model training and there is a model-collapse phenomena (https://arxiv.org/abs/2305.17493) which means that it should be banned for such work. I bet a number of clever people on the platforms are already trying to instrument AI to do the work regardless - it's pretty close to "free money" if you can pull it off and not get caught and at a spigot size where there's no real serious consequences if you do.
  - ishan0102 2 years ago
    
    Yeah this seems easy to build but would rather work on making tools that improve accessibility 10x
- ishan0102 2 years ago
  
  Yup, that's my long term goal. I want an "anything API" that brings structure to anything on the web.
morkalork 2 years ago

Kinda sci-fi, we're so close to a future where when/if original source code is lost, a mainframe runs in an emulator and the human operating it is also emulated.
FooBarWidget 2 years ago

It's bizarre computationally, but at this point maybe we have to compare it to the alternative: hiring a person. At least the AI only consumes electricity (which is hopefully green), while a person consumes food (grown with mined fertilizers), or meat (which we know is really bad for the environment).
specialist 2 years ago

> a large contingent of people who essentially do manual data copying
Yup.
I was briefly part of a decades long effort to migrate off a main frame backend. It was basically a very expensive shared flat file database (eg FileMaker Pro). Used by thousands of applications, neither inventoried or managed. Surely a handful were critical for daily operations, but no one remembered which ones.
And the source data (quality) was filthy.
I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.
No one was amused.
--
I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.
No one was amused.
Lastly, I suggested we throttle access to every unidentified mainframe client. Progressively making it slower over time. Surely we'd hear about anything critical breaking.
That suggestion flew like a lead zeppelin.
alexirobbins 2 years ago

Working on this layer at https://autotab.com. This sounds like an amazing problem for browser automation to solve, would love to talk with you if you’re interested!
abrichr 2 years ago

This type of use case is exactly why are building https://github.com/OpenAdaptAI/OpenAdapt
Garlef 2 years ago

"Chinese Room Automation"
monkeydust 2 years ago

This has been fruitful ground for RPA offerings like UIPath and Automation Anywhere. Multi-model LLMs open up chance to disrupt them
gumballindie 2 years ago

Wow. Leaking confidential tax payer data.
- transistorfan 2 years ago
  
  I should have been clearer, it's between two apps that we host internally - applications on our own intranet cannot talk to each other. If you want to get any data out of either of these apps to the world, you need to do a manual export and email/usb which would obviously flag
  - gumballindie 2 years ago
    
    Correct, but chat gpt reads screen data to be able to "click" around. So you would need to expose at least data that is displayed on screen to this external product.

lachlan_gray 2 years ago

I think vim is unintentionally a great “embodiment” for chatgpt. There’s nothing that can’t be done with a stream of text, and the internet is full of vimscript already

I started a similar experiment if anyone else is thinking along the same lines :)

https://github.com/LachlanGray/vim-agent

gsuuon 2 years ago

This is a neat idea!

ishan0102 2 years ago

Hey! Creator here, thanks for sharing! Let me know if anyone has questions and feel free to contribute, I've left some potential next steps in the README.

celeste_lan 2 years ago

Omg I also just released something pretty similar earlier today https://github.com/Jiayi-Pan/GPT-V-on-Web. But it received little attention.
- ishan0102 2 years ago
  
  Woah looks great, not surprised that multiple people thought of this! Your prompt looks much better than mine, I'm not really taking advantage of any of the default Vimium shortcuts.
jimmySixDOF 2 years ago

Nice. I know Open Interpreter are trying to get Selenium automated to natural language control and quite a few other projects are also popping up on HN lately. The vimium approach is a lot lighter so looks promising. One way or another the as-published world wide web is turning into its own dynamic API overlay server. Ingest all the Sources!
jgalentine007 2 years ago

Very cool use for Vimium, I like the approach!
- ishan0102 2 years ago
  
  Thank you!
squeegmeister 2 years ago

How does this differ from how ChatGPT currently browses the web?
poulpy123 2 years ago

could it be used to make a bot that visit and parse websites to extrat relevant information without writing a parser for each websites ?
roland35 2 years ago

what terminal are you using???
- ishan0102 2 years ago
  
  Warp! (warp.dev)

maccam912 2 years ago

I've been playing with a similar idea of screenshots and actions from GPT-4 Vision for browsing, but after trying and failing to overlay info in the screenshot, I ended up just getting the accessibility tree from playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better, I see the creator is here and has a list of future ideas, maybe add this to the list if you think its a good idea?

ishan0102 2 years ago

Cool that’s a solid idea, I was trying to only use visual data but this could make the agent a lot more powerful, I’ll try this really soon
manmal 2 years ago

Probably better to capture all the content and not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.
- arbuge 2 years ago
  
  Better watch token costs. The per token costs are lower now but even so a full context load still costs almost $4.

mackross 2 years ago

Been playing with this through the ChatGPT interface for the past few weeks. Couple of tips. Update the css to get rid of the gradients and rounded corners. I found red with bold white text to be most consistent. Increase the font size. If two labels overlap, push them apart and add an arrow to the element. Send both images to the API, a version with the annotations added and a version without.

karmasimida 2 years ago

We can create an autopilot for browser.

It is going to incredibly difficult moving forward to distinguish bot traffic, if this is deployed at scale.

The problem I see is this isn't going to be cheap or even affordable in short term.

ishan0102 2 years ago

I think costs can come down if you finetune open source models like llava or cogvlm. This demo also cost about 6 cents so it's not insanely expensive either, especially with clever prompting.

reqo 2 years ago

How will tools like this affect web tracking or generally advertisements on the internet? Imagine you could have an agent browse the web for you and fetch exactly what you are seraching for without you seeing any ads/pop ups or being tracked along the way! Could be a great ”ad blocker”! Could it perhaps also make SEO useless and thus improve the quality of internet? But I wonder if it also could have negative effects such as the ads being “interweaved” into the fetch content somehow!

famouswaffles 2 years ago

Since this is sending screenshots of pages to GPT, won't it see the ads as well?

FooBarWidget 2 years ago

Many Dutch companies pay salaries by

1. receiving payslips from the accountant, and then

2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then

3. manually initiating a bank transfer to the tax authority to pay the withholded salary taxes.

This is completely useless manual labor. There should be no reason for this to be a manual procedure. And yet it's almost impossible to automate this. The accountant portal either has no API, or it has an API but lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.

So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe they can just prepare the transactions and then a person has to approve the submission.

martinald 2 years ago

I don't think this really has much to do with AI. In the UK there are solutions like Pento now which do all this, including automating payments via open banking to the user and the tax authority and automatically filing tax filings:
https://www.pento.io/la/payroll-software
nvm0n2 2 years ago

That's just a bank problem. Certainly this isn't how payroll works for large companies. Banks usually let you upload XML files that define a set of SWIFT payments, this is how I do payroll even for a small company. The accountants supply the XML file too, presumably they have an app that generates it.
is_true 2 years ago

In my country it's similar but for some data you have to upload to the government agency's site, I think it was earlier this year that they released a statement saying that people using software to perform actions on the website could get banned.
abrichr 2 years ago

Thanks for the tip!
Automating repetitive GUI workflows is the goal of https://github.com/OpenAdaptAI/OpenAdapt

snake_doc 2 years ago

Ah, very similar to Adept’s[1] concept? Though, their product seems not yet ready.

[1] https://www.adept.ai/

jatins 2 years ago

It's also a little insane to me that what Adept has been supposedly building for years with 300+ mil in funding can now be built in a day with Open AI APIs?
I think Adept pivoted along the way but original concept was very similar to this.
- abrichr 2 years ago
  
  Agreed! This is part of the motivation behind https://github.com/OpenAdaptAI/OpenAdapt
- sunshadow 2 years ago
  
  But its too expensive to become practical with the OpenAI API. Also, demo is cool until you see the real-world webpages, then you'll realize that this only works less than %50 of webpages.
  - famouswaffles 2 years ago
    
    GPT-4V may be surprisingly robust here. Set of mark prompting(which is accomplished here with Vim) improves grounding by a silly high amount. https://som-gpt4v.github.io/
amks 2 years ago

https://www.adept.ai/blog/experiments :)
ishan0102 2 years ago

Yep, took inspiration from them and a couple other startups
- QkPrsMizkYvt 2 years ago
  
  What other startups did you use for inspiration?
karmasimida 2 years ago

This is precisely the demo I am thinking.

dangerwill 2 years ago

How is this making your browsing experience any better? You still have to know what you want to do, and it is just faster to type Rick roll into youtube directly and click the links directly instead of having to type k, or vh, or whatever. You are just adding a useless chatgpt middleman between you and the browser that you likely spend all day in anyway and should be adept at navigating

circuit10 2 years ago

It's a proof of concept for how it could do more complicated tasks

bnchrch 2 years ago

Personally. This is what Im really excited about chatgpt for. Data has just become alot more free to access.

thekid314 2 years ago

I'm curious to see what it does when it sees a captcha.

ishan0102 2 years ago

From OpenAI docs[1]: "For safety reasons, we have implemented a system to block the submission of CAPTCHAs."
[1] https://platform.openai.com/docs/guides/vision
- circuit10 2 years ago
  
  There was an exploit that let you access the GPT-4 vision model months before release (and this restriction) and it could do this: https://media.discordapp.net/attachments/1020661972322230272...
- xur17 2 years ago
  
  Yeah, I've been feeding screenshots from selenium to the vision API, and when I trigger bot detection on a website, chatgpt refuses to process the image.
  - NorwegianDude 2 years ago
    
    It does solve, or at least try to solve, captchas for me. It gets like half the characters correct, it's very bad at it.
    
    circuit10 2 years ago
    
    Maybe try asking it to list what's in each square before giving the final answer

burcs 2 years ago

This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.

ctoth 2 years ago

I agree, and I think we're a year or two away from a full end-to-end trained screen reader. The ground truth from existing systems would provide great training material.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.
- supriyo-biswas 2 years ago
  
  There are open source models such as https://github.com/THUDM/CogVLM and https://github.com/haotian-liu/LLaVA.

ternaus 2 years ago

Love the idea.

It also shows that GPT-4V created a new angle in web scraping.

I guess, this or similar code would be leveraged in many projects like:

1. Scrape XXX websites, say LinkedIn or Twitter use all types of methods in the DOM to prevent it, but fighting working well GPT-4V + OCR would be ultra hard.

2. Give me an analysis of what these XXX companies are doing. And this could be done for competitors, to understand the landscape of some industry, or even plainly to get news.

Large-scale scrapping, not depending on the source code of the pages is a powerful infrastructural change.

sebastiennight 2 years ago

It took me a while to get what you meant, because... I'm not sure "XXX websites" usually means what you intended to convey here :)
- ternaus 2 years ago
  
  I feel very innocent now, as it did not even cross my mind ;)

DalasNoin 2 years ago

I tried to use it, but unfortunately it often did not add the little annotations for the different options to the screen and it got stuck in a loop. This bot works by adding a two letter combination to each clickable option, but sometimes they don't show up. It managed to sign in to twitter ones, but really quickly I burned through the 100 images api limit.

Maybe for a future version it only uses vision for difficult situations in which it gets stuck and otherwise uses the text based browser?

comment_ran 2 years ago

It's so cool. I was wondering if we can make crawler tool much easier and better. It's more similar to the "human" way to interact with a website.

ranulo 2 years ago

This could enable human language test automation scripts and could either improve my life as a QA engineer a lot or completely destroy it. Not sure yet.

sunshadow 2 years ago

You're good until this is cheaper than your salary.

jackconsidine 2 years ago

Looks extremely cool. Trying to run it though, I get stuck at "Getting actions for the given objective..." (using the example on the repo)

ishan0102 2 years ago

Huh weird, I'm getting that too. OpenAI has been having periodic outages today, think that might be why since it was working fine earlier.
- jechamt 2 years ago
  
  https://www.bleepingcomputer.com/news/security/openai-confir... News reports and their https://status.openai.com/incidents/21vl32gvx3hb incident reports indicate they are mitigating / fighting off attacks recently

silentguy 2 years ago

Usually there are a lot of comments about how text is the best interface and it's making a comeback in the LLMs but in this case picture is the better medium since parsing the webpage js would prove too difficult. I think a screenshot of a webpage has a smaller footprint than the raw payloads (js, assets, etc.).

snthpy 2 years ago

Looks cool. Unfortunately I expected this to enhance my Vimium experience but it looks like this is using Vimium to enhance GPT4, right?

silentguy 2 years ago

I think this can be extended to desktop as well. There are programs that act like vimium for your desktop (win-vind, etc.). I don't have the openai API key to try it but I wish someone gave it a try (in obviously an isolated environment).

jonathanlb 2 years ago

Hmm interesting. I'm curious what this means for accessibility and screen readers.

imranq 2 years ago

Is the vision model directly reading the screen and therefore also reading the Vimeo tags? It might be more effective to export the DOM tags and the associated elements as a Json object that is fed into chatGPT without using the vision component

dymk 2 years ago

> Currently the Vision API doesn't support JSON mode or function calling, so we have to rely on more primitive prompting methods.
- maccam912 2 years ago
  
  I found that it works well to ask it to generate JSON the best it can, then pass it to gpt-3.5-turbo with the JSON response mode and instruct it to just clean up whatever input it received.
  - ishan0102 2 years ago
    
    Perfect, I have this as a todo in my readme and I’ll implement this soon

gvv 2 years ago

Nice job! The horrors GPT-4 must endure to watch ads, truly inhumane

doctorM 2 years ago

i think this is actively dangerous. well not yet. but getting there.

i know - ai isn't meant to be sentient. but if it looks like a duck and quacks like a duck...

how do i know that the comments here aren't done by dedicated hacker news ai bots?

the potential danger could come from lack of supervision down the road.

i didn't get much sleep last night so this is less coherent than it could be.

braindead_in 2 years ago

Why not build a new browser with GPT baked in?

reustle 2 years ago

Curious, how would that differ? Assuming it is just grabbing the rendered HTML DOM after each action, isn’t it nearly the same?

owenpalmer 2 years ago

This will be fantastic for accessibility

nostrowski 2 years ago

This will be in a future history book under a chapter titled "the beginning of the end"

startages 2 years ago

There is just so much you can do with GPT-4 vision, I just hope it's more affordable.

mediumsmart 2 years ago

this is awesome and great news, nevermind that the AI found the wrong video in the demo

https://www.youtube.com/watch?v=jRyX1tC2OS0

bilekas 2 years ago

This is actually pretty interesting.. I am thinking maybe it would be faster than writing up selenium tests themselves if we could just give a few instructions.

I'm still going through the source, but really nice idea and great example of enriching the GPT with tools like vimium.

rpigab 2 years ago

This is amazing that it's possible and works, but I wonder if the electricity cost is sustainable in the long run.

For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.

I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.

I'd be glad to listen to other points of view though, maybe everything we do on computers is already bad for the environment anyway and comparing which one pollutes more is vain, idk.

Settings

Using GPT-4 Vision with Vimium to browse the web

Keyboard Shortcuts