OpenAI GPT-4 vs. Groq Mistral-8x7B

serpapi.com

105 points by tanyongsheng 2 years ago · 138 comments

wruza 2 years ago

The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

  You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.

  Data to scrape: 
  title: Name of the business
  type: The business nature like Cafe, Coffee Shop, many others
  phone: The phone number of the business
  address: Address of the business, can be a state, country or a full address
  years_in_business: Number of years since the business started
  hours: Business operating hours
  rating: Rating of the business
  reviews: Number of reviews on the business
  price: Typical spending on the business
  description: Extra information that is not mentioned yet in any of the data
  service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
  is_operating: Whether the business is operating
  
  HTML: 
  {html}

  • infecto 2 years ago

    This should be higher up. This whole blog post is mostly worthless because the way they are extracting data is less than optimal.

    Lower-end models do not have the attention to complete tasks like this; GPT-4 Turbo generally does. But for an optimal pipeline you should really split these tasks into individual units: extract each attribute you want independently, then combine the results however you want. Asking for JSON upfront is equally suboptimal in the whole process.

    I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

    Edit: I am not suggesting that an LLM is more optimal than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong from an LLM workflow perspective.

    • ilyazub 2 years ago

      > I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

      Cool, cool. I'm super interested. Please share the process and the results.

    • wruza 2 years ago

      Also, my (limited) experience with prompts tells me that you want to invest more in the “You are” part. I’ll share my understanding; corrections are appreciated.

      LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a transcript of a conversation. The “AI” side in the chat isn’t the LLM itself, it’s a character (and so are you - it completes your “You: …” replies too; that’s where the driver app stops it and allows you to interfere). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve whole classes of issues.

      Idk what will work best in this case, but I’d start by describing what sort of bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/technical skills and problem-solving abilities, what other people think of it, their experience with it, and so on. Maybe I’d add a few examples of interactions in free form. Then in the task message I’d give it more specific details about that JSON.
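
      For example, something along these lines (purely illustrative, not a tested prompt):

        You are ScrapeBot, a meticulous web-data extraction assistant with years of
        experience turning messy HTML into clean, structured records. When information
        is missing or ambiguous, you never guess; you output null and move on.
        Colleagues describe you as precise, literal and tireless.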

      One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.

      Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.

      > I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

      Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.

    • mhuffman 2 years ago

      Do you have an example of a more optimal prompt to share?

      • infecto 2 years ago

        The prompt does not matter as much as the workflow described above: 1) Extract one attribute at a time. 2) Don't ask for JSON during extraction, though for small binary attributes it might not matter as much. 3) Combine the data later.

        There are differences in how different models perform against the same raw prompt, but generally the workflow is what matters more. The raw text prompt will depend on which model you are using, but I don't think it's at the level of "prompt engineering" we had a year ago. A rough sketch of the workflow is below.
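
        A sketch of that workflow in Python (the ask_llm helper and the prompt wording are hypothetical, just to illustrate the flow):

          ATTRIBUTES = ["title", "type", "phone", "address", "rating", "reviews"]

          def extract(html, ask_llm):
              record = {}
              for attr in ATTRIBUTES:
                  # one small, focused extraction per attribute; no JSON requested here
                  prompt = f"From the HTML below, return only the business {attr}, or null if absent.\n\n{html}"
                  record[attr] = ask_llm(prompt).strip()
              return record  # combine into structured data afterwards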

feintruled 2 years ago

Brave new world, where our machines are sometimes wrong but by gum they are quick about it.

  • RUnconcerned 2 years ago

    I too am a big fan of having my computer hallucinate incorrect information.

    • darthrupert 2 years ago

      Yesterday I asked my locally running gpt4all "What model are you running on?"

      Answer: "I'm running on Toyota Corolla"

      Which was perhaps the funniest thing I heard that day.

    • harryf 2 years ago

        >> print(“Hello, world!”.ai_reverse())
        world, Hello!

      • ben_w 2 years ago

        The first few versions of Swift kept changing how strings work, because it's not entirely obvious what most people intend by the nth element of a string.

        Used to be easy, when it was ASCII.

        Reverse the bytes of UTF-8 and it won't always be valid UTF-8.

        Reverse the code-points, and the Canadian flag gets replaced with the Ascension Island flag.
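
        A quick Python illustration of both failure modes:

          flag = "\U0001F1E8\U0001F1E6"   # Canadian flag: regional indicators C + A
          print(flag[::-1])               # prints the Ascension Island flag: A + C

          data = "héllo".encode("utf-8")
          data[::-1].decode("utf-8")      # raises UnicodeDecodeError: reversed bytes aren't valid UTF-8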

      • samus 2 years ago

        Character-level operations are difficult for LLMs. Because of tokenization they don't really "perceive" strings as a list of characters. There are LLMs that ingest bytes, but they are intended to process binary data.

RUnconcerned 2 years ago

Finally, something more offensive than parsing HTML with regular expressions: parsing HTML with LLMs.

  • AlphaAndOmega0 2 years ago

    I for one am glad I can offload all the regex to LLMs. Powerful? Yes. Human readable for beginners? No.

    • cornedor 2 years ago

      Why though? To me, it seems more prone to issues (hallucinations, prompt injection, etc.). It is also slower and more expensive at the same time. I also think it is harder to implement properly, and you need to add way more tests in order to be confident it works.

    • RUnconcerned 2 years ago

      Personally when I am parsing structured data I prefer to use parsers that won't hallucinate data but that's just me.

      Also, don't parse HTML with regular expressions.

      • rybosome 2 years ago

        Generally I agree with your point, but there is some value in a parser that doesn’t have to be updated when the underlying HTML changes.

        Whether or not this benefit outweighs the significant problems (cost, speed, accuracy and determinism) is up to the use case. For most use cases I can think of, the speed and accuracy of an actual parser would be preferable.

        However, in situations where one is parsing highly dynamic HTML (e.g. if each business type had slightly different output, or you are scraping a site which updates its structure frequently and breaks your hand-written parser), this could be worth the accuracy loss.

        • samus 2 years ago

          You could employ an LLM to give you updated queries when the format changes. This is something where they should shine. And you get something that you can audit and exhaustively test.

    • okamiueru 2 years ago

      Deterministic? No.

retrac98 2 years ago

There are so many applications for LLMs where having a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time-consuming to resolve for an organisation.

  • nathan_compton 2 years ago

    If you need a perfect score, don't use LLMs. This seems obvious to me, even given the state-of-the-art LLMs. I am a heavy user of GPT-4 and I wouldn't bet $1000 on it being 100% reliable for any non-trivial task.

    • retrac98 2 years ago

      They'll get better. Humans are far from perfect, and I have no doubt that LLMs will eventually outperform them for non-trivial tasks consistently.

      • nathan_compton 2 years ago

        Maybe so, but at this stage I wouldn't be betting a business model on it.

        • Socnic 2 years ago

          Businesses do bet on imperfect and even criminal models all the time (way before LLMs existed)... they call it cost of doing business when they get it wrong or get caught.

      • Jensson 2 years ago

        > Humans are far from perfect

        Humans running multishot with a mixture of experts are close to perfect. You can't compare a multishot mixture-of-experts AI to a single human; humans don't work in isolation.

      • littlestymaar 2 years ago

        Machine learning models will get better for sure. We don't know if LLMs are the end game, though, and it's not certain that this particular technique is what we'll need to reach the next level.

      • somewhereoutth 2 years ago

        Or they might not get better. It could be that we are at a local optimum for that sort of thing, and major improvements will have to wait (perhaps for a very long time) for radical new technologies.

        • luma 2 years ago

          Maybe, but it certainly hasn’t been the arc of the past few years. I don’t know how anyone could look at this and assume that it’s likely to slow down.

      • samus 2 years ago

        They already have superhuman image classification performance.

        • pooper 2 years ago

          I remember talking to a radiologist about ten years ago who said he was sure something like this was coming: instead of a radiologist looking at scans manually, a machine would go through a lot of images and flag some for manual review.

          We haven't even gotten there yet, have we?

          • osrec 2 years ago

            Yes, we absolutely are there: https://youtu.be/D3oRN5JNMWs?feature=shared

            My professor (Sir Michael Brady) at university 14 years ago set up a company to do this very thing, and he already had reliable models back before 2010. I believe their company was called Oxford Imaging or something similar.

            • wruza 2 years ago

              Yep, everyone seems to forget that ML was available before 2021. Had a conversation recently with my former colleague who learned about some plastic packaging company which used "AI" to predict client orders and inform them about scheduling implications. When I told him that you don't need Transformers and 30GB models for that, he was quasi-confused, cause he kinda knew it but the hype just overtook his knowledge.

              • anon373839 2 years ago

                In ML courses, you’re taught to try simpler methods and models before turning to more complex ones. I think that’s something that hasn’t made it into the mainstream yet.

                A lot of people seem to be using GPT-4 for tasks like text classification and NER, and they’d be much better off fine-tuning a BERT model instead. In vision, too, transformers are great but a lot of times, a CNN is all you really need.

          • dagw 2 years ago

            > We haven't even gotten there yet, have we?

            Yes and no. Countless teams have solved exactly this problem at universities and research groups across the world. Technically it's pretty much a solved problem. The hard part is getting the systems out of the labs, certified as an actual product, and convincing hospitals and doctors to actually use them.

          • matheusd 2 years ago

            Maybe it's a liability issue, not a competency issue.

        • jojobas 2 years ago

          Until a single pixel makes a cat a dog or something like that.

          • samus 2 years ago

            Changing a single pixel is usually not enough to confuse convolutional neural networks. Even so, human supervision will probably always be quite important.

  • spaniard89277 2 years ago

    I've tried applying it to parsing HTML, as in this article, in a pretty long pipeline. I'm using DeepInfra with Mixtral 8x7B and I'm still unsure if I'm going to use it in production.

    The problem I'm finding is that the time I wanted to save maintaining selectors and the like is time that I'm now spending writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to handle them in a deterministic manner.

    I've also tried with GPT-4 but it's way more expensive, and despite what this guy got, it also makes mistakes.

    I don't really care about inference speed, but I do care about price and correctness.

    • ogogmad 2 years ago

      Might be a silly question, but if you want determinism in this, why don't you get the LLM to write the deterministic code, and use that instead? Interesting experiment, though!

      In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?

    • Eisenstein 2 years ago

      Have you tried swapping Mistral 8x7B with either command-r 34B, Qwen 1.5 70B, or miqu 70B? Those are all superior in my experience, though suited for slightly different tasks, so experimentation is needed.

    • samus 2 years ago

      Parsing HTML and tag soup is IMHO not the right application for LLMs, since these are ultimately structured formats. LLMs are for NLP tasks, like extracting meaning out of unstructured and ambiguous text. The computational cost of an LLM chewing through even a moderately sized document can be more efficiently spent on sophisticated parser technologies that have been around for decades, which can also deal to a degree with ambiguous and irregular grammars. LLMs should be able to help you write those.
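
      For comparison, a minimal deterministic sketch with BeautifulSoup (the CSS selectors are placeholders, not the real Google SERP class names):

        from bs4 import BeautifulSoup

        def extract_business(html):
            soup = BeautifulSoup(html, "html.parser")

            def text(selector):
                el = soup.select_one(selector)
                return el.get_text(strip=True) if el else None

            return {
                "title": text(".business-name"),   # placeholder selectors
                "rating": text(".rating"),
                "reviews": text(".review-count"),
                "hours": text(".hours"),
            }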

  • malux85 2 years ago

    Yeah I agree - just an hour ago I was dealing with an LLM that was missing a "not", thus inverting the meaning of a rather important simulation parameter!

  • worldsayshi 2 years ago

    It makes much more sense to me to have the LLM infer the correct query for extracting data on the page. Much faster and more reliable, and it wouldn't really be a problem to have a human in the loop every now and then.

  • onion2k 2 years ago

    All the places I see AI being applicable to my work don't require a perfect score, and a threshold is actually much more useful, especially where multiple factors come together to make evaluating down to a single value hard.

  • bberrry 2 years ago

    If you have speed you can generate multiple answers and have another model pick the best one.

    • Drakim 2 years ago

      If I ask an LLM a very complex and specific question 500 times and it just doesn't know the facts, I'll still get the wrong answer 500 times.

      That's understandable. The real problem is when the AI lies/hallucinates another answer with confidence instead of saying "I don't know".

      • simion314 2 years ago

        The problem is asking for facts. LLMs are not a database; they know stuff, but it is compressed, so expect wrong facts, wrong names, wrong dates, wrong anything.

        We will need an LLM as a front end; it will generate a query to fetch the facts from the internet or a database, then maybe format the facts for your consumption.

        • samus 2 years ago

          This is called Retrieval-Augmented Generation (RAG). The LLM driver recognizes a query, it gets sent to a vector database or to an external system (could be another LLM...), and the answer is placed in the context. It's a common strategy to work around their limited context length, but it tends to be brittle. Look for survey papers.
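
          A minimal sketch of that loop, assuming hypothetical embed(), vector_db.search() and llm() helpers:

            def answer(question):
                query_vec = embed(question)                    # embed the user query
                docs = vector_db.search(query_vec, top_k=3)    # retrieve relevant passages
                context = "\n\n".join(d.text for d in docs)    # place them in the context window
                prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
                return llm(prompt)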

        • ch_sm 2 years ago

          That's exactly it. It's ok for LLMs to not know everything, because they _should_ have a means to look up information. What are some projects where this obvious approach is implemented/tried?

          • Jensson 2 years ago

            But then you need an LLM that can distinguish between grammar and facts. Current LLMs don't know the difference; that is the main source of these issues. These models treat facts like grammar, and that worked well enough to excite people, but it probably won't get us to a good state.

      • m348e912 2 years ago

        The weird problem with LLM hallucinations is that the model usually will acknowledge its mistake and correct itself if you call it out. My question is why LLMs can't include a sub-routine to check themselves before answering - simply asking something like "this answer may not be correct, are you sure you're right?"

        • Shrezzing 2 years ago

          >The weird problem is with LLM hallucinations is that it usually will acknowledge its mistake and correct itself if you call it out.

          From what I've tested, all of the current models will see a prompt like "are you sure that's correct" and respond "no, I was incorrect [here's some other answer]", irrespective of the accuracy of the original statement.

        • greenavocado 2 years ago

          In my experience the corrections can be additional hallucinations one after another after pointing out inaccuracies even multiple times in a row.

        • Eisenstein 2 years ago

          > My question is why LLMs can't include a sub-routine to check themselves before answering.

          Because LLMs don't work in a way for that to be possible if you operate them on their own.

          Here is the debug output from my local instance of Mixtral 8x7B Instruct. The prompt from me was 'What is poop spelled backwards?'. It answered 'puoP'. Let's see how it got there, starting with it processing my prompt into tokens:

             'What (3195)', ' is (349)', ' po (1627)', 'op (410)', ' sp (668)', 'elled (6099)', ' backwards (24324)', '? (28804)', '\n (13)', '### (27332)', ' Response (12107)', ': (28747)', '\n (13)',
          
          It tokenized 'poop' as two tokens: 'po', number 1627, and 'op', number 410.

          Next it comes up with its response:

             Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
             Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
             Generating (3 / 512 tokens) [(P 100.00%)]
             Generating (4 / 512 tokens) [( 100.00%)]
          
          It picked 'pu' even though it was only a ~4% chance of being correct, then instead of picking 'op' it picked 'o'. The last token was a 100% probability of being 'P'.

             Output: puoP
          
          At no time did it write 'puoP' as a complete word nor does it know what 'puoP' is. It has no way of evaluating whether that is the right answer or not. You would need a different process to do that.

        • ZitchDog 2 years ago

          The problem is that if you call it out, it will frequently change its answer, even if it was correct. LLMs currently lack chutzpah.

        • Jensson 2 years ago

          That is a common bullshitting strategy: talk a lot of bullshit, then backtrack and acknowledge you were wrong when people push back. That way they will think you know way more than you do. Many people will see through that, but most will just think you are a humble expert who can acknowledge when you are wrong, instead of someone who always acknowledges being wrong even when you aren't.

          People have a really hard time catching such bullshitting from humans, which is why free-form interviews don't work.

        • asimovfan 2 years ago

          It's because there's no entity that is actually acknowledging anything. It's generating an answer to your prompt. You can gaslight it into treating anything as wrong or correct.

        • samus 2 years ago

          They simply don't work that way. You are asking it for an answer, it will give you one since all it can do is extrapolate from its training data.

          Good prompting and certain adjustments to the text generation parameters might help prevent hallucinations, but it's not an exact science since it depends on how the model was trained. Also, frankly, an LLM's training data contains a lot of bulls*t.

      • helsinkiandrew 2 years ago

        > If I ask an LLM a very complex and specific question 500 times and it just doesn't know the facts, I'll still get the wrong answer 500 times.

        I think the commenter meant using another model/LLM which could give a different answer, then letting them vote on the result - like "old fashioned AI" did with ensemble learning.
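
        A minimal sketch of that kind of voting, assuming a hypothetical ask(model, prompt) helper:

          from collections import Counter

          def vote(prompt, models):
              answers = [ask(m, prompt) for m in models]    # query each model once
              return Counter(answers).most_common(1)[0][0]  # majority answer wins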

infecto 2 years ago

This test is interesting as a general high-level benchmark, but the way they are extracting data using an LLM is suboptimal, so I don't think the takeaway means much. You could extract this type of data using a low-end model like 8x7B with a high degree of accuracy.

  • samus 2 years ago

    The better way would be to ask it to generate a program that uses CSS selectors to parse the HTML.

emporas 2 years ago

Mixtral works very well with JSON output in my personal experience. The GPT family is excellent of course, and I would bet Claude and Gemini are pretty good. Mixtral, however, is the smallest of these models and the most efficient.

Especially running on Groq's infrastructure it's blazing fast. In some examples I ran on Groq's API, the query was completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API here [1].

Groq's API reports how long it took to generate the tokens for each request. 70ms for a page-long document is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queue might exist, the user receives the response in about a second - but how fast would this model run locally? Fast enough to generate natural-language tokens, generate a synthetic voice, listen again, and decode the user's next spoken request, all in real time.

With a technology like that, why not talk to internet services through APIs alone, with no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user. The same goes for every other interface and button around. Why press buttons on every electric appliance instead of just talking to the machine using a JSON schema? Why should users on an internet forum have to press the "add comment" button every time they add a comment, instead of just saying "post it"? Pretty annoying actually.

[1] https://github.com/pramatias/groq_test

imaurer 2 years ago

Groq will soon support function calling. At that point, you would want to describe your data specification and use function calling to do extraction. Tools such as Pydantic and Instructor are good starting points.

I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
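
A rough sketch of the schema-first approach with Pydantic (field names mirror the blog's prompt; wiring it up to Instructor or a function-calling model is left out):

  from typing import Optional
  from pydantic import BaseModel

  class Business(BaseModel):
      title: str
      type: Optional[str] = None
      phone: Optional[str] = None
      address: Optional[str] = None
      rating: Optional[float] = None
      reviews: Optional[int] = None
      is_operating: Optional[bool] = None

  # A library like Instructor can pass a model like this as the response_model
  # to an OpenAI-compatible client and validate (and retry) the structured output.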

bambax 2 years ago

Interesting post, but isn't the prompt missing? How do the LLMs generate the keys? It's likely the mistakes could be corrected with a better prompt or a post-check?

Also, the Google SERP page is deterministic (it always has the same structure for the same kind of queries), so it would probably be much more effective to use AI to write a parser, then refine it and use that?

tosh 2 years ago

I initially thought the blog post was about scraping using screenshots and multi-modal LLMs.

Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated HTML, …).

crowdyriver 2 years ago

There are lots of comments here about how stupid it is to parse HTML using LLMs.

Have you ever had to scrape multiple sites with wildly varying HTML?

  • samus 2 years ago

    The example here has HTML with a somewhat fixed format. It would indeed have been better to have samples with different formats and to aim for a low error rate.

    If you are scraping a limited number of sites, you could ask the LLM for parsing code for each site from some samples, review it, and move on.

malux85 2 years ago

Sorry to be nit-picky, but that's the essence of these benchmarks - Mixtral putting "N/A" for "not available" is weird. N/A means "not applicable" in every use I have ever seen, and they DON'T mean the same thing. I would expect null for not available and N/A for not applicable.

Impressive inference speed difference though

  • mewpmewp2 2 years ago

    I have always known N/A as not available.

    • malux85 2 years ago

      Curious, where are you from? If I Google N/A, every single hit on the first page explains that it means "Not applicable".

      Are you from a non-English-speaking country? Maybe it's cultural?

      • selcuka 2 years ago

        The first entry on Google is Wikipedia [1] for me:

        > N/A (or sometimes n/a or N.A.) is a common abbreviation in tables and lists for the phrase not applicable, not available, not assessed, or no answer.

        [1] https://en.wikipedia.org/wiki/N/A

        • malux85 2 years ago

          That's interesting - Wikipedia is not on the first page for me; my first hit is the Cambridge dictionary (and then a bunch of other dictionaries). I'm flying right now, but IP geolocation puts me in the US.

          > Meaning of n/a in English: written abbreviation for not applicable: used on a form to show that you are not giving the information asked for because the question is not intended for you or your situation: If a question does not apply to you, please put N/A in the box provided.

          TIL

          • Jensson 2 years ago

            In a data table, "not available" is usually the right word for it - like if you have a list of national statistics, some of the values won't be available due to political reasons, etc. But all of those mean basically the same thing to the end user: this value isn't there.

      • mewpmewp2 2 years ago

        I'm from Northern Europe, so not a native English speaker, but based on my experience the first thing that comes to mind is that it's "Not Available".

        If I was to code something and for whatever reason some data wasn't available I would use N/A.

        "Not applicable" doesn't feel right to me about N/A.

        For instance, if there is a comparison table and for whatever reason data is missing for some entity when it should be there, I would use N/A. So "not applicable" feels wrong to me for that reason alone.

        This all is coming from intuition though.

  • throwaway11460 2 years ago

    It means all of these.

huqedato 2 years ago

Can somebody explain why this Groq is more performant than Microsoft's infrastructure? Is an LPU better than a TPU/GPU?

  • kkielhofner 2 years ago

    LLM performance is about parallelism but also memory bandwidth.

    Groq delivers this kind of speed by networking many, many chips together with high-bandwidth interconnect. Each chip has only 230 MB of SRAM [0].

    From the linked reference:

    "In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."

    That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
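
    (That figure is just 576 chips × 230 MB ≈ 132 GB of on-chip SRAM.)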

    If you consider the requirements of actual real-world inference serving workloads - serving multiple models, multiple versions of models, LoRA adapters, sentence-embedding models (for RAG), etc. - the economics and physical footprint alone get very challenging.

    It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:

    1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make significant margin on their GPU instances. If you look at qty 1 retail Nvidia DGX, Lambda Hyperplane, etc. and compare it to cloud GPU pricing (inference needs to run 24x7), the break-even on hardware vs. cloud is less than seven months, depending on what your costs are for hosting the hardware.

    2) Nvidia has incredibly high margins.

    3) CUDA.

    There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real-time agents, etc.), but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.

    [0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

  • tosh 2 years ago

    The Mixtral mixture-of-experts model has way fewer parameters active during inference, and Groq has special-purpose hardware (and probably less concurrent demand).

    • kkielhofner 2 years ago

      > probably less concurrent demand

      This is a significant understatement. ChatGPT has an estimated 100m monthly active users.

      Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).

  • naiv 2 years ago

    It's a totally different approach to inference.

    In short:

    Groq - AI chip
    Microsoft etc. - Nvidia GPU

ttrrooppeerr 2 years ago

A bit off-topic, but maybe not? Any word on GPT-5? Is that coming? Or is OpenAI just focusing on the Sora model?

  • YetAnotherNick 2 years ago

    There's no reason for OpenAI to release the model. They have close to 100% of the market anyway, and releasing GPT-5 likely won't increase the total market, as it is an incremental leap. And it's an open secret that most other models used GPT-4 synthetic data for training to come close to it.

    They will likely wait until another model performs better than GPT-4 for the same price.

    • whiplash451 2 years ago

      The same reasoning would have applied to GPT-3.5. In hindsight, you can say that it was obviously a good idea to build and ship GPT-4. But hindsight is 20/20.

      • YetAnotherNick 2 years ago

        There are a few differences. Firstly, GPT-3.5 wasn't ahead of PaLM etc. from Google, which was published around the same time as GPT-4.

        Secondly, GPT-4 increased the overall AI market. According to all the sources, interviews and leaks, GPT-5 won't be a big leap over GPT-4, as the model size and training data won't be significantly larger. I doubt GPT-5 would do the same. (I could be wrong in my assumption, though, that GPT-5 will just be an incremental gain.)

    • chilmers 2 years ago

      By any chance did you used to work in leadership at Nokia or Research in Motion? :-D

      • YetAnotherNick 2 years ago

        Nokia wasn't that far ahead in technology, and Research in Motion wasn't that far ahead in the market. GPT-4 is ahead in both.

    • lewhoo 2 years ago

      There is reason to release new models if said models would be capable of grabbing a significant portion of the job market currently occupied by humans.

    • tosh 2 years ago

      100%?

      Claude 3 Opus is in the capability ballpark of GPT-4, and GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and able to run offline (Qwen 1.5, Mixtral, …).

      • ZitchDog 2 years ago

        100% market share.

        A competitor will likely need to be 10x better than ChatGPT in order to get significant market share, not just marginally better in certain scenarios.

      • Kostic 2 years ago

        Is Claude 3 Opus generating more profit and taking a considerable number of customers from OpenAI? I'm not seeing that yet. Granted, I'm in Europe (outside the EU) so I can't pay for Opus, but I guess that kinda confirms my statement. GPT-4 is still a good product and there is no market pressure to release GPT-5.

  • burrish 2 years ago

    I hear it should be dropped this summer

    • cornedor 2 years ago

      According to Sam Altman in a podcast with Lex Fridman this week, there is no real indication that it will be dropped this year. They will release a new model, but it might not be GPT-5.

      • burrish 2 years ago

        Fair enough, I got the info from this article

        https://web.archive.org/web/20240319224624/https://www.busin...

      • whiplash451 2 years ago

        Which is an indication of nothing. In which world would Sam A. drop any kind of info about such a sensitive topic? If anything, this could just be deception before a massive drop.

        • HarHarVeryFunny 2 years ago

          Could also be resetting expectations for people who've been expecting GPT-5 (or just GPT-4.5) sooner - been a year now since GPT-4 was released.

          The other odd thing from Altman was saying that GPT-4 sucks.

          I think the context for both announcements is the recent release of Anthropic's Claude 3, which in its largest "Opus" form beats GPT-4 across the board in benchmarks.

          I personally think OpenAI/Altman is a bit scared that any moat/lead they had has disappeared and they are now being out-competed by Anthropic (Claude). Remember that Anthropic as a company was only formed (by core members of the OpenAI LLM team) around the time GPT-3 was released, so in the same time it took OpenAI to go from GPT-3 to GPT-4, Anthropic has gone from nothing -> Claude 1 -> Claude 2 -> Claude 3, which beats GPT-4!

          Anthropic has also had quite a bit of success attracting corporate business, much of which is more long-term in nature (sharing details of expected future model capabilities so that partners can target those).

          So, I think OpenAI is running a bit scared, and I'd interpret this non-announcement of some model (4.5 or 5) "coming soonish" to be them just waving the flag and saying "we'll be back on top soon", which they presumably will be, briefly, when their next release(s) do come out. Altman's odd "GPT-4 sucks" statement might be meant to downplay Claude-3 "Opus" which beats it.

    • DalasNoin 2 years ago

      My understanding from the Lex podcast: they will release a lot of new models this year, but they will release intermediate models before GPT-5.

dns_snek 2 years ago

For all the posturing and crypto hate on HN, we're entering a world where it's socially acceptable to use 1000W of computing power and 5 seconds of inference time to parse a tiny HTML fragment which would take microseconds with traditional methods - and people are cheering about it. Time for some self-reflection? That's not very green.

  • delegate 2 years ago

    Crypto energy requirements go up as the currency gets more traction.

    TFA shows that Groq is many times faster than GPT-4 - up to 18x, Groq claims. Faster means less energy. So I think it's just a matter of time until these things become ridiculously power efficient (e.g. running on phones in sub-second times).

    • jodleif 2 years ago

      How does faster mean less energy? That's only true if you're running faster on the same hardware…

      • delegate 2 years ago

        Presumably: less time that the giant chip has to draw power for computation. The point is that everyone's interested in making AI power efficient, while crypto's proof of work is a competition for more power burned hashing and throwing away the result.

      • wenebego 2 years ago

        I think they are talking about the case where, hypothetically, there is a 10x increase in speed but only 2x increase in power consumption

    • drexlspivey 2 years ago

      Bitcoin energy requirements will be cut in half in a few days..

    • samus 2 years ago

      It's still a monstrosity compared to a traditional parser. You can even be fancy and use complex parsers that backtrack and can deal with mildly context-sensitive languages (as required for HTML, XML, and many programming languages), and you'd still be more efficient.

  • shanehoban 2 years ago

    This is a valid point, but we are still in the early stages of AI/LLMs, so one would expect the speed and efficiency to improve drastically (perhaps accuracy too) over the coming years.

    At least AI & LLMs have large scale practical applications as opposed to crypto (IMO).

    • AlchemistCamp 2 years ago

      AI is a lot older than blockchain. There were full-fledged neural networks in the 40s and the perceptron was implemented in hardware in the 50s.

      • IshanMi 2 years ago

        It's also interesting to think that IBM released an 8-trillion parameter model back in the 1980s [0]. Granted it was an n-gram model so it's not exactly an apples-to-apples comparison with today's models, but still, quite crazy to think about.

        [0]: https://aclanthology.org/J92-4003.pdf

        • lukeschantz 2 years ago

          Interesting to see Robert Mercer the former CEO of Renaissance Technology is one of the authors on that paper. He is a former IBMer. If his name is unfamiliar he is a reclusive character who was a major funder of Breitbart, Cambridge Analytica and the Republican candidate in the 2016 presidential election.

      • varjag 2 years ago

        I wouldn't call the early McCulloch & Pitts work quite "full-fledged". Also, backpropagation, essential for multilayer perceptrons, was not a thing until the 1980s.

        • samus 2 years ago

          Backprop is just applied calculus. People simply hadn't thought about using it for neural networks yet.

          • varjag 2 years ago

            It was thought of as early as the 1960s by Rosenblatt, but he did not come up with a practical implementation at the time. Lotsa things look obvious in hindsight.

  • ogogmad 2 years ago

    You're partially right. It's obvious that the solution is to combine traditional programming with AI, using traditional programming wherever possible because it's greener. Assuming you want things to turn out well in every possible future scenario, your decisions only matter if AGI isn't right around the corner. So assume it isn't right around the corner. Then there's going to be some interesting combining-together of manual human intervention, traditional software, and AI. We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.

    Crypto is nearly pure waste.

    • CaptainFever 2 years ago

      > We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.

      I don't understand this. This adds bureaucracy and I don't see why different uses need to be charged differently if they all use energy the same.

      In other words, if energy costs X per unit, and an inefficient (AI) software takes 30 units and an efficient (traditional) software takes 10 units, then it is already cheaper to run the efficient software, and thus people are already incentivised to do so. There's no need to charge differently. If one day AI turns out to only need 5 units, turning more efficient, then just charge them for 5X. People will gravitate towards the new, efficient AI software naturally then.

  • Jensson 2 years ago

    Websites will never be fast, will they? Even with 1000x more compute than now, they will just run everything through LLM calls, and things will be just as slow as they are now.

  • qup 2 years ago

    It would take microseconds after a complete program was written by a human?

    It no longer requires an expert human.

    • josho 2 years ago

      And if this use case hit any kind of scale, we'd just have an LLM generate a parser and be back to microseconds.

      This was just a blog post to generate traffic to the site, not to showcase some new use case for an LLM.

  • samlinnfer 2 years ago

    Any amount of energy spent on useful work is vastly superior to whatever "PoW" crypto burning does.

    > For all the posturing and forest fire hate on HN, it's now socially acceptable to run a toy steam engine to power a model car? Not very green of you.

    • CaptainFever 2 years ago

      It's almost a fallacy at this point to declare something bad simply because of the existence of carbon emissions, without first comparing the benefits of what is being produced, and the alternative tradeoffs.

      To be fair to GP, they did compare it to alternatives (dumb HTML parsing), but failed to consider versatile HTML parsing or other uses for Groq LLM.

    • samus 2 years ago

      While you are not wrong, crypto is not what this is being compared with.

  • londons_explore 2 years ago

    While energy remains cheap and human minds remain expensive, it always makes sense to use AI to reduce human effort.

    If one cares about the environment, a carbon cap/tax is what you should campaign for. Then carbon-based energy sources will be curtailed, energy costs will go up, and AI like this will be encouraged to become more energy efficient, or other methods will be used instead.

    • osigurdson 2 years ago

      It is a nice idea in principle but ends up being a political tool and a tariff on goods and services of your own country. A global and corruption free carbon tax might work but that is impossible to achieve.

      • londons_explore 2 years ago

        The only way it's gonna work is if a bunch of countries get together, agree a carbon cap/tax, and then tell other countries that they need to join the scheme if they want to trade goods with the group.

        One way to combat corruption is to ask an international panel of experts to assess how many extra emissions came from non-official sources in each country and reduce next years cap by that amount. Then countries have an incentive to stamp out corruption.

        • osigurdson 2 years ago

          I don't know. Corruption gets easier with increased centralization. I think a far better approach is to innovate our way out of it. If carbon-free energy sources are less expensive, then the problem will essentially solve itself. A global carbon tax will inevitably lose some portion of global GDP to corruption. That money would likely be better spent in other ways.

          Basically, carbon tax is the accountant's solution, innovation is the engineer's.

          • londons_explore 2 years ago

            Carbon-free energy will take a really long time to be cheaper.

            As soon as demand for oil starts to drop, so will oil prices, and I suspect they could go down by a factor of 10 or more and oil-rich nations would still think it worthwhile to exploit at least some reserves.

  • infecto 2 years ago

    Because crypto has very little real-world use.

    There is a lot of business value happening in the AI space, and it's only going to get better.

  • skc 2 years ago

    One is actually useful day to day though.

  • rafaelero 2 years ago

    What a ridiculous complaint. Energy efficiency won't remain static, and even if it were, it's not up to you to decide how to best leverage the available electricity.

    • lm28469 2 years ago

      > it's not up to you to decide

      Unless you live in a dictatorship it's definitely up to us to decide... Otherwise you leave your voice to the top 0.0001% business owners and expect them to work for your good and not for their own interests

      Also read about the rebound effect. Planes are twice as efficient as they were 100 years ago yet they pollute infinitely more as a whole.

      There is nothing ridiculous about the comment you're replying to

      • infecto 2 years ago

        Yes, you are right, and the future depends on innovation and on using more electricity, with a large percentage of it coming from renewable sources. I don't want to go live on the farm myself.

      • rafaelero 2 years ago

        Ok, then let's start by doing away with all the wasteful animal farming.

  • satisfice 2 years ago

    AND it's not even reliable.
