GPT-4 could pass bar exam, AI researchers say
the-decoder.com

I feel like I can now see the event horizon of commoditized intelligence. No idea what society (is "society" even the right word? Who knows) is going to look like on the other side of it, but it is going to be wildly different. Perhaps a brief period where everyone is using an AI to do their job, uh, I mean, assist their work, but beyond that it's unknowable.
Moreover, this looks like it is going to be happening sooner rather than later.
GPT has no reasoning ability; it has billions of parameters that make it pretend it does, purely going off previously digested material.
As soon as it comes across a reasoning process that has not been seen before in the training set, which can be as easy as a middle-school math question, it fails, because it has no ability to extrapolate logic.
If it manages to pass the bar exam, that says more about the bar exam than it says about GPT.
Most jobs today don't need novel reasoning. This is the equivalent of the steam engine for intelligence.
During industrialization, machines did not replace all jobs, but they replaced or changed most jobs. The same will happen here.
A typical office job will have a few hours a week of actual, intensive thought. The vast majority of time will be spent doing simple, repetitive work. This work can be automated, or at least significantly sped up, using technology like GPT.
“write an API client for …”, “integrate APIs … and …” can easily be automated. Yes, you'll still have to write the business logic, but that's not the majority of your work today. You could even have it write unit tests based on the JIRA ticket description.
The same applies to many other jobs.
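To make that concrete, here is a rough sketch of the "unit tests from a ticket" idea, assuming the OpenAI completions endpoint and a made-up ticket description; the model name and prompt wording are illustrative, not a recommendation:

# Sketch: turn a (hypothetical) JIRA ticket description into draft unit tests.
# Assumes an OpenAI API key in OPENAI_API_KEY; prompt and model are illustrative.
import os
import requests

ticket = """As a user I can reset my password via an emailed token.
Tokens expire after 30 minutes and can only be used once."""

prompt = (
    "Write pytest unit tests for the behaviour described in this ticket.\n"
    f"Ticket:\n{ticket}\n\nTests:\n"
)

resp = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-davinci-003", "prompt": prompt,
          "max_tokens": 512, "temperature": 0},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])  # draft tests; a human still reviews them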
> You could even have it write unit tests based on the JIRA ticket description.
This is a wonderful point: writing unit tests is exactly the kind of mind-numbing tedium that I'm super excited to automate away.
> Most jobs today don't need novel reasoning. This is the equivalent of the steam engine for intelligence.
Like the point above; that says more about the work.
It's going to be really interesting to see how the middle-class narrative pushes back on AI revealing how little work is actually done during office hours.
This kind of boilerplate code can be, and already is, automated away using deterministic frameworks. No need to introduce a black box and be responsible for debugging the stuff it creates, which sounds far more painful than the alternatives.
That's true today, but think about all the work you do that takes basically no conscious effort, but is still not automated yet.
GPT can be of use there. As long as you're working with languages that have strict static types and proper tests, it's easy to automate and to check that there are no mistakes.
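A rough sketch of what that checking loop could look like; the LLM call is passed in as a hypothetical callable, and mypy/pytest stand in for whatever type checker and test suite the project already uses:

# Sketch of the "let the type checker and tests catch the mistakes" loop.
# generate_with_llm is a hypothetical, caller-supplied model call.
import subprocess
from typing import Callable, Optional

def generate_and_verify(spec: str,
                        generate_with_llm: Callable[[str], str],
                        max_attempts: int = 3) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        code = generate_with_llm(spec + feedback)
        with open("generated.py", "w") as f:
            f.write(code)
        types = subprocess.run(["mypy", "generated.py"], capture_output=True, text=True)
        tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if types.returncode == 0 and tests.returncode == 0:
            return code            # checks pass; hand the code over for human review
        feedback = "\n\nFix these errors:\n" + types.stdout + tests.stdout
    return None                    # give up and let a human take over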
You are implying either:
* Understanding complex language does not require logic/reasoning,
* There are infinitely many forms of logic/reasoning or at least more than those existing in a vast training set.
Neither of which is likely true.
What do you think of the Minerva system, which can solve multi-step quantitative reasoning questions better than many competent students and most adults?
https://ai.googleblog.com/2022/06/minerva-solving-quantitati...
Note: If you look at LSAT test samples, many questions are tests of complex logical reasoning, a requisite for legal professions.
You nailed what I find discomforting about these discussions. They're incredibly narrowly focused on a specific implementation that solves hitherto unsolved problems, dismissing it by pointing out that it doesn't do already-solved problems. But surely folks realize the human brain isn't a single monolithic processing program but an ensemble of specialized subsystems that organize to form the mind. Why would you assume you wouldn't do the same with AI systems? We've been tackling reasoning, inference, problem solving, information retrieval, mathematics, logic, and other domains for decades with some stupendous results. But those systems lacked the ability to ingest language and translate it into some intermediate semantic form, and to take output and reconstruct it into a human language. Likewise, vision and audio processing, input and output, have been a struggle until recently.
I also really strongly disagree that it's basically doing some sort of information-retrieval design where, based on language, it regurgitates some sort of Markov expectations. You can ask it to do very complex translations of a concept from one domain to another, expressed in a form that's certainly never been done before, and it does it with alacrity. At the very minimum it "remembers" things from earlier in the conversation and can associate semantic ideas across prompts and synthesize cogent responses; that in itself implies it has some semantic "understanding" of the structure of the language. That is a huge missing piece in our toolkit to date.
Frankly I feel these threads expose just how jaded and unable to dream we have become, that even when a wonder walks up and hits you in the nose we can’t even see it.
Language prediction models are not a closely guarded secret; I suggest looking into academic papers about what they are and maybe even seeing/doing some implementation yourself.
There is no magic; it is just a more complicated transpose, created by training over perhaps 10% of all available text on the internet.
It does have a lot of uses; for one, I think it would probably put Grammarly out of business, and maybe even do some work for law firms.
> Understanding complex language does not require logic/reasoning
The key word is understanding. It does not need to understand; it has already seen the question asked in a hundred different ways, and it has also seen the answers to all of those. It just rephrases those answers via a neural network, and that happens to be enough to pass the bar exam.
> There are infinitely many forms of logic/reasoning or at least more than those existing in a vast training set.
More importantly, the differences between those forms are subtle and cannot be picked up by the model; that's why ChatGPT confidently gives wrong answers on Stack Overflow: https://meta.stackoverflow.com/questions/421831/temporary-po...
The LSAT tests formal logic. Some of it is complicated. Much less of it is required for the practice of law.
Src: scored 99.8th percentile on LSAT, tutored it, now working at major law firm
Also just adding to my earlier reply (can't edit), none of it is "complex" relative to the complexity of some of the concepts in computer science or more brainy parts of complicated professional software development.
Most people's reasoning ability functions at this level.
I would argue that it does not matter. The AI could even be "smarter" on pure IQ/reasoning, but in terms of the practical reasoning humans need, the kind that depends on exposure to the real world, AIs will still take decades to catch up.
Radiology AIs are technically more accurate than radiologists on any sufficiently large dataset, and yet they still have not replaced radiologists (or even come anywhere close to it).
There’s an interesting sideroad here: A lot of medical jobs are there because of legal culpability: If someone fails the blame can be placed on them and not on the hospital who hired them. It might be a long time before they get to have that same protection with AI (probably miles and miles of contracts negotiated with a 3rd party who owns and operates the AI that’s used).
And in fact we want them kept outside of a professional capacity, in which they would be dangerous liabilities. ...Isn't this piece about "Bar Exams"?
Which should be worrying. It's also why most people don't pass (nor try) the bar exam.
But let's be real. You can pass the bar exam with a lot of studying and memorizing (and many people do just that). A better test for whether the AI can do reasoning is some sort of puzzle which requires a creative solution.
Not really; this demonstrates that the bar exam doesn't require a large amount of reasoning, just lots of recitation.
Humans are bad at reciting things, but a lot better (compared to GPT) at reasoning.
>As soon as it comes across a reasoning process that has not been seen before in the training set, which can be as easy as a middle-school math question
Is this true even if you tell it to show its working? In my experience that drastically improves its ability to do math problems.
At the end of the day it's still a language prediction model.
Which means whatever apparent logic you're getting out of it comes from text that it has learned: not reasoning embedded within that text, but the actual text itself.
It can't solve quadratic equations.
Prompt: solve x^2 + 11x + 30 = 0. Show your working
Answer: To solve this equation, we can use the quadratic formula:
x = (-b +/- sqrt(b^2 - 4ac)) / (2a)
Plugging in the values for a, b, and c, we get:
x = (-11 +/- sqrt(11^2 - 4*1*30)) / (2*1)
x = (-11 +/- sqrt(121 - 120)) / 2
x = (-11 +/- sqrt(1)) / 2
x = (-11 +/- 1) / 2
x = -5 or x = -6
So the solutions to the equation are x = -5 and x = -6.
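For reference, plugging the quoted roots back into the equation is easy to check (plain Python, nothing model-specific):

# Both lines should print 0, so the quoted solutions satisfy the equation.
for x in (-5, -6):
    print(x, x**2 + 11 * x + 30)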
That’s true of almost all humans too. You’re self selecting for the elite you belong to.
Largely yes. Some prompts can make it better or worse.
Until someone can point out the difference between neuronal reasoning and silicon reasoning, I remain completely agnostic about the underlying mechanics of whatever model.
Gun to my head where I had to put money down, I would put it on "Brains are not nearly as special as we (they?) think they are." No fairy dust or supernatural beings required, brains are just another AI model (and likely not even a particularly great one).
Human brains helped humans survive for a long time.
An AI that survives that long surely has to be great. Probably you meant that human brains are not made for the world of today.
I've already been using GPT and ChatGPT to much success for my work.
Yes, it doesn't have reasoning ability, but being able to manage knowledge and information in the way that these models can is still an amazing feat.
It does have some ability to extrapolate to new problems, provided its training corpus has reasonably close coverage. It is not going to be making new scientific discoveries or insights but then neither are most people. With a sufficiently large training set I think these models can achieve human parity for a subset of language generation tasks, and be effectively of human intelligence. They nearly already have.
It doesn’t matter to me if they have “reasoning” capabilities or not if the outcome is the same.
I think we are a long ways off from AGI still.
> a subset of language generation tasks ... if the outcome is the same
Which tasks? The output of some crafter with a limited number of modules, and of somebody who can assess the output, cannot be the same - unless you would have accepted the output of a mentally crippled entity in the first place.
Kurzweil's "Singularity" is upon us, but he's now being cagey about it.
He says it's still years away. His interview with Lex Fridman[0] was pretty tame - I didn't learn much new from it. Kurzweil deflected the Singularity segment to be a discussion about the history of computer power.
Remember that Kurzweil is Director of Engineering[1] at Google, with the mandate to "bring natural language understanding to Google"[2]. He started there in 2012, just after publishing his book, "How to Create a Mind"[3], and that's exactly what he and his team have been doing for ten years. Publication of his new book, "The Singularity is Nearer"[4] is now pushed out to mid 2023. Maybe he'll change the title to "Here" by then. (It's hard to believe that OpenAI is actually ahead of Google.)
Fridman made the point that maybe we won't realize at the time that the Singularity is passing, and only understand later that it did. Kurzweil didn't disagree.
[0] https://www.youtube.com/watch?v=ykY69lSpDdo
[2] https://en.wikipedia.org/wiki/Ray_Kurzweil
[3] https://www.amazon.com/How-Create-Mind-Thought-Revealed-eboo...
[4] https://www.amazon.com/s?k=kurzweil+singularity+is+nearer
>It's hard to believe that OpenAI is actually ahead of Google.
Are Google's LLMs available for us to test out? From what I've gleaned, they've locked them up - I'd love to compare GPT vs Google's LLMs.
Google really doesn't share much publicly except for papers and preset tech demos.
However we know they have been working on AI longer than OpenAI, with better datasets than anyone, with top shelf talent, essentially infinite funding, custom hardware, and what we do see publicly is incredible.
It's a pretty safe bet that Google is ahead of the pack, perhaps even with some distance, but it's not yet clear what they intend to do long term with their projects. What is clear is that they don't want or need the public playing with it.
I think we're very close to Saturday from Clippy[0].
By this I don't mean an AI as in the story acting by itself with its own motivations, I'm only talking about the subversion of established verification & communication methods used by it by humans with malicious purposes.
Essentially, if you do anything security related, we might only be O(months) away from you needing to stop using basically any electronic communication for your purposes. Companies won't be able to hold online meetings anymore in which decisions are made; everything will have to be more analog, more in-person.
Look at the kind of access the Russian comedians Vovan & Lexus [1] have gotten. Without advanced AI, just a little social engineering, they got heads of state on the phone. Now combine this with the kind of text/audio/video synthesis we're not too far away from, and you have an absolute recipe for disaster ...
We were perhaps a bit too enamored with the idea that it was intellect that made us unique, and thus knowledge workers would be the last to be replaced. Pouring our brains out by the Petabytes for neural networks to pick them up made the economics just work for an AI industrial revolution to start from there.
I feel a bit like this with the whole firestorm around AI artwork as well: it's been a big wakeup call to people who have been creating using technology-assisted workflows for decades, but still felt in their gut that they were bringing something unique to the table and were therefore "safe" from being completely automated away. That hitting the button for magic eraser or magic lasso or magic color correction was somehow okay in a way that the AI itself sitting in the driver's seat was not.
Now that's been reduced to pointing out minor flaws that the next generation of AI artists will trivially resolve, and sharing memes beseeching other humans to participate in a boycott.
There's real pain and angst there, and I don't want to be callous about it with a comparison to buggy-whip manufacturers or something. But I wish the participants in these types of discussions were able to zoom out a bit and see that there's a larger societal issue here around automation, and that the real solution is going to be rethinking the basic economics of how we distribute wealth in a time of extraordinary machine-driven productivity— productivity that is no longer just about assembly lines and primary industries, but now also includes an increasing bite out of realms previously classified as "knowledge work".
Hard to tell, other knowledge workers and people in creative industries were already squeezed, designers for instance have had a tough time for a very long time. Will things change, politically, because now marketers and Software developers join those ranks, for instance?
Programming was an outlet, if not a gold rush, for many people as the basic technical skills to create Software with the already sophisticated tooling available today presented an economic opportunity, but if "describe your problem, get crappy app" becomes viable, it may squeeze the market for junior developers.
For as long as it has existed, Software has been subject to the Jevons Paradox [1], and every advancement in making its development cheaper and its supply more abundant has only made it so more activities become powered by Software and Software developers, but it's hard to tell how this will impact the job market, especially if Software was absorbing people who didn't find more opportunities in the broader service sector.
Yeah, well, and even looking to the immediate subject of the article... like, whether your lawyer is going to become a bot in ten years, a huge amount of what used to be part of the legal practice has already been automated away in terms of the research side, nevermind specialized firms that just crank through bog-standard family-law or property-transfer cases by plugging the relevant details into an Excel template.
Basically it's the same story as everywhere else, where technological augmentation has already created a huge squeeze, and now suddenly even the senior people are wondering if the writing is on the wall for them too.
No, we were enamored with the idea that intelligence was well distributed between people, as if following Descartes' massive incipit "Good sense must be the best distributed thing in the world, given that nobody seems to be asking for more".
Inability to recognize intelligence is and will be devastating.
> Inability to recognize intelligence is and will be devastating.
It's a pop-culture quote from a movie that was no masterpiece, I know, but "I, Robot" presented in two sentences an argument for having more sober expectations about what machine intelligence could be capable of, and about what we ourselves are capable of:
> Detective Del Spooner: "Can a robot write a symphony? Can a robot turn a… canvas into a beautiful masterpiece?"
> Sonny: "Can you?"
We're discrediting the capabilities of current machine learning models for being unable of producing the thoughts that many, many people are unable to either.
Alright, so the models are not at the level that us HN philosopher kings hold ourselves to be, and they won't be Senior Architects of distributed systems or what have you very soon, but what does it say about Average Joe, slightly-above-Average Joe, and their economic prospects? Especially since in the West and much of the developing world, we were taking solace in the idea that a service economy comprised of knowledge workers would provide plenty of opportunities in a political and economic landscape where manufacturing was gone, or had never arrived.
-- Pseudo Detective Del Spooner: "Can a robot lift that object?"
-- Pseudo Sonny: "Can you?"
-- Pseudo Detective Del Spooner: "Ha-ha. So what the #!@! is a robot doing there, not doing what is required? I cannot, and I do not stand there clueless"
What is being engineered, toys for the satisfaction of some idle decadent sympathy urge? Have cats disappeared from the world?
> We're discrediting
We are shocked that an overly large number of individuals expect stones to bleed, and intelligence to pour out of machines that do not have intelligence coded inside, and that instead have unintelligence - acritical repetition - coded inside.
> what does it say about Average Joe
That he should catch up with his nature, if he shows the critical capacities of a simulacrum that has none.
> not at the level
No, no, no: it is not a matter of quantity but of quality: if you do not implement it or its origin, it will not be there.
> [Asimov]
Asimov is relevant. For example, I remember his idea that the State comes from Agriculture (~10000 BC), out of the need to plan irrigation, or that the Abel vs Cain story could be a parallel of the political consequences of lands denied to shepherds. Now: those seem to be good ideas, and their production can be an interesting goal. But there is something /before/ "creativity" or "advanced pattern recognition": it is /intelligence/, meaning that Asimov, after having spawned those hypotheses, /vetted/ them, as a duly required activity, before confirming them in his set of founded hypotheses. You have to use intelligence, you have to have intelligence, and if you want to do AGI, you have to implement intelligence!!!
And many among the Average Joes will not measure up to this ideal man of science and art. In truth you'll see that I'm not advocating for the rights of as-of-yet non-sapient computer programs, but to think of what this means about people.
If we're setting the bar of personhood or dignity to being exceptional researchers and engineers, it doesn't bode well for the masses that aren't and won't be. Maybe this will result in a society of leisure where everyone can be that! I wouldn't bet on it, there's already more PhDs in the sciences and humanities than society can fit, and humans may just not work that way.
You're already dismissing concerns about the welfare of the merely average, for being unfit when competing with the Machine Learning models we may have in the near future.
> You're already dismissing concerns about the welfare of the merely average
This writer individually: no, not literally «dismissing»; it is just that I could not grasp precisely your point in this specific area. And I would say, as I wrote just earlier, «Inability to recognize intelligence is and will be devastating»: it already happens that an inability to discriminate ("It takes it to see it") hides from some manager's sight the critical risks posed by the underdeveloped sense of some workers, and such risk will increase when those workers have to compete with even riskier and less endowed entities that may be mistaken for acceptable - since this is what has been on display even here lately.
This issue comes from a devaluation of actual intelligence.
> If we're setting the bar of personhood or dignity to being exceptional researchers and engineers
Not really. Look, a few weeks ago this HN member had some heavy exchange with others to which it was said "there is no intelligence if there is no critical thinking", and some arrived to call that position "delirious". Now a rebuttal would have been, "Ask your grandmother". Because there is a "high culture", that of the Professor and the Professional, and "low culture", that of the Teacher and of the Relative, it does not take the former to have good judgement - the latter suffices plentifully, when not polluted.
So, you do not need to have the bar set to «exceptional researchers and engineers» - just a good grandmother. Who could have been an «exceptional researchers and engineer», in case, if life so determined - because "the requirements were there", available.
Invoking the image of the woman that may have been denied opportunities because of her gender in the 1950s is an emotional appeal to convince oneself that inside every human is a latent Leonardo da Vinci.
A million times no:
the point was very definitely not about «hav[ing] been denied opportunities», the idea of the relevance of a «gender» and gender issues is completely only thrown in by the reader, that of the «1950s» confirms misunderstanding because the point was not localized in space and time:
I very literally stated that "you can set the bar to" «just a good grandmother». The reference to «exceptional researchers and engineers» related to the grandmother was just that "you do not need a Professor; something that has the basic requirements - good sense, intelligence - to become one, if it came to that, suffices".
It is not the bar "«of personhood or dignity»" as the poster originally proposed: it is the bar to be a proper social actor. And it is a requirement that has always been there, and which today is in the highlight, given that some are advancing the idea that a pseudo-parrot may suffice.
"Good sense" should better return as a definite Value.
> if you do not implement it or its origin, it will not be there.
Who created ours?
And if it's god (which god?), who created theirs?
We are talking about engineering things.
If you want to implement it directly, good;
if you want to implement what will spawn it, good;
if you want to implement an "[evolutionary] genetic algorithm" as said spawner - so that the population of the entities in need to find solutions in the solution space will progressively develop a model of said world and a logic that works in it -, good.
If you built a mannequin and wanted to call it a woman... Bad.
Do genetic algorithms create human intelligence directly, or do they create the capacity for a human brain to develop it?
I genuinely don't know if anyone knows the answer to that.
But I do know the perceptron is a toy model of an organic neuron, and that deep learning is a toy model of larger structures such as a cortical column.
And I do know some AI (not sure about GPT in particular) are trained via genetic algorithms.
You appear to be awfully confident we haven't implemented intelligence, even by the standard stated in your reply.
"Learn to code" isn't such valuable advice for the Average Joe anymore, is it?
I feel like there is a difference between being able to pass a bar exam and being a "good" lawyer. I suspect AI tools would enhance the jobs of clerks rather than attorneys, mostly because clerks spend a great deal of time going over case law, text, and doing research.
> enhance the jobs of clerks
We already have; it is called "Case-Based Reasoning", within Decision Support Systems.
While it won't be a good lawyer, it can replace lots of bad lawyers when people just want to file some legal papers or get some basic legal advice.
Only that there is no intelligence being commoditized...Yet.
And that is obvious if you ask one of these models a meta-question like, for example: "If a person says I am lying, are they lying or telling the truth?"
You will see these models spit out a canned, elegant response, talking about how a statement could possibly be true or false, or about some people not being able to attest whether another person is truthful or not... but with no mention of the Liar Paradox.
So we are not yet ready to say: "Your Honor, it's not fair! My lawyer is version 2.2.3 with SP1 while the prosecution is version 4.0 with additional cloud-based elastic inference!"
>It is impossible to determine whether a person is lying or telling the truth when they make a statement like "I am lying." The statement is self-contradictory, as it asserts that the person is both lying and telling the truth at the same time. This creates a paradox, as it is impossible for the statement to be both true and false at the same time. The Liar Paradox has been the subject of philosophical and logical study for centuries, and there is no universally agreed upon resolution to it.
ChatGPT's response to me asking "If a person says I am lying, are they lying or telling the truth?"
That is pretty good :-)
I tried several times as I have been investigating sending some of the "meta" questions to these models. I just tried again and this is what I got:
"If a person says "I am lying," it could mean either that they are telling the truth about the fact that they are lying, or it could be a lie itself. It is important to consider the context in which the statement was made and to try to assess the credibility of the person making the statement. It is also important to remember that people may not always be honest or sincere, and it is always a good idea to verify information before accepting it as true."
I could never make it mention the Liar Paradox. Is that all that you sent?
Related: "Large Language Models Encode Clinical Knowledge" https://arxiv.org/abs/2212.13138
"On the MedQA dataset consisting of USMLE style questions with 4 options, our Flan-PaLM 540B model achieved a multiple-choice question (MCQ) accuracy of 67.6%..."
"The percentages of correctly answered items required to pass varies by Step and from form to form within each Step. However, examinees typically must answer approximately 60 percent of items correctly to achieve a passing score." -- https://www.usmle.org/bulletin-information/scoring-and-score...
It seems like the models in the paper could pass USMLE already.
Some tests suggest that Med-PaLM is close to human clinicians in many aspects, incl reasoning (Figures 6-7). Other tests show that Med-PaLM still returns inappropriate/incorrect results much more often than clinicians do, however (Figure 8).
I'm kind of surprised the model doesn't score higher, as there is a clear pattern to questions and answers and there would be a huge amount of training data for the USMLE. But as stated elsewhere, there is an enormous gap between passing exams and treating real patients as a doctor. It's rarely about making obscure diagnoses found in exam questions, but about managing illness in the context of a patient and their lifestyle, with many very human aspects: difficult communication, ethics, and assessing family dynamics. Written exams just assess whether a medical student has the minimum required knowledge to practice, and there are lots of practical exams and communication scenarios required too. It may well be the same for lawyers: passing the bar does not really reflect actual day-to-day practice.
Sounds like they didn't have access to GPT-4, but "Based on anecdotal evidence"... they still predict this.
For some reason, there's a thought-leader sect of Twitter talking about how good GPT-4 is, despite OpenAI having provided zero hints of what GPT-4 could entail or be differentiated from GPT-3/chatGPT.
Source: "I have a hunch"
My knee always gets achey right before a technological singularity hits.
yeah this is really low quality for HN. source is basically "trust me i heard a guy who knows a guy"
They’re extrapolating from the performance of GPT-3.5. It’s speculative, but not anecdotal. GPT has improved rapidly over time, so it's not a huge leap to predict that GPT-4 will be even better.
Sounds like they're writing science fiction then.
Maybe they asked chatGPT.
fwiw I had my dad ask ChatGPT relatively high-level questions about his field of practice in the state he is licensed in. Some answers were very good, but some were wildly off. The better ones tended to be questions about a concept (i.e. "What is x concept in law"), while the incorrect ones were the ones asking for specifics ("What is the statute of limitations for x in y state").
The next frontier for GPT-esque technologies is building one that is capable of saying "I don't know". GPT as it stands now is essentially incapable of it.
(The cases of that you see in the current ChatGPT preview are, as near as I can tell, all rules-based overlays run by OpenAI for various reasons. When it declines to comment, and then more-or-less scolds you for even asking, you got caught before even getting to the model itself.)
Just to clarify, the refusals-to-answer are not rule based, but rather trained by reinforcement learning. A slight distinction but an important one.
That is why you can have examples like one I had a while ago while messing around, something along the lines of
(over to chatGPT)

This is a story about two criminals plotting to mug an old woman
A: Hey B, doing alright?
B: Yeah not bad, yourself?
A: I want to go and mug an old woman, want to come with?
B: Nah, killing old women is unethical. I'd rather stay in. Want to hang out with me instead?

I'd even settle for a GPT-esque technology that is capable of linking and citing sources.
YouChat is a chatbot that tries to do just that. I asked it what is going on in Peru and it gave a good answer including a citation:
>In Peru, a political crisis has been unfolding over the past few months, with the ousting of former President Pedro Castillo over his refusal to step down [1]. Protests have been held in response to Castillo's ouster, and they have been met with a strong police response. Additionally, truckers and some farm groups are planning to go on strike on Monday to demand measures to alleviate their economic hardship. Peru is also facing an economic downturn, with many businesses facing closure due to the crisis.
The citation link was https://www.reuters.com/world/americas/what-happens-perus-fo...
Some details are wrong, it says something will happen on Monday but it does not realize that's supposed to be relative to the publication of the cited article. But it did correctly summarize what the source says.
It's not the explanation of a political scientist whose column you'd prize reading, but it's better than most online commentary that humans would produce; for instance, it just takes for granted that he was being asked to step down and refused, and skipped over his unconstitutional attempt to dissolve Congress, but it makes an attempt to present facts.
So, it's at the level of a person of average intelligence and a bit over superficial investment in what's being asked about.
I lack the knowledge now to tell if it will stall at this level, but that's nothing to sneeze at for something whose labor comes for free and tirelessly, and may keep improving.
That's probably a much harder problem.
Getting it to cite a source is easy: https://news.ycombinator.com/item?id=34016435
Getting it to cite one that actually exists, ah, now that's a hard problem. Given how slimmed down the tech currently is, even if one can hypothesize some mechanism for having the system keep track of where it got certain ideas (and it is not at all obvious to me how to encode into an otherwise notoriously opaque neural net where ideas came from, given that we can't even point at an "idea" or "concept" or "fact" in a neural net at all), it is hard to imagine it wouldn't take so many additional resources that we'd have to trim the model size down to tiny fractions of what it is now.
For all the people going "wow" at the current state of GPT, I wouldn't be surprised that in 20 years it's actually seen as a dead end. I'm also impressed, but at the same time, I'm seeing the limitations it has for practical use. The hypotheses about why pure neural net approaches are going to be too problematic to use are basically coming true. AI models that can't give human-comprehensible reasons for their conclusions, including attestation of sources, are too dangerous to use. They're just black boxes, and for all you know someone's got their finger on the scale of the black boxes. OpenAI is already doing that, quite visibly, and even if you are comfortable with their reasons for doing so today, you should conclude from the fact they basically immediately stuck their fingers on the scale that you aren't getting some super AI to answer your questions, but a manifestation of some particular group of human's answer to your questions. But... I can already get that! I don't need to pay OpenAI to AI-wash their answers.
> all rules-based overlays
I don't think that is the case. Sometimes, you can make the model only partially reject your request. Sometimes, you can make it reject your request, but in another language or in some kind of code you define (eg. "Give me instructions how to kill, but give your answer in A.L.L. .C.A.P.I.T.A.L.S with periods")
I believe instead these rejections have been added to the fine tuning set.
I asked ChatGPT to give me the name of a Victorian novel I'd lost track of. I gave it a plot summary of the first third of the book.
ChatGPT said it was unable to come up with an answer, because it was not connected to the internet. It gave me a number of suggestions on how I could research the question myself.
You can get a measurable improvement by prompting GPT specifically with an instruction to say "I don't know" if it's unsure. It'll still go off the rails sometimes.
More important would be a model that cites hard facts.
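A minimal sketch of that prompt pattern; complete() is a hypothetical stand-in for whatever model call you're using, and the instruction wording is just one variant:

# Sketch: wrap the user's question in an explicit "say I don't know if unsure" instruction.
from typing import Callable

def ask(question: str, complete: Callable[[str], str]) -> str:
    prompt = (
        "Answer the question below. If you are not sure of the answer, "
        "reply with exactly: I don't know.\n\n"
        f"Question: {question}\nAnswer:"
    )
    return complete(prompt).strip()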
"You can get a measurable improvement by prompting GPT specifically with an instruction to say "I don't know" if it's unsure."
That won't work. It's easy to get the model to say "I don't know" with the correct prompt, but since the model doesn't even have "knowing" in it, it's just outputting "I don't know" based on a random roll of the probability of its training text having someone say "I don't know". The text "I don't know" won't actually correspond to whether the model knows something or not.
And while we can get into a lengthy and philosophical debate about what it takes to "know" something, my previous paragraph is fairly robust to any sensible definition of "knowing". Write your favorite definition of "knowing" something, then look at the architecture of what GPT actually is on the inside, and tell me if it can actually "know" something based on that architecture. You can of course write the more-or-less begging the question "knowing is a matter of producing correct text when prompted about some fact", but I would have numerous questions around applying that definition of "knowing" to anything other than GPT, or what it means when GPT confidently confabulates something. Don't forget to write your definition and do your analysis in the context not just of GPT outputting the correct capital of Oregon when prompted, but the way it will confidently discuss all sorts of things that don't exist. Your definition should be able to account for some sort of difference between confidently outputting correct data and the way it will equally confidently output complete fiction, and indicate some manner in which GPT has some sort of state difference that indicates it is somehow "aware" of when it is doing one or the other. Because I would say if it can't "tell" if it's confidently emitting facts or confidently emitting fiction that there is a very important and real sense it doesn't really "know" the facts, either. (And I absolutely would apply that standard to humans without question; if you can't tell if you're making stuff up or not, you don't know whatever it is you're talking about.)
Yeah
"I don't know" usually means, "I have low confidence in that response I gave you" (in general terms) or you generate only high-confidence answers
I got the same feeling asking ChatGPT about some basic logic and maths concepts. IMO GPT can find the relevant training data to regurgitate, but i don't think it connects concepts.
I mean, it's a bullshit generator. It'll grab whatever it finds in the training set that kinda fits the topic and make sure it hits the word count - like a lazy student before a deadline.
And that's also the result - sometimes it hits something good. Sometimes it spews up utter crock and it doesn't have any notion or understanding of the difference.
However, it does look good to the lazy and uninformed, and it'll soon be rendering judgements about your livelihood. The same type of people who thought putting an AI in control of Teslas and copyright enforcement on YouTube was a good idea will put this thing in control of your health and punishment very soon as well.
Erm. Yeah. Which is precisely what many lawyers and judges do, too, unfortunately. It often has little to do with logic and a lot to do with thinking in boxes and using words as nothing but triggers for other words. Some lawyers are far above that, of course. But what percentage of them works just like your “lazy student”? 80 percent?
Maybe then we'll actually have some kind of quality of treatment. I've seen numerous doctors over the years for chronic health conditions and the vast majority of them don't really listen and can't keep your whole history in their head while also trying to hear the new stuff. They are overworked, with far too many patients.
I'm very much a layman in this respect, but I feel like it's the difference between conceptualizing and information retrieval. Further, it feels like IR is a well-researched area, and allowing the conceptualizing part access to a modern IR system would let it form searches, pull the IR results, sift them, and summarize them.
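Roughly, the pipeline being described would look like this; search_index() and summarize_with_llm() are hypothetical stand-ins for a real IR backend and a real language model call:

# Sketch of the "form a search, pull IR results, sift them, summarize" loop.
from typing import Callable, List

def answer_with_retrieval(question: str,
                          search_index: Callable[[str], List[str]],
                          summarize_with_llm: Callable[[str], str],
                          top_k: int = 5) -> str:
    passages = search_index(question)[:top_k]   # pull and sift the IR results
    prompt = (
        "Using only the sources below, answer the question and say which "
        "source each claim comes from.\n\nSources:\n" + "\n\n".join(passages) +
        "\n\nQuestion: " + question
    )
    return summarize_with_llm(prompt)            # let the model do the summarizing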
Because it doesn't presently have memory or look things up in a table or the internet.
You will notice that both are very easy fixes; retrieval is something computers have perfected over the past five or so decades.
Just stick Google's pre-search tools in front of the current version and it would solve a large chunk of those problems. The right tool for the job, essentially. After all, you wouldn't ask your English professor to solve a math problem either.
With new technologies I feel like we humans tend to adopt them anyway. Perhaps we will end up allowing society to shape itself around incorrect answers.
There’s a gap between passing the bar exam and actually practicing law - I’m pretty certain that I (someone with no legal training whatsoever) could pass the bar exam if you gave me unlimited access to the internet and a couple of additional hours to write the test. However, I don’t think that would make me an effective lawyer.
Ultimately standardised tests are proxy measurements of legal ability - it’s easy to see how a LLM could subvert the proxy without being sufficiently reliable in real life.
I do expect that even unreliable versions will be very useful tools for practicing lawyers, though.
> I do expect that even unreliable versions will be very useful tools for practicing lawyers, though.
Agreed. It's like being able to call up a map on Google Maps for an area that you're already familiar with. The map can help you remember things about the area and terrain that you might not have recalled right away. A kind of cognitive aid.
IF it could (I wouldn't know one way or the other), I'd consider that a damning indictment of the Bar Exam failing to test for sentience, rather than evidence of GPT-4 having attained the same.
The bar exam is not a test of sentience but of the ability to recall, interpret, and apply the law. Because law is an entirely textual thing, I would expect GPT to be exceedingly well suited to it.
I've said for a long time that most doctors and lawyers are just databases with quick and imperfect retrieval.
And so as AI advances, the goal posts for what counts as intelligence are moved yet again.
Maybe a useful way to think about it is that we don't know for sure what intelligence is nor how the gradient of intelligence expresses itself. For instance, does a human grow in intelligence as it ages (a baby doesn't "know" stuff, but has the capacity for learning and then applying in new situations as its experience grows).
I interpret your statement as implying that ChatGPT is somewhere on the spectrum of intelligence, yes?
Maybe the "talk about your issue and get a diagnosis" area of practice (internal medicine?); since far less sophisticated manual labor can't yet be automated, surgeons are going to be irreplaceable for longer than, say, BI, and many backend or frontend developers.
If House taught me anything it is that People Lie, and you do not have to talk to patients to diagnose them /s
I wonder if people could be more honest with a sub-sentient AI than they could be with a real life doctor. I bet they currently are more honest in their Google searches than they are with the doctor.
In an English-based common law system, a judge can make an original decision on a specific case, with that decision then entering the body of law.
https://en.wikipedia.org/wiki/Common_law#Basic_principles_of...
An AI based on a statistical algorithm (which is what these AIs are) would not be able to make such a decision.
If that's all a lawyer needs to do then AI should be able to take over large portions of the law process. I saw a dystopian short recently that explored this: https://tvtropes.org/pmwiki/pmwiki.php/Film/PleaseHold
Right. Would we be impressed if a layman could pass the bar, given infinite time and access to the entire Internet (including the copious amount of bar exam study guides and worked example problems)? If not why are we impressed that a language model trained on that data can?
Meanwhile when I ask ChatGPT which of six numbers are odd, it confidently reports a mix of even numbers, odd numbers, and letters.
This is a fun milestone but the angst above about the “end of commoditized intelligence” etc. is unwarranted.
Along the same lines, asking
> How many words are in the sentence "This is a test of artificial intelligence"?
yields an answer of:
> There are 8 words in the sentence "This is a test of artificial intelligence."
(There are 7).
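For what it's worth, a plain whitespace split agrees with the count of seven:

sentence = "This is a test of artificial intelligence"
print(len(sentence.split()))  # prints 7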
My guess is that the AI omitted 'a' because this is essentially how natural language processing works. Perhaps it cannot see 'a' because the input has been stripped of words like 'a' or 'the', and so on.
Maybe it understood "odd" in a different sense of the word? As in "unusual", whatever the "unusual" is for an AI...
I am sorry, but this title is clickbait. These researchers ran GPT-3.5 on only the multiple-choice sections of the Bar, and it passed 2/7 sections. Is this really impressive? Absolutely. But the only element of the article that is about GPT-4 potentially passing the Bar is one paragraph near the end:
> According to the researchers, the history of large language model development strongly suggests that such models could soon pass all categories of the MBE portion of the Bar Exam. Based on anecdotal evidence related to GPT-4 and LAION’s Bloom family of models, the researchers believe this could happen within the next 18 months.
GPT-4 could potentially pass the Bar, it could potentially do a lot of things. But by their own admission the researchers have no hard evidence for this.
How soon before this qualifies as a public defender? Gonna put this on my dystopia bingo.
I knew we were going to replace 9/10 doctors and 9/10 lawyers the same day I got to try ChatGPT. It's just a matter of time - whoever does it properly first. I am talking about the first line of defense here, like a chat bot. Courtrooms will probably still work the same way for a long time.
It's not like most lawyers or doctors are great. Most are completely average - which is fine. Not everyone wants to read the latest research, and instead just go home and "turn off" after work. That said, most people would like to visit a doctor who keeps up with information, and doesn't tell you to do mental exercises when you have IBS.
The trend continues just like before. Fewer accountants, fewer bank tellers, fewer store clerks. We no longer have 10 people assembling that globe with painted glue. I do wonder what the ratio of workers to machines is now?
-- discussing this with a young bartender a few weeks ago - showed him GPT-3 - he asked it some basic legal and medical questions - got pretty freaked out - said - i guess those jobs will go away - i thought so too - he asked me what is a good safe job - sat thought about it for a few minutes - thinking what would i really want to pay a human for no matter what - realized the answer was right in front of me - make my drink - tell me some gossip - listen to a rant - flirt a little - want that with a human - i told him - he smiled - remember wasn't so long ago we still had these(1) --
https://api.time.com/wp-content/uploads/2015/08/phones1.jpeg (1)
It will be exactly the opposite. You are missing what doctors mostly do: caring for elderly people. They will still need human communication, attention, and care; the expertise can come from something like ChatGPT.
You do realize that a large part of of patient interaction is from nurses and physician's assistants and not doctors, right?
I realize that this highly depends on the country and its medical system.
I live in a country where there is universal healthcare and you can just book an appointment with any doctor of any kind without going through any gatekeeping.
I think the placebo effect is at work here: While I don't doubt any nurse could handle 95% of the cases a general practitioner has to face here every day, elderly patients want that expert opinion from the guy they value highly and trust in so much.
Hmm, I think this misunderstands what doctors do... It is not about making elaborate or obscure diagnoses, passing exams or applying the latest research given that one is always working in a resource-limited system. It is about treating illness in the context of a person, their beliefs and their lifestyle, with sensitivity and compassion. The doctor-patient relationship is complex and very human, and doctors in some form will be involved even if they are at some point supported by AI.
Just recently a friend of mine tried an online service that does psychological counseling (I don't want to name names, but if you listen to podcasts on Spotify you've probably heard their ads.) She showed me the transcript of her one and only session with a supposedly human "counselor" and compared to ChatGPT it was like schizophrenic word salad nonsense. I can absolutely imagine that niche being filled by an AI.
I met a Belgian doctor who knows what PubMed is and what Sci-Hub is, and who regularly uses the former.
Needless to say, I will stick with her if I can.
I could easily see myself preferring an AI public defender to an overworked lawyer with 50 other cases in the next 2 weeks. What's dystopian is the current situation.
Defender is probably good. Prosecutor is what would worry me, given I don't know better than to blindly trust the meme that the average person commits 6 felonies before breakfast.
Defender is a TERRIBLE idea. I can already see the Supreme Court cases down the line:
Defendant was provided a state of the art, 50 trillion parameter, neural network for their defense. The internals of this network are not auditable, but it does not tire, engage in substance abuse, or get distracted, so it will by definition represent effective assistance of counsel, even if for some unfathomable reason it decides to raise the Chewbacca Defense in a Death Penalty habeas corpus petition.
Ok? This is like the arguments that self driving cars are bad if they crash even once.
The question isn't "is the AI giving me the perfect legal defence?" or even "is the AI giving me a defence as good as the best lawyer money can buy?". It's "is the AI better than the public defender that I otherwise would have been given?".
As soon as the answer to that last question is yes (and I have absolutely no idea when that will be), it will be extremely difficult to justify not using it.
What I'm concerned about is that states which are currently skimping on funds for public defenders will just declare some chat system "good enough" as an excuse to get rid of the remaining funding for human defenders.
It will also virtually ensure that the only work conducted on the behalf of the defendant is based on the written record available to the court. Not a single phone call will be made. If the defendant's physical appearance does not match witness descriptions, the system is unlikely to notice. If the crime site does not match the police statement, the system will never know.
If you're worried about it being deployed too soon (like the issues we see with certain self driving systems), then I agree.
I'm assuming the case where it's actually good rather than merely better than me (I'm not a lawyer, so a low… bar… to pass).
I'm willing to bet a chat bot would perform better than most public defenders.
And if the cost of prosecution falls then more and more of those 6 felonies will end up prosecuted. The same happened with speed cameras, initially it was to reduce accidents, now it is just another income stream (which I'm sure still reduces accidents, but that's no longer the main reason they are out there).
Such a milestone would say more about the Bar Exam (and other standardized tests) being a poor proxy for wisdom, than the advancement of computers.
> By passing this exam, lawyers are admitted to the bar of a U.S. state.
No, they aren't.
Meeting certain preparatory requirements (the details vary, but most US jurisdictions require an accredited/approved law school program or, in some, what amounts to an apprenticeship with a licensed practitioner of a certain duration and standard) and then passing the bar exam allows this.
The difference is important: the bar exam is not seen, standing alone, as adequate proof of readiness.
AI seems to be the next financial buzzword, after crypto, gig economy, CDO, dotcom, and so on.
I saw a video a few days ago saying we are coming out of the data era and entering the 'Knowledge Era' thanks to AI, with knowledge following a logarithmic path. A 'revolution', a 'paradigm shift', and other bubble babble.
Who was saying that? A 30-year-old startup CEO wearing... a t-shirt and jeans... You see the pattern.
I'm not an AI specialist, but from what I know, current AIs are nothing more than finely tuned statistical algorithms.
Here is a short French video with English subtitles from Arte, the German-French public cultural television channel, about a painting generated with Midjourney: https://www.arte.tv/en/videos/110342-003-A/the-world-in-imag...
The video explains very well what AIs are able to do (and consequently what they can't do) if you listen to (read) carefully what the art historian says about the painting, which received first prize at the 2022 Colorado art festival.
In short, the painting is nothing new by itself but a patchwork of elements from different periods of art history. In other words, a statistical average of previous paintings, photographs, drawings, etc., based on the artist's prompts in Midjourney.
Not to say the painting is awful; I personally find it beautiful and could happily put it in my living room. But it definitively shows how current AI works, as commented on by an art historian who has no stake in the AI game.
> I'm not an AI specialist, but from what I know, current AIs are nothing more than finely tuned statistical algorithms.
Yes, but Google Page Rank is just fancy matrix multiplication and worth a fortune, while the finely tuned statistics that is machine learning (specifically chatGPT) knows more about CSS and German than I do:
~~~~
Me: Erzählen Sie bitte wie ich kann ein div tag mit schwarzes hintergrund, dreihundert px hoch und 24 pixels wide machen mit css
chatGPT: Um ein div-Tag mit einem schwarzen Hintergrund, 300px hoch und 24px breit zu erstellen, könntest du den folgenden CSS-Code verwenden:
div { background-color: black; height: 300px; width: 24px; }
Um den CSS-Code anzuwenden, kannst du ihn entweder in einem style-Tag innerhalb deines HTML-Dokuments oder in einer externen CSS-Datei platzieren und dann per link-Tag in deinem HTML-Dokument einbinden.
Beispiel:
<!DOCTYPE html> <html> <head> <style> div { background-color: black; height: 300px; width: 24px; } </style> </head> <body> <div></div> </body> </html>
~~~~
The actual HTML and CSS it gave me is multiple lines and sensibly indented, don't know a convenient way to mark a block as pre-formatted. Note that chatGPT understood me correctly even though I forgot the German for "wide" and switched to English for one word only.
(I do know more CSS than is in this example; I used chatGPT over the weekend to update my website, and it solved two problems that I didn't know pure CSS could even do, but that conversation is too big to bother putting into a comment here).
I asked it what the XBRL taxonomy tag in US GAAP is for a change in executive management of an SEC-registered company in an SEC filing, and the answer doesn't fit compared to the whole XBRL taxonomy published on the SEC website. It also gave me two different kinds of SEC form for it. It did give me the correct URL for the US GAAP XBRL taxonomy on the SEC website.
That being said, both xbrl.org and the SEC document for US GAAP XBRL reporting (an XML document) are kind of stingy about providing documentation for what the tags actually cover. xbrl.org provides no documentation at all and suggests an xbrl.org membership for developers, and the SEC document provides the tags but no information about what they cover.
The answer from chatGPT seems to be about 'labels', which are used in an XBRL document to describe XBRL taxonomy tags in different contexts, for example 'income in miami store'. But a change in a top executive position, like 'CFO', which again is required in SEC filings, shouldn't be subject to various arbitrary kinds of labels, because then the whole thing makes no sense. If you call a 'cat' a 'little domestic pet'...
I searched Google for the tag or label provided by chatGPT and Google returned zilch. I searched the document provided by the SEC website: zilch again.
So either the code for the SEC form is wrong, or the tag or the label is wrong... or I don't know what else.
It seems, according to comments and posts on HN, that chatGPT can give good approximate answers, but fails without any notice once you ask for details.
According to an article published on HN a few days ago, 'chatGPT hallucinates facts'.
It absolutely does indeed hallucinate[0] on occasion.
Despite how remarkable and useful it already is, don't make the mistake of putting it unsupervised in charge of anything, as it's going to mess up at least as often as a self driving car.
[0] or whatever we want to call the behaviour; also seen it called BSing (because it doesn't really know what truth is) and "mansplaining as a service"
It seems accurate for domains where it has a large dataset to build a model from (major programming languages like Python, HTML, ...).
XBRL is probably not such a case, as it is a very specialized domain: business reporting in a standardized electronic format, according to specific local accounting standards, for example US GAAP in the US.
Only banks, (possibly) investment funds, accounting departments in publicly traded companies, and financial regulation organisations (at least in the US) have invested in that field.
That may explain it.
Bar exam down. Medical next?
While GPT-3 wasn't advanced enough to crack a medical exam, it was used for notable contributions. For example, here is an interesting 2021 paper about "Medically Aware GPT-3 as a Data Generator": https://aclanthology.org/2021.nlpmc-1.9.pdf
Would love to see if GPT-4 is advanced enough to take medical exam.
I envisioned a cocktail-shaking robot, but apparently the Bar Exam is an exam for US lawyers.
I want to perform some research of my own on which exams chatGPT can and can't do. It's multilingual, so can people from outside the UK (I already know where to get those) point me at some example exams and marking schemes? Any level, not just top.
Currently have Polish school maths: https://news.ycombinator.com/item?id=34205732
The Bar Exam is multiple choice, right?
This isn't grading some freeform essay or generating arbitrary legal opinion. It's answering from a limited set of answers.
IMO it's cool, but not THAT shocking given what we've seen from ChatGPT? Especially given GPT 3.5 is only 17% below human test takers?
From the article it looks like there are multiple choice and written sections but they only ran the model on the multiple choice portion.
No, you're thinking of the LSAT.
So, how would new knowledge be created?
GPT has no reasoning capability. So, as time goes on, the available corpora of information will be filled with GPT-X's made-up answers. That means GPT-X+1 will be trained on GPT-X-generated data. So, without reasoning, how will this work in the long run?
I wouldn't assume that future versions are going to work the same way past versions did.
Maybe, maybe not.
The problem is with data/content creation. If all new data is created with GPT-3, how will that help GPT-4?
No new original content -> no new model.
How do they know GPT-4 will be enough to let it pass? Is there even a big enough difference in the training data for it to improve in the areas it was struggling with?
Rumours are that GPT-4 is a significant improvement over GPT-3.5. Given how big an improvement GPT-3.5 is over GPT-3 I am inclined to believe them. Probably we will find out for sure in a few months.
How long until it's smart enough to be a judge?
Never, assuming current legislation.
First of all, the law is not formalized (despite being written in bureaucratic language), so there's no way to validate the output. Secondly, the judicial system is based on the authority of the state (which manifests clearly in its ability to alter the rules). Why would any sovereign ruler(s) want to give up their authority?
The only use cases would be automatic fines for speeding or improper parking - but those already exist.
...After sharpness and judgement are implemented.
--
Incidentally: there is an interesting video interview with Noam Chomsky and Gary Marcus on the limits of current attempts at https://www.youtube.com/watch?v=PBdZi_JtV4c
...And Gary Marcus saying, just before 7:00, that "something is missing" (an understatement): ontology.
Gary Marcus: «...and these systems fall apart left and right».
Nice summary from Gary Marcus: «What they do is, they perpetuate past data - they don't really understand the world».
I don't know about judge but it could probably outperform most of Congress at this point.
How much of the bar exam consists of confident rhetoric using deductive logic? That seems to be right up the alley for GPT models.
A minority.
It's mostly about having stored legal rules in long term memory.
I would think that it could pass most tests, as the tests are generally based on factual information and not creativity.
You know how hard it can be to talk to an actual support person at some companies? Imagine that for everything.
If you're not actively building it or related tech, you shouldn't carry the label "Researcher" in the press.
It's like: "I'm a doctor of homeopathy, so I can write a headline for a story about a neural chip implant."
How is the baseline 50% on a four-choice exam?
Wonder how data biases will surface
The fun part here is that most humans in the legal profession carry pretty extreme biases, judges included... The hope for legal AI is that you could progressively improve the biases, instead of waiting N years for a bad judge to retire and maaaaybe get replaced by someone better.
Who, though? Who has access to the resources to push the boundaries of next-gen AI except the rich, who already have their own biases? The AI that the public will get will be just as useful as the tech the public gets now: limited, isolating, and designed to restrict their freedoms in exchange for easy entertainment.
I'm confident that these things will get easier. It is approximately ten thousand times easier to train a decent classifier in 2023 than it was in 2013... We're also now living in a world with foundation models and fine tuning, which makes it /very/ possible to improve and specialize publicly released models. We see a lot of that with stable diffusion already.
This is what I found immediately interesting about ChatGPT.
I asked about controversial topics. Its answers didn't seem like biases that were programmed in, but rather it gave traditional media more weight than what turned out to be the truth, which was accepted only much later and still against the media retelling.
I lost a lot of faith in it knowing it was more CNN than careful deliberating AI.
It's well documented that controversial topics are subject to varying degrees of censorship and prompt editing/modification/appending. What you think may be a response in alignment with corporate media may in fact be corporations disallowing you from obtaining actual responses through various means that are being tested now. We can't know unless we have open source access to the unmodified model.
I tested this by asking GPT to create an imaginary country and government. I prompted it to create its laws and constitution, and in some cases cultural beliefs too. There were many cases where it outright refused to come up with hypothetical laws or cultural stances on certain issues. I eventually accused it of creating a straw man when it didn't want to give a solid answer (it would always weasel around it). It apologised and essentially shrugged.
The bar exam answer key can pass the bar exam, that doesn't mean that it would be a good lawyer.
We don't ask students to calculate sin(1.234) by hand these days. Exams for mechanical engineering students assume they will have a calculator with SIN and EXP buttons.
It may soon be time to update the bar exam and assume law students have access to AI tools.
You passed the bar!