How well can LLMs write COBOL?
This looks interesting. I'm working on an OpenAI-based tool for coding tasks that are too complex for ChatGPT - https://github.com/plandex-ai/plandex
It's working quite well for me, but it definitely needs some time spent on benchmarking and ironing out edge cases.
I'm especially curious how it will do on more "obscure" languages. Not that Cobol is obscure exactly--I suppose there's probably quite a bit of it in GPT-4's training considering how pervasive it is in some domains. In any case, I'll try out this benchmark and see how it goes.
> Not that Cobol is obscure exactly--I suppose there's probably quite a bit of it in GPT-4's training considering how pervasive it is in some domains
There is a huge amount of COBOL code in existence – but almost all of it is non-public code used to run businesses and governments. Very little of it is publicly source-available (whether open source or something more restrictive than that).
Unless GPT-4's training data includes non-public code bases (I doubt it), it likely has rather little COBOL code in it.
I've been using GPT4 to help me navigate a mainframe and a COBOL codebase and it knows far more than what my googling abilities manage to fish up in forums. It's actually surprisingly good at surprisingly deep mainframe topics.
No doubt its training data contains a lot of IBM manuals, probably even some commercial books on relevant topics, maybe even the contents of some of the forums you mention – and all that could be enough to correctly answer your questions.
However, for languages like Python, Java, C, C++, JavaScript, Go, etc., it also contains untold millions of lines of code slurped from places like GitHub. I really doubt it contains anywhere remotely near as much COBOL code: if you look for COBOL code in public GitHub repos, you will find very little – the vast majority of COBOL code is in-house or vendor business software, and few seem to want to make that stuff public – and what COBOL code GitHub has is mostly toy exercises or ancient stuff, not examples of significant contemporary production code. The only way OpenAI is going to get a substantial quantity of that is if multiple private parties (such as banks) give them access to their COBOL code bases – not impossible, but absent some public info saying it has happened, it seems more likely it hasn't.
I expect GPT-4 (or any LLM) is not going to perform as well on complicated programming tasks for COBOL compared to other languages. For more mainstream languages, it has millions of examples to help it do a better job, for COBOL it likely doesn't.
Look, nobody is going to perform as well on complicated programming tasks using COBOL as with Python. But knowing everything you said, I was amazed at how good it was. Try it.
Also, nobody needs to do complicated coding tasks with COBOL; it wasn't meant for that. What we do need a lot of is translating COBOL to Python or Java.
To rephrase my point: the gap between the best an LLM can do and the best an experienced human can do, is likely larger for COBOL than for more mainstream languages, simply because LLMs have a lot more opportunities to gain "experience" with those more mainstream languages than they do with COBOL.
What you are saying may well all be true, but it doesn't contradict what I'm saying.
Are you hiring for it? I don't know COBOL, but I do know Python and Java to some extent, and I enjoy esoteric legacy problems.
You can learn COBOL online now; see IBM's offering on Coursera, e.g.: https://www.coursera.org/learn/cobol-programming-vscode
I'm only hiring locally in Israel right now
It even sucks at Guile Scheme, according to my experiences with GPT 3.5.
But it probably read all the books ever published on COBOL.
Almost certainly not. There are heaps of books in libraries which nobody has scanned yet, including many on COBOL. No LLM has read those.
Whether that makes a difference depends. If you are dealing with mainstream modern COBOL (like recent versions of Micro Focus or IBM Enterprise COBOL), it probably won't. If you are dealing with some obscure legacy COBOL dialect, odds are high there is some very helpful printed book which nobody has scanned.
> I'm especially curious how it will do on more "obscure" languages.
There's definitely a lack of training data and ability (but unfortunately not confidence) in less widespread languages. It's quite bad at Pine Script, confusing versions of the language, producing unrunnable code, and being unable to correct it when given feedback.
This looks great! Can’t wait to try it out today
Considering MTOB(Machine Translation from One Book) result, where LLM in-context learns a new language from a grammar book, I wonder how LLMs fare given, say, GnuCOBOL Programmer's Guide PDF, which is easily available. It would be an interesting addition to the benchmark.
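As a rough sketch of that experiment (assuming pypdf for text extraction and the OpenAI Python client; the file path, model name, and prompts are placeholders, and the full manual may well exceed the context window without trimming):

    # Sketch: in-context "learn COBOL from one book", assuming pypdf and
    # the OpenAI Python client. Path and model name are placeholders.
    from pypdf import PdfReader
    from openai import OpenAI

    manual = "\n".join(
        (page.extract_text() or "") for page in PdfReader("gnucobol_guide.pdf").pages
    )

    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; needs a long-context model
        messages=[
            {"role": "system",
             "content": "Use only this manual as your COBOL reference:\n" + manual},
            {"role": "user",
             "content": "Write a COBOL program that prints the first 10 Fibonacci numbers."},
        ],
    )
    print(reply.choices[0].message.content)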
One thing that makes OpenAI so valuable over time is how they can take all of this expert input from the rush to test out the new technology and use it to dramatically improve the next model. Think about it: just like we see here, experts in their field, who know all of the edge cases and other fundamental aspects of what makes their own companies or projects valuable, tend to hand that knowledge over in the process.
It's kind of interesting how far this could be taken with all kinds of valuable information from people and companies, especially code and business logic. Everyone wants to say AI this and AI that to keep up with the times, and they all keep dumping all of this valuable data in for free.
Then consider the authentication tied to this data collection: if you are using a company email, etc., it helps them weed out the garbage too.
I guess we can't fight progress...
I think you’re forgetting about the use case where the LLM returns something partially correct to a discerning expert, who is still able to use the response, but does not bother with a message like “btw I had to do X to make your suggestions usable”.
That might be true, but my feeling so far is that OpenAI doesn't want to do what they'd need to do to make any one product actually good, so they're going to keep bouncing from AI thing to thing, making foundation models that have a lot of wow factor but can't really deliver on their promise because they're too closed off to integrate into a workflow that might alleviate the issues.
Other people will try to build on OpenAI stuff, find that it's not quite good enough and OpenAI doesn't care to really make it good enough because it's a lot of work, and it won't be until we get competitors that take less sexy model tech and take the time to make it REALLY GOOD at certain things that AI really makes good on its promise. I'm guessing that will be driven by people taking open source tools that are ~80% of the way there and really building a system and domain logic around it to make it excellent.
I tried to get ChatGPT to write LLVM IR last year. The results were interesting: the LLM wrote superficially correct-looking IR, but it ultimately failed to grasp the concept of SSA, as it kept reassigning SSA registers, which by definition may only be assigned once. ChatGPT can generalize across language syntax reasonably well, but it doesn't understand deeper differences in language semantics.
I tried to get ChatGPT to write 6502 assembly and it had similar issues.
FWIW, I had similar issues trying to get both it and Claude to help me with x86 assembly: it kept thinking if it added another * or some more parentheses it could get some impossible assembly to work.
Reminds me of adding *'s and &'s until my C++ code worked in college.
I got this from Gemini:
> Gemini:
> JSON: In JSON, double quotes are used to delimit strings. If a double quote needs to be included within the string itself, it's escaped by doubling it (e.g., `"This is a string with a "quote" in it."`).
(And, just an editorial note: there are no backslashes in Gemini's response / HN is displaying it correctly. Gemini's response is nonsense — wrong, but also not even internally consistent.)
> Me:
> Does your JSON example parse as valid JSON?
> Gemini:
> Yes, the JSON example I provided ("This is a string with a "quote" in it.") is actually valid JSON.
> The reason it's valid is because the doubled quote within the string is interpreted as an escaped character, representing a single literal quote within the string itself. The JSON parser recognizes the double quotes at the beginning and end as delimiters for the entire string, not the escaped quote within.
(Its example is, of course, quite malformed.)
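A quick check with Python's standard json module shows both of Gemini's claims are wrong (doubling quotes is SQL/CSV-style escaping, not JSON):

    import json

    # Gemini's "doubled quote" example:
    bad = '"This is a string with a "quote" in it."'
    try:
        json.loads(bad)
    except json.JSONDecodeError as err:
        print("invalid:", err)   # extra data after the first string ends

    # JSON actually escapes embedded quotes with a backslash:
    good = '"This is a string with a \\"quote\\" in it."'
    print(json.loads(good))      # This is a string with a "quote" in it.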
I've asked ChatGPT a fair number of Fortran questions. There are differences - Fortran is still in use in lots of places, and there are forums and documentation sites (though presumably that's true for COBOL too). But compared to Python, there is way less info out there on how to do different things, for example little Stack Overflow content.
I'd say I've had mixed results: ChatGPT definitely knows the language and can give examples, but I've also had a lot of frustrating things it wasn't able to resolve.
Fortran also has the disadvantage of numerous separate dialects/flavours that often can't be mixed. And the exact dialect is rarely mentioned in random code found somewhere.
It got a lot better with Fortran 95 and newer, but in the old world of e.g. Lahey compilers and custom commands only available there, every LLM has consistently failed me on these intricacies. I can't even blame them: when asking humans questions about these topics, you'll get all sorts of answers that are equally close to a correct solution, but almost never precisely correct.
On one hand, it'd be super cool to solve the talent shortage in the field; also, COBOL code isn't very pleasant to write.
On the other, I'm not sure I'd want tools known for poor code quality and hallucinations to write these super critical systems.
Guess there might be a copilotesque productivity booster for human developers, but I think these systems are some of the last places I'd want LLMs to contribute.
I wonder if a better goal would be valid translation of the COBOL into X language. Obviously that has its own can of worms but it seems like our goal generally should be getting critical systems to more modern languages that can be maintained more easily.
In 2009, 4 million lines of COBOL were migrated to Java using an automatic translator.
The problem is not translation - once it is ported to Java, who will maintain it? It has no specs, no tests, and it’s a spaghetti mess. Is any weirdness a bug or a feature? Nobody knows. Least of all the body-rental remote devs you hired to maintain it.
Which is worse than the same situation, but in COBOL, how exactly?
Because you have the same mess, plus all the hard-to-debug incompatibilities and weirdness added by the translator.
What people don't get about COBOL is that it is a language you can learn in a few days. It is meant to be readable for non-technical users, so COBOL itself is not the issue. Once you get past the meaning of sections, and the implicit defaults that are different from modern programming languages, it is way easier than, say, C.
Also, COBOL is basically a special-purpose language... it's designed for accounting apps and reports.
It's really not bad for them, either. Verbose, but not bad...
Thanks for sharing! Will need to dig into this
That would remove some portion of the human coders, too.
Why do people always add this "blah blah hallucinations and critical systems"? 1. People write bad and buggy code too. 2. You act like we're just blindly throwing untested LLM code at production systems.
It's just intellectually dishonest to talk this way.
They will still be helpful but we obviously need to test before we add code into systems. It goes without saying.
> It's just intellectually dishonest to talk this way.
> They will still be helpful but we obviously need to test before we add code into systems. It goes without saying.
It's not intellectually dishonest at all. It's an issue of conditioning. There's a class of developers that blindly copy and paste code from StackOverflow or the first hit on Google. They're the same class that will uncritically copy and paste ChatGPT answers.
ChatGPT is worse than SO because it's adaptive. If someone pastes in a SO answer and it doesn't immediately work the developer has to at least engage with the code. ChatGPT can be asked to refine its hallucination until it parses/compiles.
The class of developer blindly copying and pasting answers will not have the expertise to spot hallucinations, or likely even fix the inevitable bugs they introduce. Additionally, ChatGPT by its nature elides the source of its answers. At the very least a SO answer has some provenance: not only the poster, but some social signaling through votes that the answer is legitimate.
ChatGPT answers don't have any of that. It will also happily hallucinate references.
Conditioning junior developers and learners to rely on and trust AI coding is setting them up to fail. It's also going to stunt their growth as developers because they'll never gain any domain knowledge. In the meantime they'll be unknowingly sabotaging products with legit looking but broken code.
I should be worried that the very worst developers might paste bad code from ChatGPT and that's why it's dangerous? Looks an awful lot like mental gymnastics to me.
Look, we can't all just be realistic about a thing that's going to take our jobs, so we have to lean on tired old excuses instead. Instead of being reasonable, why don't you pick a team - for or against, and then fight about it on Internet forums because, well, what else are you going to do while the build compiles? Look at cat videos?
I generate my own custom cat videos with AI now tyvm.
Since the LLM sometimes generates invalid COBOL, a simple practical solution would be to use an API, let it test its code with GnuCOBOL, feed the output back, and have it try again a couple of times. I wonder what the updated benchmarks would be in that setting.
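Roughly the loop I have in mind (a sketch assuming GnuCOBOL's cobc is on PATH and the OpenAI Python client; the model name and prompts are placeholders, and it assumes the reply is bare COBOL source rather than wrapped in markdown fences):

    # Compile-and-retry loop: feed cobc's errors back to the model.
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user", "content":
                 "Write a COBOL program that prints the first 10 Fibonacci numbers."}]

    for attempt in range(5):
        reply = client.chat.completions.create(model="gpt-4-turbo",
                                               messages=messages)
        code = reply.choices[0].message.content
        with open("prog.cob", "w") as f:
            f.write(code)
        # -x builds a standalone executable; compile errors go to stderr
        result = subprocess.run(["cobc", "-x", "prog.cob"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"compiled on attempt {attempt + 1}")
            break
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content":
                         "cobc reported errors:\n" + result.stderr
                         + "\nPlease fix the program and resend it in full."})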
The general approach seems to work anyway. I tried it out with ChatGPT 3.5 and an online Cobol compiler[0], manually feeding back the output, and it managed to produce a working program on the 10th attempt (that displays the first 10 Fibonacci numbers).
Edit: Well, maybe. With the example from the article it wasn't as successful.
I tried to get ChatGPT to write 6502 assembler for the 1980s 8-bit home computer the BBC Micro. It was game, but clueless.
It depends on what you test for.
I am from the C64 demo scene, and in this regard ChatGPT is pretty useless. VIC tricks, raster timing - nothing an LLM can help with at the moment, judging from my experience with ChatGPT 4.0 so far.
Same goes for Amiga and simple blitter access for scrolling.
LLMs will be very limited here unless they receive sensory feedback repeatedly.
Pure algorithms like sorting, for example, may be doable by GPT, but the mentioned machines are used very creatively to come up with effects.
What gets reused here are techniques, not so much code itself, which gets modified and optimized for every demo and often recombined with other techniques.
Most techniques for the VIC are pretty well documented, but the timing, as well as recombining them, is the heavy lifting.
Try feeding it 8 pages of examples first? Something like this https://atariwiki.org/wiki/Wiki.jsp?page=Advanced%206502%20A...
I tried to get it to write 6502 assembler for the NES. As I recall it mistakenly thought that there was a BIOS containing a print function.
That's a little surprising; 6502 assembly is a fairly popular hobbyist thing, so I would expect the data is out there. It's also mostly pretty simple, but you do have to watch out for quirks like rotating through the carry bit.
It can write 6502, but you have to know 6502 yourself to coach it.
A while back, I asked a question here, roughly, why hasn’t someone written, say, a C to COBOL translator? Such a program might take a lot of work, but it seemed to me that with an impending dearth of COBOL programmers, there would be demand for such an app. I was informed that there were so many different COBOLs in use that the output of such a program would STILL have to be tended to by an experienced programmer in the output dialect desired. This is just the Copilot situation.
As much as everyone likes to poke fun at COBOL, the language itself really isn't the problem with maintaining/updating old COBOL systems. It's old, but it's not that bad.
The real problem is the entire ecosystem around those systems. Remember, a lot of COBOL software dates back to a time before things like relational databases. You'll be working with flat files that might, if you're very lucky, have column and record separators and useful names/documentation explaining what they are. If you're unlucky you'll have to figure out field widths from the code and infer what the fields are based on their actual usage. Oh and if you get it wrong you just messed up something related to payroll or financial compliance; enjoy the punishing fines.
That kind of stuff, more than the language, is the reason nobody wants to touch old COBOL systems.
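To make that concrete, here's a hypothetical illustration of what "inferring field widths" looks like in practice; the record layout and offsets below are invented, and in real systems they come from the COBOL record definition (the PICTURE clauses) or have to be reverse-engineered from usage:

    # Hypothetical fixed-width flat-file record with no separators.
    record = "00123SMITH     JOHN      0750000A"

    emp_id  = record[0:5]            # PIC 9(5)   -> "00123"
    surname = record[5:15].rstrip()  # PIC X(10)  -> "SMITH"
    first   = record[15:25].rstrip() # PIC X(10)  -> "JOHN"
    # Amounts are often integers with an implied decimal point:
    salary  = int(record[25:32]) / 100  # PIC 9(5)V99 -> 7500.00
    status  = record[32]             # PIC X      -> "A"

    print(emp_id, surname, first, salary, status)

Get one offset wrong and every field after it is silently garbage, which is exactly the payroll/compliance risk described above.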
Someone should test the benchmark on Claude 3 models.
It would be interesting to feed it a formal language specification of some language it hasn't seen and then ask it to write code and see how it does.
That could be a test of reasoning and reading comprehension.
I've been thinking about a benchmark designed this way for a while. It doesn't even need to be code, particularly, it could be basic reasoning problems. The key is that you define a new, random language that has never before been seen (maybe it has statistical similarity to existing languages, maybe not), create a translation key, then ask a question in that language.
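A minimal toy version of what I mean (the vocabulary and test expression here are placeholder choices):

    # Invent a never-before-seen "language" via random word substitution,
    # keep the translation key, and pose a question in it.
    import random
    import string

    def nonsense_word():
        return "".join(random.choices(string.ascii_lowercase,
                                      k=random.randint(4, 8)))

    vocabulary = ["if", "then", "else", "true", "false", "and", "or", "not"]
    key = {word: nonsense_word() for word in vocabulary}

    question = "if true and not false then true else false"
    translated = " ".join(key[w] for w in question.split())

    prompt = ("Translation key for a new language:\n"
              + "\n".join(f"{v} means {k}" for k, v in key.items())
              + "\n\nEvaluate and answer in the new language: " + translated)
    print(prompt)

Since the mapping is freshly randomized every run, the model can't have memorized it; it has to actually apply the key.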
Reasoning vs being a completion engine (I could make a guess at how well that would work)
Reasoning is a form of (logical) completion; the problem is that LLMs aren't language-agnostic in their learned semantic reasoning.
I suspect there were relatively few training examples for COBOL. It would be interesting to see the results for a system which had a significant number of such examples in the training set.
Indeed, even the author wasn't aware of modern COBOL, with IDE tooling and OOP capabilities, focusing on the classical micros instead.
The issue is less COBOL the language and more the literal business logic the COBOL is encoding. You can learn the COBOL language as easily as any other. What you can't learn as easily are the accounting rules, administration policies, and regulations any major COBOL codebase is implementing.
You'll be able to see the code multiplying a dollar value by 0.03 but not necessarily know that this is because of some statutory requirement of some Minnesota tax code that only applies to industries producing both left and right handed monkey wrenches but only if the company was incorporated before 1975. That obscure law isn't referenced in any documentation but was found by an accountant in 1982. The change was made to the code but only referenced in a paper memo with a small distribution list but all of those memos were shredded after being archived for ten years.
ChatGPT can't really help document code that's decades old and doesn't have any references to the why of the code. The how is straightforward but rarely as important as the why.
Not knowing a statutory law expressed in code is not just a COBOL thing.
There is probably a business opportunity for an AI company to build private LLMs for large enterprises trained on their own COBOL code bases. They won't find much available as open source, and individual companies tend to have significantly different coding styles tied to COBOL versions and database schemas.
Yeah I think Facebook and Google are already doing that internally.
Wow! That's a memory lane I hope to never revisit!