I'm a principal engineer at Microsoft. I barely program anymore

payne.io

23 points by payneio 2 months ago · 32 comments

proc0 2 months ago

I think we're going to see a negative impact on the software industry thanks to the LLM hype. There is a property of LLM output that is hard to measure: something like the quality of the solution, which includes how well the problem is abstracted and how well the solution is decomposed so that it becomes easily scalable, resilient, etc.

The article shows how this is happening. The examples given are translating code from one programming language to another, explaining a codebase, and generating small solutions to common problems (interview questions). At the end the author jumps to the conclusion that literally anything will be possible via prompting an LLM. This does not necessarily follow, and we could be hitting a wall, if we haven't already.

What LLMs lack is creativity and novelty-seeking functions. Without these you cannot have an intelligent system. LLMs are effectively smart (like 'smart' in smart phone) knowledge bases. They have a lot of encoded knowledge and you can retrieve that knowledge with natural language. Very useful, with many great use cases like learning or even some prototyping (emphasis on some) capabilities.

If LLMs could actually write code as well as a human, even prompting would not be necessary. You could just give it an app, and tell it to improve it, fix bugs, add new features based on usage metrics. I'm sure the industry has tried this, and if it had been successful, we would have already replaced ALL programmers, not just senior programmers at large companies that already have hundreds or thousands of other engineers.

  • vmnb 2 months ago

    I also watched the Karpathy Dwarkesh interview.

    You seem to share his conviction that you, at least, are not just regurgitating slop.

  • payneioOP 2 months ago

    Yes. These are all the same points I used to believe until recently... in fact the article I wrote two months earlier was all about LLMs not being able to think like us. I still haven't squared how I can believe both things at the same time. The point of my article was to try to explain why I think otherwise now. Responding to your thoughts in sequence:

    - These systems can re-abstract and decompose things just fine. If you want to make it resilient or scalable it will follow whatever patterns you want to give it. These patterns are well known and are definitely in the training data for these models.

    - I didn't jump to the conclusion that doing small things will make anything possible. I listed a series of discoveries/innovations/patterns/whatever that we've worked on over the past two years to increase the scale of the programs that can be generated/worked-on with these systems. The point is I'm now seeing them work on systems at the level of what I would generally write at a startup, in an open source project, or in enterprise software. I'm sure we'll get some metrics soon on how functional these are for something like Windows, which, I believe is literally the world's single largest code base.

    - "creativity" and novel-seeking functions can be added to the system. I gave a recent example in my post about how I asked it to write three different approaches to integrate two code bases. In the old world this would look like handing a project off to three different developers and seeing what they came up with. You can just brush this all of with "their just knowledge bases" but then you have to explain how a knowledge base can write software that would take a human engineer a month on command. We have developed the principle "hard to do, easy to review" that helps with this, too. Give the LLM-system a task that would be tedious for a human and then make the results easy for a human to review. This allows forward progress to be made on a task at a much-accelerated pace. Finally, my post was about programming... how much creativity do you generally see in most programming teams where they take a set of requirements from the PM and the engineering manager and turn that into a code on a framework that's been handed to them. Or take the analogy back in time... how much creativity is still exhibited in assembly compilers? Once creativity has been injected into the system, it's there. Most of the work is just in implementing the decisions.

    - You hit the point that I was trying to make... and what sets something like Amplifier apart from something like Claude Code. You have to do MUCH less prompting. You can just give it an app and tell it to improve it, fix bugs, and add new features based on usage metrics. We've been doing these things for months. Your assertion that "we would have already replaced ALL programmers" is the logical next conclusion... which is why I wrote the post. Take it from someone who has been developing these systems for close to three years now... it's coming. Amplifier will not be the thing that does this... but it shows techniques and patterns that have solved enough of the "risky" parts to show that the products will be coming.
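
    One way to picture that "three approaches" step (a rough sketch, not Amplifier's actual code; ask_llm() is a hypothetical stand-in for whatever model client you use):

        # Ask for several independent designs, then hand the short write-ups
        # to a human to compare: cheap for the machine, easy to review.
        def ask_llm(prompt: str) -> str:
            raise NotImplementedError  # placeholder for your LLM client

        def propose_approaches(task: str, n: int = 3) -> list[str]:
            proposals: list[str] = []
            for i in range(n):
                prior = "\n\n".join(proposals) or "none yet"
                proposals.append(ask_llm(
                    f"Task:\n{task}\n\n"
                    f"Earlier proposals:\n{prior}\n\n"
                    f"Write proposal {i + 1} of {n}: a one-page design that takes "
                    f"a different strategy from the earlier proposals."
                ))
            return proposals  # a human reviews the one-page designs, not raw diffs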

    • Madmallard 2 months ago

      "- These systems can re-abstract and decompose things just fine. If you want to make it resilient or scalable it will follow whatever patterns you want to give it. These patterns are well known and are definitely in the training data for these models."

      No? It absolutely does not do this correctly. It does what "looks" right. Not what IS right. And that ends up being wrong literally the majority of the time for anything even mildly complex.

      " I'm sure we'll get some metrics soon on how functional these are for something like Windows, which, I believe is literally the world's single largest code base."

      Now that's just not true at all. Windows doesn't even come close to Google's code-base.

      "and then make the results easy for a human to review."

      From what an LLM produces, this is in no way doable for anything that isn't completely trivial. Software is genuinely hard and time-consuming if you want it to actually not be brittle, to address the things it needs to, and to make trade-offs that are NOT detrimental to the future of your product.

      • payneioOP 2 months ago

        How are you verifying your claims? I'm actually seeing results that you describe as being impossible.

    • proc0 2 months ago

      Well, if it replaces all engineers, then I'm not up to date on the capabilities of the state of the art. So far I've just used the available commercial models. I quickly hit walls when I try to push their limits even a little.

      In theory, any prompt should result in a good output, just as if I gave it to an engineer. In practice I find that there are real limitations that require a lot of iteration and "handholding", unless I want something that has already been solved and whose solution is widely available. One simple example: I prompted for a physics simulation in C++ with a physics library, and it got a good portion of it correct, but the code didn't compile. When it compiled, it didn't work, and when it worked it wasn't even remotely close to being "good" in the sense of how a human engineer would judge their output if I were to ask for the same thing, not to mention making it production ready or multiplatform. I just have not experienced any LLM capable of taking ANY prompt... but because they do complete some prompts, and those prompts do have some value, it seems as if the possibilities are endless.

      This is a lot easier to see with generative image models, e.g. Flux, Sora, etc. We can see amazing examples, but does that mean they can generate anything I can imagine and prompt for? In my experience, not even close. I can imagine some wild things and I can express them in whatever detail is necessary. I have experimented with generative models and it turns out that they have real limitations as to what they can "imagine". Maybe they can generate a car driving along a road in the mountains, and it's rendered perfectly, but when you change the prompt to something less generic, e.g. adding more details like the car model or the time of day, it starts to break down. When you try to prompt something completely wild, e.g. make the car transform into a robot and do a back flip, it fails spectacularly. There is no "logic" to what it can or cannot generate, as one might think. A talented artist who can create a 3d scene with a car can also create a scene with a car transforming into a robot (granted, it might take more time and require experimentation).

      The main point is that there is a creative capability that LLMs are lacking and this will translate to engineering in some form but it's not something that can be easily measured right away. Orgs will adapt and are already extracting value from LLMs, but I'm wondering what is going to be the real long term cost.

      • payneioOP 2 months ago

        So, what we do is automate the hand-holding. In your physics simulation example, you can have the system attempt to compile on every change and fix any errors it finds (we use strict linting, type-checking, compile errors, etc.), and you can provide a metric of "good" and have it check for that and revise/iterate as needed. What we've found particularly useful is breaking the problem into smaller pieces ("the Unix philosophy"), as the system is quite capable of extracting, composing, defining APIs, etc. over small pieces. Make larger things out of reliable smaller things, like any reasonable architecture.
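
        In code, that loop is roughly this (a minimal sketch, not Amplifier itself; ask_llm() and the "make check" gate are hypothetical stand-ins for your model client and quality gates):

            # Run the project's gates (compile, lint, type-check), feed any
            # failures back to the model, and repeat until clean or we give up.
            import subprocess
            from pathlib import Path

            def ask_llm(prompt: str) -> str:
                raise NotImplementedError  # placeholder for your LLM client

            def run_gates() -> str:
                result = subprocess.run(["make", "check"], capture_output=True, text=True)
                return "" if result.returncode == 0 else result.stdout + result.stderr

            def fix_until_clean(source: Path, max_rounds: int = 10) -> bool:
                for _ in range(max_rounds):
                    errors = run_gates()
                    if not errors:
                        return True  # all gates pass; ready for human review
                    patched = ask_llm(
                        "Fix these errors without changing intended behavior:\n"
                        f"{errors}\n\nCurrent source:\n{source.read_text()}"
                    )
                    source.write_text(patched)
                return False  # still failing after the budget; escalate to a human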

        These things are not "creative"... they are just piecing together decent infrastructure and giving the "actor" the ability to use it.

        Then break planning, design, implementation, testing, etc. apart and do the same for each phase--reduce "creativity" to process and the systems can follow the process quite nicely with minimal intervention.

        Then, any time you do need to intervene, use the system to help you automate the next thing so you don't have to intervene in the same way again next time.

        This is what we've been doing for months and it's working well.

        • proc0 2 months ago

          Right, I can see how using an agentic system like that would go a long way. However, in the context of this conversation there is a distinction between using AI models directly and architecting a system around them, because the latter means the limitations of the models are being overcome by human engineers (and at scale, since this is hard outside of enterprise). If the models were intelligent enough, this would not be needed.

          So my claim of knowledge bases still stands. An agentic system designed by humans is still a system of knowledge bases that work with natural language, and of course their capability is impressive, but I remain unconvinced they can push the boundaries like a human can. That said, maybe pushing boundaries is not needed for the majority of applications out there, which I guess is fair enough, and what we have now is good enough to make most human engineering obsolete. I guess we'll see in the near future.

duxup 2 months ago

This feels like a LinkedIn article ...

Anyway does a Principal Engineer at Microsoft typically code a lot?

  • gdulli 2 months ago

    This place is becoming as much of a dumping ground for vanity blogging as LinkedIn already is. There's no discouragement of accounts like this that have no activity here but self promotion.

    • payneioOP 2 months ago

      What's wrong with "self promotion"? The point of this space has always been promoting projects. That's what Y Combinator is all about.

      • Terretta 2 months ago

        What to Submit

        On-Topic: Anything that good hackers would find interesting. That includes more than hacking and startups. If you had to reduce it to a sentence, the answer might be: anything that gratifies one's intellectual curiosity.

        Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

        https://news.ycombinator.com/newsguidelines.html

        • payneioOP 2 months ago

          Thanks for the extract. I feel quite comfortable that my post is on-topic and gratifying. I understand others may disagree (and do in nearly every post on HN).

    • duxup 2 months ago

      Agreed.

  • gloryjulio 2 months ago

    L6+ (Google scale) are mostly doc engineers who work on architecture and lead the team. I don't think this is anything new.

  • payneioOP 2 months ago

    Yes, I code a lot. My GitHub is public as are many of the projects I work on.

mawadev 2 months ago

Is it just me or do these people simply lose their grip on reality with all the organizational abstractions?

  • payneioOP 2 months ago

    Not just you. A lot of people think that, I'm sure.

    Not sure what you mean about the organizational abstractions. FWIW, I've worked in five startups (sold one), two innovation labs, and a few corporations for a few years. I feel like I've seen our industry from a lot of different perspectives and am not sure how you imagine being at Microsoft for the past 5 years would warp my brain exactly.

urbandw311er 2 months ago

This ends up being an advert for a new Microsoft tool

  • payneioOP 2 months ago

    It's not, actually. It's a glimpse into a research project being built openly and made freely available, by the engineers building it, to anyone who wants to take a look.

    The products will come months from now and will be introduced by the marketing team.

Madmallard 2 months ago

If only AI were not completely and utterly useless for any unique problems for which there isn't an extreme amount of available training data. You know, something any competent programmer knows and has already known for years. And these problems end up being involved in basically every single non-trivial application, and not very long into development on those applications.

If only AI didn't very readily and aggressively lead you down very bad rabbit holes when it makes large changes or implementations, even on code-bases for which there is ample training data, because that's just the nature of how it works. It doesn't fact check itself, it doesn't compare different approaches, it doesn't actually summarize and effectively utilize the "wisdom of the crowd", it just makes stuff up. It makes up whatever looks the most correct based on its training data, with some randomness added.

Turns out that's seriously unhelpful in important ways for large projects with lots of different technical and architectural decisions that have to make tradeoffs and pick a specific road among multiple, over and over again.

Really sick and tired of these AI grifters. The bubble needs to pop already so these scammers can go bankrupt and we can get back to a rational market again.

  • payneioOP 2 months ago

    I get it. I've been through cycles of this over the past three years, too. Used a lot of various tools, had a lot of disappointment, wasted a lot of time and money.

    But this is kinda the whole point of my post...

    In our system, we added fact-checking itself, comparing different approaches, summarizing and effectively utilizing the "wisdom of the crowd" (and its success over time).

    And that made it work massively better, even for non-trivial applications.

    • Madmallard 2 months ago

      You're going to have to put quotes around "fact checking" if you're using LLMs to do it.

      "comparing different approaches, summarizing and effectively utilizing the "wisdom of the crowd" (and it's success over time)"

      I fail to see how this is defensible as well.

      • payneioOP 2 months ago

        Compiling and evaluating output are types of fact checking. We've done more extensive automated evaluations of "groundedness" by extracting factual statements and seeing whether or not they are based on the input data or hallucinated. There are many techniques that work well.

        For comparisons, you can ask the model to evaluate on various axes, e.g. reliability, maintainability, cyclomatic complexity, API consistency, whatever, and they generally do fine.

        We run multi-trial evals with multiple inputs across multiple semantic and deterministic metrics to create statistical scores we use for comparisons... basically creating benchmark suites, hand-written or generated. This also works well for guiding development.
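
        A toy version of that kind of benchmark harness (hypothetical names; generate() and the metric functions stand in for the system under test and for your semantic/deterministic scorers):

            # Run each prompt several times, score every output on each metric,
            # and report mean scores for comparing approaches or revisions.
            import statistics
            from typing import Callable

            Metric = Callable[[str, str], float]  # (prompt, output) -> score in [0, 1]

            def generate(prompt: str) -> str:
                raise NotImplementedError  # placeholder for one run of the system

            def run_benchmark(prompts: list[str], metrics: dict[str, Metric],
                              trials: int = 5) -> dict[str, float]:
                scores: dict[str, list[float]] = {name: [] for name in metrics}
                for prompt in prompts:
                    for _ in range(trials):  # repeat to average out sampling noise
                        output = generate(prompt)
                        for name, metric in metrics.items():
                            scores[name].append(metric(prompt, output))
                return {name: statistics.mean(vals) for name, vals in scores.items()}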

      • payneioOP 2 months ago

        And by "wisdom of the croud", I'm referring to sharing what works well and what doesn't and building good approaches into the frameworks... encoding human expertise. We do it all the time.

  • payneioOP 2 months ago

    Also... "scammer and AI grifter"?? Damn dude. It's any early-stage open-source experiment result and, mostly, just talking about how it makes me question whether or not I'll be programming in the future. Nobody's asking for your money.

    • Madmallard 2 months ago

      My last comment wasn't really directed at you; it just reminded me of how I feel about the whole scene right now.

      • payneioOP 2 months ago

        I feel that. I've been on an emotional roller-coaster for three years now. I didn't expect any of this before then. :O

lunias 2 months ago

"Barely programs." Pffft. Read his previous article: "Wild Cloud" where he fixes society with a local cloud implementation.

https://payne.io/posts/wild-cloud/

It's not abundantly clear what makes a "wild cloud" different from, say, "some computers networked together," but I'm eagerly awaiting an update!

  • payneioOP 2 months ago

    Yes. Please read it. I'm looking for collaborators. The links in this article point to recent work on Wild Cloud so you can see where it's currently at.

    Wild Cloud is a network appliance that will let you set up a k8s cluster of Talos machines and deploy curated apps to it. It's meant to make self-hosting more accessible, which, yes, I think can help solve a lot of data sovereignty issues.

    I'm not sure what you mean by "barely programs"

    • lunias 2 months ago

      > I'm not sure what you mean by "barely programs"

      I felt like people were dumping on you in the comments for potentially "not coding a lot," but I checked and saw this is not true.

      > help solve a lot of data sovereignty issues

      I do agree with the need to solve data sovereignty issues. I'm just not sure that self-hosting isn't already accessible, or that replicating the complication of cloud architecture makes it more accessible, but maybe I don't have a good grasp of the use case.

      • payneioOP 2 months ago

        Ah! Gotcha. Thanks for the clarification.

        The use cases I'm thinking of that require cloud architecture are scaling up with GPUs (for self-hosted intelligence workloads). Also, Wild Cloud is meant to meet community needs more than individual needs (though it will do that, too), so I'm imagining needing to scale horizontally more than just vertically. I would still recommend putting things like home-assistant or a home media server on SBCs.

        It still is way more complex than I want it to be for a person to set up a local cluster, but I'm still hopeful I can make it simpler.
