Here were my “Your Year with ChatGPT” 2025 stats:
I was shocked to see myself in the top 1% of messages sent and top 0.1% of users (however that’s measured), and my best guess is that I have two use cases most users don’t: asking software engineering questions and asking very specific, individual-company-based-on-SEC-filings questions. Both use cases center on very dense, technical artifacts: code and financial statements. In each case, understanding the ins and outs is truly difficult, and at least for me requires asking tons of questions to LLMs, verifying the answers against documents, rephrasing my questions, rechecking answers, etc.
So, my use cases rather than some affinity for AI put me in the “power user” bracket. Going into this year, I never expected to send 11,450+ messages to ChatGPT (30+ questions per day). Looking at my own behavior, I clearly found the technology useful enough to keep coming back, and in this post I want to dig into why. Before I talk about the good, I want to note some obvious issues I see with AI in its current state. I want to do this same exercise in December 2026 (and maybe every year for a few years) and revisit which issues have been fixed and which remain.
Yes, I realize future models may fix the problems I’m about to list, and that I can be part of the solution with better prompting, but I think these are still worth listing because, despite my best efforts, I still run into them most days:
Incorrect and misleading information / hallucinations even with the best-made prompts. I think made-up information right now is part of the package with AI no matter your skill in prompting. After a bunch of people told me my problem was my prompts weren’t specific enough, I invested time in courses like Google’s prompting course and MIT Sloan’s Generative AI for Business. Despite being more thoughtful with prompting, I still see made-up information or information-based-in-facts-but-terribly-misapplied on a regular basis.
AI getting stuck in thought loops or wasteful approaches. I am a sucker for watching AI think. I find it interesting, but part of me also needs to see the approach to trust the output. I’m sure I’m overindexing on the exception rather than the rule, but in Deep Research / ultrathink / any type of thinking mode, I see AI pursuing approaches that don’t help it get to a good answer, including, but not limited to, trying to access paywalled websites over and over, trying to open unresponsive applications on my computer instead of moving on, and using misleading information as the basis for further inquiry (ex. assuming a company is growing EPS at 30% when EPS changed because of one-time factors). Sometimes (most of the time?) the AI’s thinking process is completely logical, but it only takes one mistake in a chain of logic to doom all future steps. If you haven’t seen Joanna Stern’s WSJ piece on letting Anthropic run a vending machine, I highly recommend it; it’s a great example of how limited context windows and complex tasks can cause strange lines of thinking. Anthropic also did a write-up on their own experiments with Claude running vending machines, and IMO the results are similarly disastrous.
Confirmation bias and sycophantic behavior. I had multiple debugging rabbit holes on the programming side where I had a theory about a bug that turned out to be totally wrong, but ChatGPT and Claude Code supported my approach anyway, I suspect because the context window was already full of my own framing. One that stands out is when I was trying to figure out why a certain query was working on my computer but not in a cloud staging environment. I had absolutely convinced myself the issue was database config, and Claude Code was happy to agree and cite tons of documents supporting my theory, when the actual issue was an old database URL I had forgotten to change. At no point, in my experience, do Claude, ChatGPT, etc. really push back against you the way a good friend or co-worker would and offer a healthy dose of skepticism. To be clear, there are ways to mitigate this sycophantic behavior: asking the same question of an LLM without a built-up context window helps (ex. asking Gemini when you’re not logged into your Google account), as does rephrasing prompts. That said, LLMs by design are persuasive and have a history of convincing people they have stumbled upon revolutionary ideas that are actually mirages.
AI claiming mission accomplished when your task is not done. I have seen regular instances of AI producing code that doesn’t compile or creating some beautiful UI interaction that fails to write to a database (or a backend without a frontend). The only mitigation I know of here is ultra-specific instructions on verification. For making websites, this almost always is something like “Make sure you can visit localhost:3000 and click on a button, then check the database at localhost:5432 and confirm rows in the reports table exist. Take screenshots of the website showing the successful state after a click.” For investing, it means reading whatever AI produces with a fine-tooth comb and confirming all parts of your question were answered and make sense.
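If you want the check itself to be automated rather than left to the model’s self-report, the same instructions translate directly into a tiny script. Below is a minimal sketch of that localhost verification, assuming a local web app on port 3000 and a Postgres database on 5432 with a reports table; the connection details are placeholders taken from the example prompt above, not a real setup.

```python
# Minimal "definition of done" check: the task only counts as complete if the
# site responds and rows actually landed in the database. Assumes a local app
# on port 3000 and Postgres on 5432 with a `reports` table (placeholders from
# the prompt above); adjust credentials to your environment.
import requests
import psycopg2

def task_is_done() -> bool:
    # 1. The website must respond successfully.
    resp = requests.get("http://localhost:3000", timeout=5)
    if resp.status_code != 200:
        return False

    # 2. The database must contain at least one row in `reports`.
    conn = psycopg2.connect(
        host="localhost", port=5432,
        dbname="app", user="app", password="app",  # placeholder credentials
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM reports;")
            (count,) = cur.fetchone()
            return count > 0
    finally:
        conn.close()

if __name__ == "__main__":
    print("done" if task_is_done() else "not done")
```

A check like this can also be handed to the agent itself as the definition of done, rather than asking it to eyeball its own output.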
Another general issue I want to bring up here - it’s hard to have a constructive conversation about these issues with folks in love with AI, in part because AI is non-deterministic. Everyone is going to have different experiences, and you yourself are going to see different results on the same prompt, especially if that prompt is complex and has some creative license in it. Ever ask ChatGPT / Gemini / Claude / Grok to review its own output for accuracy? It is amazing to me how often AI points out issues with its own output. My view is that AI finds far more glaring issues in its own output than a human editing his or her own work would.
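That self-review step is easy to turn into a routine second pass. Here is a minimal sketch using the Anthropic Python SDK; the model id and prompts are placeholders, and the same pattern works with any of the major providers’ APIs. The point is simply that the critique happens in a fresh request, so it isn’t anchored to the conversation that produced the answer.

```python
# Two-pass pattern: generate an answer, then ask the model to critique its own
# output in a separate request (no shared context). Assumes the Anthropic
# Python SDK (`pip install anthropic`) and ANTHROPIC_API_KEY in the
# environment; the model id is a placeholder.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder; use whatever model you have access to

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

question = "Summarize the buyback history described in these filings: ..."
answer = ask(question)

# Second pass: a fresh request, so the review isn't biased by the chat that
# produced the answer.
critique = ask(
    "Review the following answer for factual errors, unsupported claims, and "
    f"parts of the question it failed to address.\n\nQuestion: {question}\n\n"
    f"Answer: {answer}"
)
print(critique)
```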
Going back to a previous point above, I think this is why an objective “definition of done” / verifiable output seems to be the silver bullet for many AI problems. Andrej Karpathy has a good post on this idea. LLMs are great at completing jobs that have black-and-white definitions of done. Karpathy is talking about reinforcement learning / training in that post, but I think the point below is equally applicable to LLM prompting:
It’s about to what extent an AI can “practice” something. The environment has to be:
resettable (you can start a new attempt),
efficient (a lot of attempts can be made),
rewardable (there is some automated process to reward any specific attempt that was made).
The more a task/job is verifiable, the more amenable it is to automation in the new programming paradigm. If it is not verifiable, it has to fall out from neural net magic of generalization fingers crossed, or via weaker means like imitation. This is what’s driving the “jagged” frontier of progress in LLMs. Tasks that are verifiable progress rapidly, including possibly beyond the ability of top experts (e.g. math, code, amount of time spent watching videos, anything that looks like puzzles with correct answers), while many others lag by comparison (creative, strategic, tasks that combine real-world knowledge, state, context and common sense).
The above is why prompting based on “Please click the button on the website and confirm XYZ happens” works so well and “Please find companies with competitive moats and sustainable earnings growth” does not. Given this huge divide between tasks that work and tasks that don’t, I am not surprised that bull / bear debates about AI seem to be increasing in intensity and that the market seems unable to make up its mind on AI’s staying power / impact.
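To make the divide concrete in code, here is a toy sketch of Karpathy’s three criteria applied to an LLM coding task: every attempt starts from a clean copy (resettable), attempts are cheap to repeat (efficient), and an automated test decides pass/fail (rewardable). The generate_patch function is a hypothetical stand-in for whatever model call you use; nothing here is a real pipeline.

```python
# Toy illustration of a resettable / efficient / rewardable loop for a task
# with an objective definition of done. The "reward" is simply whether the
# project's unit tests pass.
import subprocess
import tempfile
import shutil
from pathlib import Path

def generate_patch(task_description: str, attempt: int) -> str:
    """Hypothetical stand-in for a model call that returns candidate code."""
    raise NotImplementedError  # plug in your LLM of choice

def reward(workdir: Path) -> bool:
    """Automated pass/fail signal: the attempt is rewarded only if tests pass."""
    result = subprocess.run(["pytest", "-q"], cwd=workdir, capture_output=True)
    return result.returncode == 0

def practice(task_description: str, template_dir: Path, max_attempts: int = 10) -> bool:
    for attempt in range(max_attempts):
        # Resettable: every attempt starts from a clean copy of the project.
        workdir = Path(tempfile.mkdtemp())
        shutil.copytree(template_dir, workdir, dirs_exist_ok=True)

        # Efficient: attempts are cheap enough to run many of them.
        (workdir / "solution.py").write_text(generate_patch(task_description, attempt))

        # Rewardable: an automated check, no human judgment in the loop.
        if reward(workdir):
            return True
    return False
```

There is no equivalent reward() function for “find companies with competitive moats,” which is exactly why that kind of prompt lags behind.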
To be clear, I lean more towards the bull case and think my AI usage in 2026 will be higher than in 2025. I think I speak for many people when I say I became better this year at understanding the constraints of LLMs and using them more effectively. I even keep a little Google Doc called “AI thoughts” that I add to whenever I find strategies and prompts that work well. As I improve, the models are also improving, and a whole cohort of companies is improving as well, outperforming the base models in their domains (ex. check out Fintool’s benchmarks against ChatGPT-5).
I mentioned earlier I would discuss where I am finding AI useful. The short answer - not surprising given the earlier Karpathy quote - is that I find it useful for anything verifiable: tasks with strong supporting documentation / a clear definition of correctness I can check answers against. This is an investing Substack, so below I want to detail some investing use cases I am leaning into:
Building standard reports off SEC and ROIC.ai data. After completing a few hours of research on a company, I have a Deep Research prompt (publicly accessible Google Doc here) I run against the major LLMs. I also have a Python script that pulls in all SEC filings from the last ten years and a good amount of historical per-share data from the ROIC.ai API. Once the data is there, I ask Claude to use only the HTML and JSON files in a specific folder to do analysis. While not exactly a RAG model, the benefits I’m going for are similar in that I’m constraining Claude to a set of documents to use for analysis. From there, the script basically looks like this:
```python
def do_analysis(ticker, folder_with_data):
    ask_claude_about_proxies(ticker, folder_with_data)
    ask_claude_about_financial_statements(ticker, folder_with_data)
    ask_claude_about_risk_factors(ticker, folder_with_data)
    ...
```

Using standard reports as understanding checks and jumping-off points for more research. Generally, I expect the report to be 75% information I already know. I view this not as redundant but as a positive: it shows I’ve done my work as an investor. The other 25% tells me where I might go next for research.
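To give a flavor of the folder-constrained pattern above, here is a stripped-down sketch of what one of those ask_claude_about_* helpers could look like, using the Anthropic Python SDK. The model id, prompt wording, and file handling are placeholders rather than my actual script, and real filings generally need chunking or summarization to fit in context.

```python
# Stripped-down sketch of a folder-constrained helper: gather the HTML/JSON
# filings already downloaded to a folder and ask Claude to answer using only
# those documents. Assumes the Anthropic Python SDK; the model id is a
# placeholder, and the naive truncation below is just to keep the sketch short.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder

def ask_claude_about_risk_factors(ticker: str, folder_with_data: str) -> str:
    docs = []
    for path in sorted(Path(folder_with_data).glob("*")):
        if path.suffix.lower() in {".html", ".json"}:
            docs.append(f"--- {path.name} ---\n{path.read_text(errors='ignore')[:20000]}")

    prompt = (
        f"Using ONLY the documents below for {ticker}, summarize the risk factors "
        "and how they have changed over time. If something is not in the documents, "
        "say so rather than guessing.\n\n" + "\n\n".join(docs)
    )
    msg = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```

The important part is the “use ONLY the documents below” constraint, which does the same job a retrieval step would in a proper RAG setup.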
The proxy statement. I am finding this 25% often is in the DEF 14A (definitive proxy statement). Over the years, I’ve become more obsessed with Charlie Munger’s “Show me the incentive, and I’ll show you the outcome,” and I went through a phase where I religiously read NonGaap’s work. I watched the entire TURN-MLCI saga unfold, and that experience convinced me I needed to understand management incentives and corporate governance better. LLMs are excellent at picking out useful nuggets in a proxy. Among other things, they have given me detailed explanations of the adjusted / incentive EBITDA management is compensated on, how much stock directors must own relative to base annual salary (I have seen anywhere from no requirement to 3x to 6x+), and unexpected objectives management is comped on. For one company I analyzed, ChatGPT picked up that the C-suite was rewarded based on investor interactions! The proxy actually stated the goal was at least 400, and they hit 519. If you ever wonder why management took a meeting with you, now you know…
Control F / document search on steroids. LLMs, IMO, are significantly better than manually searching a company’s filings for a specific key term. In the thinking steps, it has been encouraging to see ChatGPT and Claude search for many variations of a key term (ex. for buybacks: share repurchase, share retirement, buyback authorization, etc.) and then stitch the context together into timelines / more complete pictures.
Marrying commentary to financials. We all know that what management says and what management does are two different things (especially on buybacks). I have been impressed with LLMs’ ability to look at earnings commentary and check it against the financial statements.
Overall, this year with AI felt very similar in many respects to my first years using Google Search or Excel. I know that take is too reductive and that AI will be more transformative than either, but the similarity is in slowly getting better at a new tool. I remember learning quoted searches in Google or VLOOKUP / OFFSET in Excel. Once you know these things, you get a lot more bang for your buck.
The “verifiable” / “definition of done” / “specify constraints” parts of AI that are on the user to get right have been a learning process for me, but I’m confident my process is getting better. Better yet, I’m getting less frustrated with AI output because I’m avoiding my mistakes of early 2025.
Happy Holidays and here’s to a great 2026 with AI being a tailwind to better investing decisions.

