Data Science Weekly - Issue 647


Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

Sponsor Message

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization, using leading industry technology you can apply in your career. Learn more.


* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

And now…let’s dive into some interesting links from this week.

  • Estimating π with a Coin
    We describe a simple Monte Carlo method for estimating π by tossing a coin. Although the underlying Catalan-number series identities appear implicitly in the probability theory literature, the interpretation of π/4 presented here seems to be new…
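
    The article's Catalan-number construction isn't reproduced here, but as a rough sketch of the genre, one classic way to estimate π from nothing but fair coin flips is to use the flips as random bits, place uniform points in the unit square, and count how many land inside the quarter circle (the names below are illustrative, and this is not necessarily the article's method):

```python
import random

def coin():
    """A fair coin flip: 0 or 1 (stand-in for a physical coin)."""
    return random.randrange(2)

def uniform_from_coins(bits=32):
    """Build a uniform number in [0, 1) from coin flips (binary expansion)."""
    x = 0
    for _ in range(bits):
        x = 2 * x + coin()
    return x / 2**bits

def estimate_pi(n=200_000):
    """Monte Carlo: fraction of random points in the quarter disc is ~ pi/4."""
    hits = sum(uniform_from_coins()**2 + uniform_from_coins()**2 <= 1.0
               for _ in range(n))
    return 4 * hits / n

print(estimate_pi())
```

    With n samples the standard error shrinks like 1/√n, so expect roughly two correct digits from a few hundred thousand samples.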

  • UniGeoCLIP - Unified Geospatial Contrastive Learning
    Geospatial understanding requires reasoning across fundamentally different kinds of data -- a satellite view from above, a street photo at ground level, a 3D elevation map, a text description of a neighborhood, and a pair of GPS coordinates. These modalities are complementary: each one captures something the others miss…UniGeoCLIP is the first contrastive framework to jointly align all five modalities into a single unified embedding space, enabling seamless retrieval and reasoning across any combination of inputs, without relying on a privileged “pivot” modality…

  • Getting Into AI Infra
    My next blog post will be about my 10-year Silicon Valley AI tour of duty — NVIDIA -> Tesla AI -> X -> DeepMind -> OpenAI…but I wanted to start with something more practical: how to get into AI infra. I joined NVIDIA in 2016, back when the Deep Learning Systems team still fit in one room. “AI infra” was not really a ‘thing’ yet. When clmt started the ‘AI Infra’ org I was the first engineer to join. This post is my version of how I made my way there, and a fun way to catch up…


  • Are predictive models enough?
    In one of the social media discussions about causal inference, the suggestion was made that predictive models are all you need: a good predictive model gives you all the conditional distributions you could want, and you don’t need any special causal inference stuff…I think there’s something to this point of view, but there are a few limitations…

  • How many production ML/AI projects do you complete in a year? [Reddit]

    Wondering what it looks like at other companies. I usually deliver around 3 or 4 ML/AI projects each year. I’m also expected to do multiple analyses separate from this so I’m not only focused on ML/AI. We have a small team of 7 people and we rarely collaborate on projects. What is it like at your company?…

  • On creating ‘new knobs of control’ in biology
    How can we regain more control over our poorly built physiology? Or, in other words, how do we install more ‘knobs of control’? I had this question too! It turns out there are a lot of emerging therapeutic modalities that fit these criteria, and I decided to turn my research on them into an essay…

  • A Better R Programming Experience Thanks to Tree-sitter
    R tooling around Tree-sitter is how you get

    • reformatting through Air and linting through Jarl;

    • auto-completion or help on hover in the Positron IDE;

    • better search for R on GitHub;

    • and more!

    In this post, we’ll explain what Tree-sitter is, and how tools built on Tree-sitter can benefit your R development workflow…

  • Why Traditional Statistics Still Matters in the Age of AI
    In this episode, Prof. Rob Hyndman joins Dr. Genevieve Hayes to make the case for why rigorous statistical thinking remains indispensable in the age of AI, and what data scientists give up when they abandon it…You’ll discover:

    1. Why throwing data at an LLM is no substitute for building a model that understands the problem

    2. How combining classical statistics and machine learning can produce better forecasting results than either approach alone

    3. What data scientists lose when they stop thinking probabilistically - and why it matters for decision making

    4. Where to start if you want to strengthen your statistical foundations…

  • Raft is so fetch: The Raft Consensus Algorithm explained through “Mean Girls”
    Raft is a consensus algorithm used in distributed systems to ensure that data is replicated safely and consistently. That sentence alone can be confusing…In fact, I’ve seen conversations recently on social media in which actual technical leaders of infrastructure companies demonstrate a lack of understanding (!). Point being, you’re not alone. Get in, losers, we’re going back to (Hollywood) high school…

  • Succinct Data Structures: Cramming 80,000 words into a Javascript file
    Let’s continue our short tour of data structures for storing words. Today, we will over-optimize John Resig’s Word Game. Along the way, we shall learn about a little-known branch of computer science called succinct data structures…
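
    For context on what “succinct” buys you: the workhorse primitive in this field is rank(i), the number of 1-bits before position i, answered in (near-)constant time using only a small amount of extra space on top of the raw bits. A toy two-level sketch (class name and block size are illustrative, not from the post):

```python
class RankBitVector:
    """Toy succinct-style bit vector: fast rank via precomputed block counts."""
    BLOCK = 64  # bits per block (illustrative choice)

    def __init__(self, bits):
        self.bits = bits
        # Cumulative 1-count at each block boundary: o(n) extra words.
        self.block_ranks = [0]
        for i in range(0, len(bits), self.BLOCK):
            self.block_ranks.append(
                self.block_ranks[-1] + sum(bits[i:i + self.BLOCK]))

    def rank(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.BLOCK
        return self.block_ranks[b] + sum(self.bits[b * self.BLOCK:i])
```

    Real implementations add a second level of sub-block counts and use hardware popcount on machine words, so the within-block scan effectively disappears.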

  • Components of A Coding Agent
    In this article, I want to cover the overall design of coding agents and agent harnesses: what they are, how they work, and how the different pieces fit together in practice. Readers of my Build a Large Language Model (From Scratch) and Build a Large Reasoning Model (From Scratch) books often ask about agents, so I thought it would be useful to write a reference I can point to…In this article, I lay out six of the main building blocks of a coding agent…

  • State of AI: April 2026 newsletter

    Welcome to the latest issue of the State of AI, an editorialized newsletter that covers the key developments in AI policy, research, industry, and start-ups from February 1 to April 7, 2026…

  • Regression should predict full distributions
    While approaches like linear regression can output full predictive distributions, these often come with (too) strong distributional assumptions. What if we always worked with machine learning models that produce the full predictive distribution? With classification, we are already at this point: Modern machine learning approaches output not just the majority class, but a probability for each class. Whether this probability is calibrated is another question. With regression, we are a bit stuck with a point-based mindset. However, this could change with tabular foundation models. At least in theory: While these models produce the full predictive distribution (or at least a discretized approximation over a fixed support), it’s not the default and the output is a bit hidden…
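
    As a toy illustration of producing a full predictive distribution over a fixed support (a k-nearest-neighbor histogram, purely illustrative and not the tabular foundation models the post has in mind):

```python
import numpy as np

def predictive_histogram(X_train, y_train, x_new, bins, k=50):
    """Discretized predictive distribution p(y | x_new): a normalized
    histogram of the k nearest neighbors' targets over a fixed support."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(d)[:k]
    counts, _ = np.histogram(y_train[nn], bins=bins)
    return counts / counts.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = X[:, 0] + rng.normal(0, 0.1, size=2000)  # toy data: y = x + noise

bins = np.linspace(-1.5, 1.5, 31)            # fixed support, 30 cells
p = predictive_histogram(X, y, np.array([0.5]), bins)
```

    Any point estimate (mean, median, quantiles) can then be read off the distribution, e.g. the mean as the bin midpoints weighted by p.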

  • Maze Classification

    Mazes in general (and hence algorithms to create Mazes) can be organized along seven different classifications. These are: Dimension, Hyperdimension, Topology, Tessellation, Routing, Texture, and Focus. A Maze can take one item from each of the classes in any combination…

  • What is a data analysis mistake you made early in your career that you will never make again? [Reddit]
    For those who are working as data analysts or learning analytics, what’s one mistake you made early on that taught you a big lesson? Could be technical, communication, dashboards, SQL, Excel, anything. I think beginners like me could learn a lot from real experiences…

  • What I Learned Building Attention Residuals from Scratch
    Naively reimplementing a paper in PyTorch changed how I think about how transformers route information, and about the gap between academic math and physical silicon…I wanted to understand how transformers actually route information between layers. Not at the level of “attention computes weighted averages,” but at the level of what physically happens to a tensor as it moves through the network. What gets preserved, what gets overwritten, and why…

** Please take a look at last week's issue #646 here.


  1. Go deeper each week (paid subscription)
    Get 3 additional posts per week designed to help you:

    • Statistics → understand the math behind ML

    • AI Agents → build with modern AI tools

    • Career → become more valuable at your job

    👉 Upgrade for $10/month — cancel anytime

  2. Looking to get a job?
    A practical guide to landing your first (or next) data science role, based on thousands of reader questions.
    👉 Check out our “Get A Data Science Job” Course

  3. Promote your organization/project/event to ~68,500 subscribers
    Sponsor this newsletter and reach a highly engaged data science audience (30–35% open rate).
    👉 Reply to this email to learn more

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian
