A Basket of Eggs


When a pull request arrives from an unfamiliar name, maintainers do an informal evaluation: click the profile, scan the contribution history, look for projects they recognize. That process worked when communities were small. It breaks down when coding agents can generate plausible-looking PRs at near-zero cost and Hacktoberfest alone produces 4,600+ PRs in a single month across just the repos we sampled.

Good Egg is a trust scoring tool we built to automate that evaluation. It looks at a contributor’s track record of merged PRs across the GitHub ecosystem and computes a score relative to the target project. It runs as a GitHub Action, a CLI, a Python library, and an MCP server. Every input is a decision a human maintainer already made (merge or reject). The first post covers the original model and validation in detail.

The v1 model scored contributors using a weighted graph over their contribution network. v2 added merge rate and account age via logistic regression. Both worked, but building them raised harder questions: can you detect bad actors who’ve already gotten code merged? What does the scoring model actually need? How should it handle brand-new accounts?

We spent several months investigating. v2.0.0 ships a new default scoring model (v3, which we call Diet Egg), a fresh-account advisory (Fresh Egg), and the lessons of a failed-but-informative attempt at suspension detection (Bad Egg).

Some accounts with merged pull requests later get suspended by GitHub. We wanted to know if we could detect them before GitHub does.

We checked 12,898 authors via the GitHub API; 323 (2.5%) were suspended. We focused on multi-repo authors (3,208 authors, 61 suspended) because those are the contributors Good Egg actually scores: people with merged PRs in more than one repository. The 1.9% base rate in this population set the ceiling for what any classifier could do.

We tested behavioral features (merge rate), network centrality from the bipartite author-repo graph, temporal patterns (inter-PR timing regularity), TF-IDF on PR titles, and LLM-based scoring (~31K Gemini API calls across Flash and Pro to classify PR title portfolios as organic or suspicious).
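The temporal feature is the easiest to illustrate. A minimal sketch of an inter-PR timing-regularity signal (an illustrative version under our own assumptions, not Good Egg's exact feature): the coefficient of variation of the gaps between consecutive PRs, where values near zero mean machine-like regularity and organic, bursty activity scores higher.

```python
from datetime import datetime
from statistics import mean, stdev

def timing_regularity(timestamps):
    """Coefficient of variation of gaps between consecutive PRs.

    Values near 0 mean suspiciously uniform spacing (bot-like);
    organic activity tends to be bursty, with a higher CV.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough history to measure
    return stdev(gaps) / mean(gaps)

# A bot opening a PR exactly once an hour scores 0:
bot = [f"2024-03-01T{h:02d}:00:00" for h in range(6)]
print(timing_regularity(bot))  # 0.0
```

As the table below shows, signals like this separated poorly in practice once the population was restricted to merged-PR authors.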

Multi-repo results:

\(\begin{array}{lcc} \textbf{Approach} & \textbf{AUC} & \textbf{P@25} \\ \hline \text{Merge rate} & 0.565 & 0.00 \\ \text{Network centrality} & 0.523 & 0.08 \\ \text{TF-IDF title patterns} & 0.595 & 0.16 \\ \text{LLM scoring} & 0.619 & 0.44 \\ \end{array} \)

LLM scoring was the strongest individual signal, but an AUC of 0.619 against a 1.9% base rate is barely better than chance. Of the 25 authors the LLM ranked most suspicious, 11 were actually suspended (44% precision); the other 14 were false positives. A KNN proximity detector achieved perfect AUC by mapping existing suspension networks, but it requires seed accounts and doesn’t generalize. The combined model (AUC 0.928) is dominated by that proximity signal; strip it out and the remaining features fall to near-chance levels.
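For readers unfamiliar with the two metrics in the table: AUC is the probability that a randomly chosen suspended account outranks a randomly chosen non-suspended one, and P@25 is the hit rate among the 25 top-ranked accounts. Both fit in a few lines of plain Python (a sketch for intuition, not our evaluation harness):

```python
def auc(scores, labels):
    """Rank-based AUC: probability a positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scored items."""
    top = sorted(zip(scores, labels), key=lambda t: -t[0])[:k]
    return sum(y for _, y in top) / k
```

Against a 1.9% base rate, a random ranker gets AUC ~0.5 and P@25 ~0.02, which is why 0.44 precision is a real but still unusable lift.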

The core insight: the merged-PR population is too homogeneous for suspension detection. These accounts passed human code review. Whatever signals they originally exhibited (formulaic PR titles, burst-pattern timing, concentrated repo targeting) were either absent from the ones that got merged, or present at rates indistinguishable from legitimate contributors. Once an account clears the merge filter, its behavioral profile looks like everyone else’s.

That’s actually encouraging. The suspension rate among merged-PR contributors is low (1.9%), and even campaign-associated authors (Hacktoberfest surges, etc.) show a relatively low rate (16 of 609, or 2.6%). The merge-and-review process is a real filter. The threat to open source quality is the volume of low-effort contributions that consume reviewer attention. Code slop, not code sabotage.

This shaped how we think about Good Egg. The same merged-PR lens that makes contributor scoring work is why suspension detection on that population doesn’t. And it pushed us to ask: if the graph-based model can’t distinguish suspended accounts from legitimate ones here, which components of the scoring model actually carry signal?

The Bad Egg investigation made us re-examine the v2 model. v2 combined graph score, merge rate, and account age via logistic regression. We asked: does each component earn its place for the users Good Egg actually scores?

The typical scored contributor has a handful of merged PRs across a few repos, with no prior relationship with the target project. If they already had merged PRs in the target repo, Good Egg short-circuits scoring entirely.

Graph score hurts for unknowns. The graph is powerful for prolific contributors. José Valim’s 93-repo graph is rich and informative. But a maintainer encountering José Valim in their PR queue already knows who he is. The tool is least necessary for the people it scores best.

The contributors who actually need scoring produce thin graphs. Two or three nodes, a few edges, no meaningful community structure. The graph score for these users is dominated by initialization artifacts and normalization noise. Merge rate, a simple ratio with no graph dependency, provides a cleaner signal. We measured this across contributor tiers:

\(\begin{array}{lccc} \textbf{Tier} & \textbf{merge\_rate only} & \textbf{merge\_rate + graph} & \textbf{Delta} \\ \hline \text{All medium+} & 0.516 & 0.408 & -0.108 \\ \text{Large (500-1999 PRs)} & 0.553 & 0.484 & -0.069 \\ \text{XL (2000+ PRs)} & 0.533 & 0.405 & -0.128 \\ \end{array}\)

Adding the graph score reduced predictive accuracy at every tier. The graph was the most complex part of the system: bipartite construction, personalized graph scoring, language normalization, edge weighting, anti-gaming penalties. All of that infrastructure was actively hurting predictions for the primary use case.

Account age adds nothing once you have merge rate. DeLong test p > 0.07 at every temporal cutoff from 30 days to 10 years. Once you condition on merge rate, account age is redundant. An account with a 78% merge rate has already demonstrated sustained productive activity; the merge rate implicitly encodes the longevity information that account age provides explicitly.
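DeLong's test is the standard analytic way to compare correlated AUCs measured on the same sample. A paired bootstrap conveys the same idea and is easier to show; this sketch (illustrative, not the test we actually ran) resamples cases and asks how often the nominally better model fails to win:

```python
import random

def _auc(scores, labels):
    """Rank-based AUC (probability a positive outranks a negative)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """Fraction of bootstrap resamples where model A's AUC fails to beat
    model B's. Values near 0.5 mean the observed AUC gap is
    indistinguishable from noise; DeLong's test answers the same
    question analytically instead of by resampling.
    """
    rng = random.Random(seed)
    n = len(labels)
    ties_or_worse = valid = 0
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:  # a resample must contain both classes
            continue
        valid += 1
        a = _auc([scores_a[i] for i in sample], ys)
        b = _auc([scores_b[i] for i in sample], ys)
        ties_or_worse += a <= b
    return ties_or_worse / valid
```

The crucial detail is that both models are scored on the *same* resample each iteration, which is what "correlated AUCs" means here.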

v3 (Diet Egg): score = merged / (merged + closed). No graph construction, no regression coefficients. ~20 lines of code. Now the default.

merge_rate = merged_count / (merged_count + closed_count)

The result is directly interpretable: 0.78 means 78% of this person’s PRs got merged. When a maintainer asks “why did this person score X?”, the answer is always the same: that’s their merge rate across all public repos.

The original post described how v2 “rescued” Guillermo Rauch using his 17-year account age. v3 scores him HIGH on merge rate alone (78%). The rescue wasn’t needed.

The v1 and v2 scoring models remain available via configuration for users who want graph-based analysis of prolific contributors. v3 is the recommended default because the typical use case (evaluating an unknown contributor on a new PR) is where the simpler model wins.

Account age carries real information (the original post showed it was LRT-significant). But baking it into the score muddies what the score means. A score of 0.7 could mean “good merge rate” or “mediocre merge rate from an old account.” Factoring it out preserves the score’s clarity: the number always means one thing.

So we moved account age to an informational advisory. The Fresh Egg advisory flags accounts younger than 365 days with a notice like “Account created 47 days ago.” It’s context, not a penalty.

Changepoint analysis found a merge-rate transition at ~3.9 years of account age, but that threshold would flag too many active contributors. A 1-year threshold catches genuinely new accounts where the low-merge-rate pattern is strongest. Fresh accounts under 365 days show merge rates ~16 percentage points lower than the population average.

The advisory appears across all output formats without affecting the trust score. For bots, the BOT classification takes precedence, so the advisory is omitted. A reviewer who sees “HIGH trust, fresh account” knows the contributor has a strong track record that happens to be recent; “LOW trust, fresh account” gives two independent reasons for caution. Either reading is clearer than a single blended number.
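The advisory logic as described reduces to a few lines. A hedged sketch (the names, threshold constant, and ISO-timestamp input are our assumptions, not Good Egg's actual interface):

```python
from datetime import datetime, timezone

FRESH_THRESHOLD_DAYS = 365

def fresh_egg_advisory(created_at, is_bot, now=None):
    """Return an informational notice for young accounts, or None.

    The advisory never changes the trust score, and is suppressed
    for accounts already classified as BOT.
    """
    if is_bot:
        return None
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    age_days = (now - created).days
    if age_days < FRESH_THRESHOLD_DAYS:
        return f"Account created {age_days} days ago"
    return None
```

Keeping this out of the score, as a side-channel string, is what preserves the "the number always means one thing" property.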

Contributor scoring answers a specific question well: is this person an established open-source contributor? v3 answers it more simply and more accurately than the graph-based models for the users who need it most. But contributor history is one dimension of a multi-dimensional problem.

Content scoring is where the leverage is. We’ve seen much stronger predictive power from understanding repo-specific fingerprints: what patterns of contribution succeed in a specific codebase. A contributor with a great track record can still submit a PR that doesn’t fit the project’s conventions, and a newer contributor can submit one that does. We’re building tools that evaluate contributions against the norms of the target repo, scoring the PR itself rather than (or in addition to) the person who submitted it.

New contributors need better on-ramps. Good Egg’s cold-start problem (new contributors score LOW by design) reflects a real gap in the ecosystem. Reputation scores measure established contributors well, but they offer nothing to someone making their first PR. That’s a problem if coding agents lower the barrier to generating a contribution while the trust model still penalizes unfamiliar names. We’re building tools to help newer contributors demonstrate competence, going beyond measuring established reputation toward actively guiding first-time contributors toward high-quality submissions. If this resonates with how your team reviews PRs, we’d love to hear from you; feel free to reach out.

The bigger picture. The low suspension rate among merged-PR authors is encouraging. Open source’s review process works as a filter for genuinely malicious actors. The challenge now is scaling review workflows to handle the volume that coding agents produce, while maintaining quality and keeping the door open to new contributors. That requires better tools at every layer: contributor context, content analysis, and workflow automation. Good Egg handles the first layer. We’re building the rest.

Good Egg v2.0.0 is on PyPI. pip install good-egg or try without installing:

GITHUB_TOKEN=... uvx good-egg score username --repo owner/repo

Source and methodology: github.com/2ndSetAI/good-egg

Cover image: Eggs in basket by George Chernilevsky, CC BY-SA 4.0, via Wikimedia Commons