Google Algorithm Leaked

90 points by certifiedloud 2 years ago · 11 comments


It's not clear to me whether the leak is actually for Google Search or one of the products around search that isn't "Search", like Document Warehouse [1]. Is there anything definitive one way or the other in all this? Nobody seems to even be questioning this.

[1] https://cloud.google.com/document-warehouse/docs/overview

9dev 2 years ago

If you read the original publication on this[1], they mention there’s a stray commit publishing the internal variant of the SDK intended for the actual Google warehouse database. So the code bases probably live close enough together for someone to accidentally pass the wrong folder name or something.
This has been fixed, but the commit and all its changes are out there, and tragically, published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctioned freely copying and using the code. So there is really nothing Google can do about it.
[1] https://ipullrank.com/google-algo-leak
- lambdaxyzw 2 years ago
  
  >published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctioned freely copying and using the code. So there is really nothing Google can do about it.
  This sounds good to us tech nerds, but I'm pretty sure the law doesn't work like this.
  At minimum, even if (clearly accidentally) putting code next to an open source license would be legally binding, the person doing the leaking was not the intellectual property owner. Google will just, correctly, say that the person who leaked the code had no right to do so, especially under a free license.
  - natpalmer1776 2 years ago
    
    To continue this train of thought, what is stopping a company from saying "This team did not ask for permission before they published XYZ under this license" and just rescinding access to a widely used library?
    At what point can you no longer say "oopsie"?
    
    bqmjjx0kac 2 years ago
    
    The law is JIT. We don't need an answer now, just ask a judge when it happens.
    
    natpalmer1776 2 years ago
    
    The law is JIT is honestly the best computer-nerd analogy for legal precedent I've ever heard.

avallach 2 years ago

The post title is misleading. The algorithm did not leak, only the documentation listing all the signals that can possibly be used as inputs for that algorithm. It doesn't reveal which ones are actually used and how.

atonse 2 years ago

This looks like it's written in Elixir (the docs are using ExDoc, Elixir's documentation tool).

This can't possibly be the actual search index rules (which is probably code that's decades old, my guess is either in Python or Java?) – unless they rewrote all of it in the past few years?

Can anyone else confirm this?

9dev 2 years ago

It’s not. Google uses a content warehouse database internally that holds all stored web page content, and to access this vast database, they have an API. The code discovered here is a generated SDK for Elixir for this content warehouse API.
Apparently, Google had a now-deprecated product (who would have guessed that? Consider me shocked!) that provided customers with a trimmed-down version of this database for their own purposes, but it mistakenly published the internal SDK code to GitHub instead of the code intended for Google Cloud customers.
So while this doesn’t directly show the search index source code, it describes the data schema of the index in great detail, so there are at least some interesting educated guesses on the workings of the actual index to draw from it.

ChrisArchitect 2 years ago

[dupe]

Some more discussion: https://news.ycombinator.com/item?id=40496967
