Google Algorithm Leaked
seroundtable.com
It's not clear to me whether the leak is actually for Google Search or one of the products around search that isn't "Search", like Document Warehouse [1]. Is there anything definitive one way or the other in all this? Nobody even seems to be questioning this.
[1] https://cloud.google.com/document-warehouse/docs/overview
If you read the original publication on this[1], they mention there’s a stray commit publishing the internal variant of the SDK intended for the actual Google warehouse database. So the code bases probably live close enough together for someone to accidentally pass the wrong folder name or something.
This has been fixed, but the commit and all its changes are out there and, tragically, published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctioned freely copying and using the code. So there is really nothing Google can do about it.
>published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctioned freely copying and using the code. So there is really nothing Google can do about it.
This sounds good to us tech nerds, but I'm pretty sure the law doesn't work like this.
At minimum, even if (clearly accidentally) putting code next to an open source license were legally binding, the person doing the leaking was not the intellectual property owner. Google will just say, correctly, that the person who leaked the code had no right to do so, especially under a free license.
To continue this train of thought, what is stopping a company from saying "This team did not ask for permission before they published XYZ under this license" and just rescinding access to a widely used library?
At what point can you no longer say "oopsie"?
The law is JIT. We don't need an answer now, just ask a judge when it happens.
"The law is JIT" is honestly the best computer-nerd analogy for legal precedent I've ever heard.
The post title is misleading. The algorithm did not leak, only the documentation listing all the signals that can possibly be used as inputs for that algorithm. It doesn't reveal which ones are actually used and how.
This looks like it's written in Elixir (the docs are using ExDoc, Elixir's documentation tool).
This can't possibly be the actual search index rules (which is probably code that's decades old, my guess is either in Python or Java?) – unless they rewrote all of it in the past few years?
Can anyone else confirm this?
It’s not. Google uses a content warehouse database internally that holds all stored web page content, and to access this vast database, they have an API. The code discovered here is a generated SDK for Elixir for this content warehouse API.
Apparently, Google had a now deprecated product (who would have guessed that? Consider me shocked!) that provided customers with a trimmed-down version of this database for their own purposes, but mistakenly published the internal SDK code instead of that intended for Google Cloud customers to GitHub.
So while this doesn’t directly show the search index source code, it describes the data schema of the index in great detail, so there are at least some interesting educated guesses on the workings of the actual index to draw from it.
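To make the schema-versus-algorithm distinction concrete, here is a minimal Python sketch of what a generated SDK model usually amounts to. Everything in it is hypothetical: the class name, the field names, and the `decode` and `schema_summary` helpers are invented for illustration and are not taken from the leaked code, which was Elixir. The point is only that such a model lists fields and types, and contains no logic saying how any field is used.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HypotheticalDocumentRecord:
    """Invented stand-in for one generated SDK model (not a real Google type)."""
    url: Optional[str] = None
    title: Optional[str] = None
    example_signal: Optional[float] = None  # made-up field name


def decode(payload: dict) -> HypotheticalDocumentRecord:
    """Copy known keys from an API-style response into the model, ignoring the rest."""
    known = {f.name for f in fields(HypotheticalDocumentRecord)}
    return HypotheticalDocumentRecord(**{k: v for k, v in payload.items() if k in known})


def schema_summary() -> list[str]:
    """List field names and types, which is all a reader of such an SDK can learn."""
    return [f"{f.name}: {f.type}" for f in fields(HypotheticalDocumentRecord)]


if __name__ == "__main__":
    record = decode({"url": "https://example.com", "example_signal": 0.5, "other": 1})
    print(record)
    print(schema_summary())
```

Reading a model like this tells you that a field exists and what type it holds, but nothing about weights, thresholds, or whether the field feeds into ranking at all.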
[dupe]
Some more discussion: https://news.ycombinator.com/item?id=40496967