What is the company about? What are you building?
The Data Engineering team at Wikimedia maintains what would otherwise be a traditional data pipeline: product teams use it to gather the data they need to guide product execution. What is different in the open knowledge ecosystem is that much of the metadata about Wikipedia is also public. This means you can get a myriad of datasets: pageviews for every article, unique devices, editors per country, and so on.
All this data is delivered every hour and has been for more than 15 years. It is not an overstatement to say that Wikipedia's open datasets have played a key role in the development of many technologies we take for granted, like NLP. Maintaining all the data delivery pipelines for a system that at peak gets 200,000 requests per second is, needless to say, quite a task.
Can you describe the high-level data architecture? Wikipedia is pretty open and I know you use Cassandra, Druid, and Kafka; how do those work together? In other words, what's the data path from a GET request to a Wikipedia page to something you can render on a graph?
Something that is not well known is that the whole Puppet repo that describes Wikipedia's infrastructure is public, and you can take a look at it on GitHub. To this day I find it impressive that every single piece of infra of all of Wikipedia's systems is described there. We also provide pretty pictures of the stack.
The data stack is very standard. As you mentioned, we have Kafka for data intake and Hadoop for persistent storage, and from those two the data gets ingested into Cassandra or Druid. There are different pipelines with different bells and whistles: some of them have a JSON schema registry, and others parse data out of the HTTP requests directly. At Wikipedia, 97% of requests at any given time are cached and thus served from the CDN via Varnish. So, in order to gather all this data, we need Varnish to be able to talk to Kafka; we have a custom piece called varnishkafka that is a source of joy (#not) for our SRE team. The migration to Apache Traffic Server will eventually render this piece of infra obsolete.
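To make that intake path a bit more concrete, here is a minimal sketch of what consuming the raw request stream from Kafka could look like. The topic name and JSON field names are assumptions for illustration, not Wikimedia's exact schema.

```python
# Minimal sketch: read raw request events from Kafka and count pageviews.
# The topic name and JSON fields are illustrative assumptions, not the exact
# schema Wikimedia uses.
import json
from collections import Counter

from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "webrequest",                          # assumed topic name
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

pageviews_by_country = Counter()
for message in consumer:
    event = message.value
    # Keep only successful page requests; field names are assumptions.
    if event.get("http_status") == 200 and event.get("is_pageview"):
        pageviews_by_country[event.get("country_code", "??")] += 1
```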
There are two distinct sources of data: data from readers and data from editors. While reader data is very high volume, it is quite "simple": it can be thought of as simple pageviews. Edit data, however, is a lot more complicated, and in order to harvest it properly we developed a Lambda-ish architecture. We source this data in two ways. Once a month we pull from the MediaWiki databases directly and, via Spark, after about two days of processing on a ~60-node Hadoop cluster, we create large denormalized tables. At the same time, we have event streams that publish every edit event on Wikipedia as a JSON blob in real time, and those also get persisted on Hadoop. We made a very conscious decision to use JSON rather than Avro in most of our data pipelines and, since then, the Kafka ecosystem has moved towards being a lot friendlier to JSON. That makes sense, because JSON is a lot easier to debug.
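As a rough illustration of the batch side of that Lambda-ish setup, the sketch below joins a monthly MediaWiki snapshot with the persisted JSON edit events into one denormalized table. The paths, table names, and columns are hypothetical and stand in for whatever the real schemas look like.

```python
# Rough PySpark sketch of the batch side: join a monthly MediaWiki snapshot
# with persisted JSON edit events into one denormalized table.
# All paths, table names, and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("denormalize-edit-history").getOrCreate()

# Monthly snapshot pulled from the MediaWiki databases (assumed Hive tables).
revisions = spark.table("mediawiki_snapshot.revision")
pages = spark.table("mediawiki_snapshot.page")

# Real-time edit events, persisted on Hadoop as JSON (assumed path).
edit_events = spark.read.json("/wmf/data/events/mediawiki_edit/")

denormalized = (
    revisions
    .join(pages, on="page_id", how="left")
    .join(edit_events.select("revision_id", "performer_is_bot"),
          on="revision_id", how="left")
    .withColumn("snapshot", F.lit("2021-01"))  # assumed snapshot label
)

# Write one large denormalized table, partitioned by snapshot month.
denormalized.write.mode("overwrite").partitionBy("snapshot").parquet(
    "/wmf/data/wmf/edit_history_denormalized/"
)
```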
The visualization layer is Superset, a nifty UI tool that can be used against Druid but also against Presto, a very fast query engine developed by Facebook that can be deployed on top of Hadoop and is fully ANSI SQL compliant.
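For a sense of what that query layer looks like from a client's point of view, here is a small sketch using the Presto Python client, the same engine Superset points at. The coordinator host, catalog, schema, and table name are assumptions for illustration.

```python
# Small sketch: query an (assumed) pageviews table through Presto.
# Host, catalog, schema, and table/column names are illustrative assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.org",  # assumed coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="wmf",
)
cur = conn.cursor()
cur.execute("""
    SELECT country_code, SUM(view_count) AS views
    FROM pageviews_hourly           -- assumed table name
    WHERE year = 2021 AND month = 1
    GROUP BY country_code
    ORDER BY views DESC
    LIMIT 10
""")
for country, views in cur.fetchall():
    print(country, views)
```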
Wikipedia does not use any cloud provider; how does that change the way the architecture is designed?
Everything is developed from the ground up, so we favor open source solutions that have been tried and tested at scale. Much of Wikipedia's data pipeline exists thanks to the open source efforts of Facebook and LinkedIn.
Also, you only use open source software; how do you pick the right tool for the job? Are software licenses a limitation? Could you use Mongo or Elastic, for example?
If we need a piece of software that, for example, pulls data from Kafka, we survey the ecosystem and look at what exists that is tried, tested, and fully open source. We do a couple of small prototypes and evaluate the results. See, for example, our recent spike on Airflow and others. Licenses that do not provide the same level of freedom as CC0 are a problem. We cannot, for example, use Kafka Connect connectors whose licenses are too restrictive. We have been using Elastic since 2014, but with their new license, that might need to change.
What are some of the hardest data problems you’ve had to solve so far?
Something I have learned in these 20 years of work is that most of the really hard problems have to do with people rather than technology.
Now, I would say the hardest problems in the data realm in 2021 have to do with privacy. At Wikipedia, privacy is paramount (there cannot be truly free access to knowledge without a guarantee of privacy), and we had to "invent" methods to calculate, in privacy-conscious ways, metrics that are the norm for web properties, like Monthly Active Users. You can see how we did it. Still, the hardest problem was probably communicating effectively how much we care about privacy.
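As a toy illustration of the kind of privacy-conscious counting involved, the sketch below counts monthly uniques from nothing more than a coarse "last seen" date carried by the client, rather than from any stored identifier. This only illustrates the general idea; it is not Wikimedia's actual implementation, and the handling of cookie-less clients is deliberately simplified.

```python
# Toy sketch of privacy-conscious unique counting: instead of storing a device
# identifier server-side, each request carries only a coarse "last access"
# date. A device counts as a monthly unique the first time it shows up in a
# given month, i.e. when its last-access date is older than the current month.
# Illustrative only; not Wikimedia's actual implementation.
from datetime import date
from typing import Optional

def is_new_unique_this_month(last_access: Optional[date], today: date) -> bool:
    """True if this request should be counted as a new unique for the month."""
    if last_access is None:
        return True  # never seen before (or no marker at all): count it
    return (last_access.year, last_access.month) < (today.year, today.month)

# Aggregate counting without ever storing who the devices are.
requests = [date(2020, 12, 30), None, date(2021, 1, 2), date(2020, 11, 5)]
today = date(2021, 1, 15)
monthly_uniques = sum(is_new_unique_this_month(d, today) for d in requests)
print(monthly_uniques)  # 3
```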
What are the hard ones still unsolved?
Data quality issues are kind of a "fractal" problem: you are never done eliminating them completely. Here is an example of a data quality problem that was invisible to data-throughput alarms, and here is the first idea we had on how to partially solve issues like these (spoiler: entropy counts). These examples also make apparent how other quality issues of a more complex nature could slip by.
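A hedged sketch of the entropy idea: compute the Shannon entropy of, say, the per-country distribution of traffic each hour and alarm when it moves far from a recent baseline. This catches distribution shifts that overall throughput alarms miss. The grouping field, baseline, and threshold here are assumptions, not the exact check Wikimedia runs.

```python
# Sketch of an entropy-based data quality check: the overall request count can
# look normal while the *distribution* (e.g. per-country share of traffic)
# collapses. Shannon entropy of that distribution drops sharply in such cases.
# The baseline and threshold below are illustrative assumptions.
import math
from collections import Counter

def shannon_entropy(counts: Counter) -> float:
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def entropy_alarm(hourly_counts: Counter, baseline_entropy: float,
                  tolerance: float = 0.5) -> bool:
    """Alarm when this hour's entropy deviates too far from the baseline."""
    return abs(shannon_entropy(hourly_counts) - baseline_entropy) > tolerance

# Example: one country suddenly dominates the traffic mix.
normal_hour = Counter({"US": 500, "DE": 300, "IN": 200})
weird_hour = Counter({"US": 950, "DE": 30, "IN": 20})
baseline = shannon_entropy(normal_hour)
print(entropy_alarm(normal_hour, baseline))  # False
print(entropy_alarm(weird_hour, baseline))   # True
```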
How did the architecture change during your six years at Wikipedia?
I would say it is only recently that streaming, or event-based, architectures have become an achievable reality; little by little we are moving towards a world with more streaming services.
Tell us some interesting numbers (traffic, rows, gigabytes you store, process, infra costs and so on)
Wikipedia's budget is public: the site and the roughly 400 people on staff, plus a number of projects and chapters, are run on a budget of about $100 million per year. The hardware costs of the whole data stack are minimal, less than a million dollars per year for about 3 PB of data. We process and aggregate a lot of data, but, in accordance with our privacy policy, a lot of it is also discarded.
What's the most interesting use case you have solved with data? I know you have some pretty interesting blog posts about it.
My two favorite projects are more data science-y than data engineering-y, and in both I collaborated with our team of researchers, who are the ones that really come in to gauge whether the engineering ideas are mathematically sound.
Bot detection: it took us a bit, but we were finally able to flag a lot of the automated traffic that Wikipedia gets and tag it as such in our data releases and internal reports. Some of this traffic is benign (in the sense that it only consumes resources). Here is an example. Other traffic, however, is not so innocuous. There are many attempts to manipulate Wikipedia's top-pageview lists, for example. Most recently, someone was trying to push a bunch of obscene terms onto the top pageview list of the Hungarian Wikipedia.
While at first glance this seems like a simple classification problem, getting a labeled dataset to do prediction is actually not that simple. It requires building pseudo-sessions on top of data that has no common identifiers in order to know that two requests came from the same entity. We ended up using heuristics rather than ML, and things worked out pretty well. You can read about the details here.
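A minimal sketch of the pseudo-session idea, with a deliberately simple heuristic: group requests by a hash of coarse request properties within a time window, then flag groups whose request rate no human would produce. Both the grouping key and the threshold are illustrative assumptions, not the actual heuristics used at Wikimedia.

```python
# Minimal sketch of pseudo-sessions plus a rate heuristic for bot flagging.
# There is no user identifier, so requests are grouped by a hash of coarse
# properties (here: IP prefix + user agent) within an hourly window. Sessions
# with an implausibly high request rate get flagged. The grouping key and
# threshold are illustrative assumptions.
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Request = Tuple[str, str, int]  # (ip, user_agent, unix_timestamp)

def pseudo_session_key(ip: str, user_agent: str, ts: int) -> str:
    window = ts // 3600  # hourly window
    raw = f"{ip.rsplit('.', 1)[0]}|{user_agent}|{window}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def flag_bot_sessions(requests: Iterable[Request],
                      max_requests_per_hour: int = 600) -> Dict[str, bool]:
    counts: Dict[str, int] = defaultdict(int)
    for ip, ua, ts in requests:
        counts[pseudo_session_key(ip, ua, ts)] += 1
    return {key: count > max_requests_per_hour for key, count in counts.items()}
```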
Censorship alarms: we wanted to identify events in which active censorship of Wikipedia sites is ongoing. Wikipedia was blocked in Turkey for years and is today blocked in mainland China. Besides these, countries like Iran block Wikipedia on and off. Looking at the problem in detail, we realized it had a lot in common with data quality problems, so we used some of the same techniques to alarm when it "seemed" that a country had an anomalous traffic pattern. Again, this is harder than it looks at first sight, because you do not want to alarm unnecessarily: uniform traffic drops do not constitute an event, but anomalous traffic drops do. See how we did it here.
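A rough sketch of the shape of such an alarm: compare a country's traffic against the rest of the world, so that a uniform global drop does not fire while a country-specific collapse does. The baseline window, normalization, and threshold are assumptions for illustration, not the detectors Wikimedia actually built.

```python
# Rough sketch of a censorship-style alarm: look at a country's *share* of
# global traffic rather than its raw count, so a uniform global drop (e.g. a
# holiday) does not fire, while a country-specific collapse does.
# Baseline window and threshold are illustrative assumptions.
from statistics import median
from typing import List, Tuple

def country_share(country_views: int, global_views: int) -> float:
    return country_views / global_views if global_views else 0.0

def censorship_alarm(history: List[Tuple[int, int]], current: Tuple[int, int],
                     drop_ratio: float = 0.5) -> bool:
    """history and current are (country_views, global_views) per hour."""
    baseline = median(country_share(c, g) for c, g in history)
    return country_share(*current) < drop_ratio * baseline

# Uniform global drop: the country's share stays stable, no alarm.
print(censorship_alarm([(100, 10_000)] * 24, (50, 5_000)))   # False
# Country-specific collapse: the share falls sharply, alarm.
print(censorship_alarm([(100, 10_000)] * 24, (5, 9_500)))    # True
```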
What are the most interesting, unexpected, or challenging lessons that you have learned? I guess there are plenty of them related to user privacy. How does that compare to a "regular company"? What things do you need to be careful about when dealing with data?
Lessons learned are many. In environments where the work is principle-based rather than profit-based, defining metrics is not easy. When everyone cares a lot about the larger mission of their work, there are going to be a lot of strong discussions; once consensus is reached, execution is fast with zero management oversight. Also, principles are very useful when choices have to be made, so in a way working in a strongly principle-based organization makes some choices easy (e.g., only use open source, preserve the right to fork) even if those choices imply a lot of work.
How was your team organized: people, roles? What was your objective as a team? I guess one of them would be to create internal tools, but I know you also have some external tools and data.
The larger Technology team reports to the CTO, and there are several parallel teams: Security, SRE, Data Engineering, Performance... Those teams work independently but coordinate among themselves. The mission of the Data Engineering team is to serve internal customers but also the large external community. I think for a team in an organization with as large an impact as Wikimedia, it is crucial to have a mission statement; otherwise it is easy to get lost in the many (infinite, really) requests for work. Our mission was to "empower and support data informed decision making across the Foundation and the Community. We make Wikimedia related data available for querying and analysis to both Wikimedia employees and the different Wiki communities and stakeholders."
Notice that this mission involves serving a large group of stakeholders: the community of editors, who do not work at the Wikimedia Foundation.