Ask HN: How do you deploy and scale ML (±DL) models?
Broad answers to the broad question are OK too (everyone will benefit from them), but more specifically:
1. Why is there so little "unbiased" info about deploying/serving ML models in production? (I mean, except for the official docs of frameworks like e.g. TensorFlow, which obviously suggest their mothership's own services/solutions.)
2. Do you hand-code microservices around your TF or PyTorch (or sklearn / homebrewed / "shallow" learning) models?
3. Do you use TensorFlow Serving? (If so, is it working fine for you with PyTorch models too?)
4. Is using Go infra like e.g. the Cortex framework common? (I keep reading about it, love the idea, and I'd love to use a static language here, just not Java, but I've talked with noooone who's actually used it.)
5. And going beyond the basics: is there any good, established recipe for deploying and scaling models with dynamic re-training (e.g. the user app exposes something like a "retrain with params X + Y + Z" API action, callable in response to user actions, i.e. users control training too) that does not break horribly with more than tens of users?
P.S. Links to any collections of "established best practices" or "playbooks" would be awesome!

1. By unbiased, do you mean opinionated? The MLOps industry is still in its very early stages and there's no single standard. Every dev and company has come up with an implementation, but there are so many tiny little use cases that new implementations keep springing up. The closest thing to a standard is a Docker/Kubernetes flavour.

2. Hand-coding is fine to begin with, but as the number of production models grows and you actually have to productionize them at scale, it becomes infeasible and leads to plenty of maintenance issues. There are a few model-infrastructure tools that help with this, but again, many are homegrown because the market is still new. Algorithmia and Seldon are pretty good starts.

3. We rarely use the serving options provided, as the challenge is integrating them with the rest of engineering. Service monitoring gets handled by different teams.

4. Depends on the industry and use case. Again, integration and maintenance come into play. Go/Cortex might make sense, but a lot of companies leverage Spark, so Scala/Java could be the choice for production models.

5. We're working on creating this recipe for enterprises. I believe Seldon (open source) might contain this capability. The challenge, as you pointed out, is ensuring things don't break!

Cortex contributor here / the guy who wrote that article about using Go. The project is on the young side, so we don't have the "footprint" of older projects yet, but if you want to talk to people deploying models with Cortex I'd recommend checking out our Gitter channel: https://gitter.im/cortexlabs/cortex All of our core contributors plus a good number of users are in there, and we're all happy to chat.

Thanks! Will dig more into Cortex and/vs MLflow now ;)

We're building an internal platform for that. Description in my profile. Please get in touch if you'd like to know more or play with it; we'd love your feedback. We have given access to about thirty students to prepare their final-year projects in vision, NLP, etc.

We've been doing consulting for more than six years, and we're building a platform precisely to solve the problems we have encountered and that you are writing about. We have learned some things that we are encoding in the platform, in case you want to build your own. We started doing this because we hit a ceiling on the number of projects we could take on, and we were under stress. We're a tiny, tiny team.

The problems are in the interfaces between different roles, with each role having a stack with a gazillion tools, and a different "language" they speak and universe they live in. The stitching together of people's interactions, the workflow, the business problems, and the fragmented tooling is what's problematic. The inflexibility of the tooling and frameworks you mention also meant we couldn't use them, or other platforms. This is why we are working hard to build a coherent, integrated experience, while still trying to build abstractions that let us substitute tools and treat them as simple components, so we're not tied to any of them.

For now, it allows you to create a notebook from one of several images with most libraries pre-installed. The infra it's deployed on provides Tesla K80 GPUs, which you can use, and you can of course install additional libraries. This solves the problem of setting up the environment: CUDA, the Docker engine, runtime versions, and the usual yak shaving.
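(A trivial example of the kind of sanity check you'd run in such a freshly provisioned notebook; this assumes TensorFlow 2.x and PyTorch are among the pre-installed libraries, which may not match the actual images.)

```python
# Quick sanity check in a freshly provisioned notebook: confirm the
# pre-installed frameworks import cleanly and the GPU is visible.
# (Assumes TensorFlow 2.x and PyTorch are in the image; adjust to taste.)
import sys

import tensorflow as tf
import torch

print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__, "| GPUs:", tf.config.list_physical_devices("GPU"))
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```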
We're only using JupyterHub and JupyterLab for Python notebooks for now, as that's what our colleagues use, but we plan to support more. It also solves the "it works on my machine" problem of running a colleague's notebook. You can click a button to publish an AppBook and share it with a domain expert right away to play with. It is automatically parametrized for you, so you don't have to wire up widgets yourself: it generates form fields for the parameters. Parameters and metrics are tracked behind the scenes without you doing anything, and the models are saved to object storage. Again, one role we target is the ML practitioner who does not necessarily remember to do these things, so we do it for them. Here's a video from a very early version: https://app.box.com/s/mwsw79g3d5b974o625f1mw979cc4znf0

We're using MLflow for that, but plan to support GuildAI and Cortex. We think hard about keeping things loosely coupled and configurable, so you get to pick the stack and easily integrate the platform with an existing one. The AppBook is super useful in that you can publish it and then use it to train the model, or share it with a domain expert so they can play with different parameters. One of the problems we've seen is that some features are considered unimportant by an ML practitioner but are critical to domain experts. Tightening that feedback loop from notebook to domain expert is what makes the one-click AppBook important: it saves you scheduling meetings and figuring out how to "show" the domain expert the work, while still letting them interact with it.

You can also deploy a model of your choosing with one click; it gives you an endpoint and generates a tutorial on how to hit that endpoint to invoke the model with curl or Python requests. You can generate a token and invoke the model from other places or services. This self-service feature is important because it lets an ML practitioner "deploy" their own model without asking a colleague who might be busy with other things. Self-service is super important throughout.

Right now, we're focusing on fixing bugs, improving tests, and adding monitoring before going back to feature development. Some features we were working on: more flexible and scalable model deployment strategies, monitoring, collaboration, retraining, data streams, and building the SDK.

One of the problems is that the demos at PyData or Spark Summit and whatnot do not survive first contact with reality, and for really simple reasons. For example, some libraries expect a filepath for their data. Say you want to use Keras from a notebook and your data is somewhere other than on disk (because your job isn't to write blog posts on ML deployments, but to serve real clients who expect you to explore data, build, deploy, and manage models, then build applications on top of them that also look pretty, with money on the line, not toy projects): you suddenly have to dive into the framework internals to make it work with, say, object storage.

Another example: say your project is image classification and you have 100k+ images. MinIO does not support pagination because it's not really "S3", so you have to build pagination for the users, because you're displaying it like a directory and it must act like a directory. The way MinIO does it in their front end is to download the whole list recursively and then do an infinite scroll. This can be 20 MB+ of data over the network.
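(For concreteness, this is roughly the kind of paging shim you end up writing; a hedged sketch using boto3 against an S3-compatible endpoint, with made-up endpoint and credentials. Whether a given MinIO version honours continuation tokens is exactly the sort of thing you only find out by testing.)

```python
import boto3

# Hedged sketch: emulate "one directory page at a time" on top of an
# S3-compatible store, instead of listing hundreds of thousands of keys at once.
# Endpoint and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def list_folder_page(bucket, prefix, token=None, page_size=500):
    """Return one page of 'subfolders' and objects under a prefix, plus a token for the next page."""
    kwargs = dict(Bucket=bucket, Prefix=prefix, Delimiter="/", MaxKeys=page_size)
    if token:
        kwargs["ContinuationToken"] = token
    resp = s3.list_objects_v2(**kwargs)
    folders = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    objects = [o["Key"] for o in resp.get("Contents", [])]
    return folders, objects, resp.get("NextContinuationToken")

# Usage: keep calling with the returned token until it comes back as None.
folders, objects, next_token = list_folder_page("images", "training/cats/")
```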
Downloading the whole list works great if you have great internet bandwidth, but in a lot of parts of the world with maybe 1 Mbps (notice Mbps, not MBps), it won't fly when a user just wants to "explore the directory structure". Heck, one of our colleagues wasn't using the product, and when pressed, she said the notebook was taking forever to load. There were 30 megabytes of static files being downloaded and she had 5 kB/second during lockdown. We dug into rebuilding it, then compressing static files, and caching. And she was having trouble using the AppBook for real projects in vision, for example specifying a data source and having the boxes display properly.

One way we're developing the product is by going through the real projects we have worked on, with real data, and redoing them retrospectively on our platform to make sure it works on real problems. We're not optimizing for a demo at an event; we're optimizing for something that really works for us, because we don't have separate teams of "data scientists", "ML engineers", and "deployment engineers", and we want the couple of ML practitioners we have to get data projects running in a self-service way, which by definition means you inherit all the complexity you're trying to spare your users.

The same problems show up when you can't trivially create an "empty bucket". Users don't care that S3 is not the same as a filesystem; you're pretending it is by showing a "folder" icon, and you'd damn well better make it work like a folder where one can create structures for image classes and then traverse them. The API does not allow that, so you have to write the code to give it the look and feel of a directory, and you must thus write something that makes "pagination" work to display hundreds of thousands of images. And that's just 100k+ images, not millions or billions. But you wouldn't hit any of this with a hello-world example or the talk you give.

The deployment problem, for instance. Yes, you see the example and it looks great. Then you try to reproduce the example in the repo, and it does not work. Let's say you use MLflow to "deploy". It has a client and a server. As you'd expect, the client makes a request to the server, and the server does "things". But say you're deploying a model that lives in object storage: the object storage credentials must be configured both server-side and client-side. You can't just make a request from the client to save a model and have the server handle it in the backend with whatever storage solution you have. No, you must specify the object storage URL and credentials in the client code (a hedged sketch of what that client-side configuration looks like is below). Which means, if you don't want to play house, you have to proxy requests and authenticate them in a "man in the middle" fashion between the MLflow client and the MLflow server itself, just so your credentials don't leak. This would be mitigated by running MinIO in multi-tenant mode so each user has their own "object storage", but MinIO does not have an API for that (user creation, etc.); you must do it with their `mc` client. Which means you have to create this on the fly for each user and wrap those calls.

There's also the problem of workload scheduling, notebook collaboration, and versioning. You give 2 GB of RAM? OK. Users need way more. What do you do next? You give 100 GB of RAM? You make it elastic? How do you deal with runaway models (as opposed to runway or Instagram models) that are hemorrhaging your resources? You have to think about resource management and workload management.
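(To make that MLflow point concrete, here is a hedged sketch of the client-side configuration the stock setup expects when the artifact store is S3/MinIO; the server address, endpoint, and credentials are made up, and newer MLflow releases may offer artifact proxying that changes this picture.)

```python
import os
import mlflow

# The tracking server only handles run metadata...
mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder address

# ...while artifacts are uploaded by the *client* directly to object storage,
# which is why the bucket endpoint and credentials end up in client code/env.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.internal:9000"  # placeholder
os.environ["AWS_ACCESS_KEY_ID"] = "ACCESS_KEY"        # placeholder; this is what can leak
os.environ["AWS_SECRET_ACCESS_KEY"] = "SECRET_KEY"    # placeholder

with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_metric("val_acc", 0.93)
    mlflow.log_artifact("model.pkl")  # goes straight from the client to the bucket
```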
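(And on the resource management point, since we're on JupyterHub: per-user limits live in jupyterhub_config.py via the Spawner traits below, though only some spawners, e.g. KubeSpawner or DockerSpawner, actually enforce them. The numbers are made up.)

```python
# jupyterhub_config.py: a hedged sketch, not our actual config.
# mem_limit / cpu_limit are standard Spawner traits, but enforcement
# depends on the spawner (KubeSpawner, DockerSpawner, ...); the default
# LocalProcessSpawner just ignores them.
c.Spawner.mem_limit = "4G"               # made-up per-user memory cap
c.Spawner.cpu_limit = 2.0                # made-up per-user CPU cap
c.JupyterHub.active_server_limit = 20    # cap on concurrently running single-user servers
c.JupyterHub.concurrent_spawn_limit = 5  # throttle simultaneous spawns
```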
Do you impose quotas so that one user doesn't monopolize all the resources? How do you deal with real-time collaboration and versioning? Because, you know, you're working on real projects with real people. Do they have to version their notebooks? They don't know how to use Git when they do ML. Do you hack on the Contents API and write a custom ContentsManager? Do you dig through operational transformation or CRDTs to give it the look and feel people now expect for collaboration?

It is this stitching, and managing the idiosyncrasies of these fragmented tools, that makes the posts I read on data science Medium blogs and the talks I watch about machine learning lifecycle management completely shock me: I really would love it to be that way, but it simply isn't. Maybe it is when you're toying with a Jupyter notebook or on Kaggle, training a model on data on your disk, wrapping a Flask application around it, then writing a blog post on how easy it is.

Let's then say you have "deployed" your model with the super ML lifecycle management library, which really just starts a process and launches a Flask application (a minimal sketch of what that amounts to is at the end of this comment). How do you shut it down or manage it? What about drift? How do you retrain it? Do you use Airflow or NiFi or the like? Who configures them, the user? On what schedule?

So, yes... I understand why your question really is: "Since everybody has it figured out and blogs about it and demos it at conferences, am I that stupid, or is everyone full of baloney? Is there something everyone knows that I don't, or what?"
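(The promised sketch: roughly what such a "one-line deploy" amounts to under the hood, a process running a Flask app around a pickled model. The model file and payload shape are made up, and none of the shutdown/drift/retraining questions above are answered by it.)

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder: a pickled scikit-learn model sitting on local disk.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    payload = request.get_json(force=True)
    prediction = model.predict([payload["features"]]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Dev server only; managing, monitoring, and retraining are left entirely to you.
    app.run(host="0.0.0.0", port=5000)
```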