Running VSCode in Docker

binal.pub

125 points by binalpatel 7 years ago · 81 comments

marcus_holmes 7 years ago

> Last - and most important for me - in industries like my own (healthcare), you work with highly regulated data that has to be stored securely, where having multiple copies of data on multiple laptops can pose an unacceptably large risk.

um... please tell me devs don't have access to production data in a healthcare environment (of all places!).

I mean, I understand the need for a representative dataset to develop and test against, but this is people's lives they're playing with!

And, you know, if you had a decent set of anonymised or fictitious customer data to work with, you wouldn't need to run your IDE in docker, and there would be less surface area for attackers to get to the data.

  • kuzehanka 7 years ago

    > um... please tell me devs don't have access to production data

    If developers don't have access to production data, then the solution is useless. How do people not understand this in 2019?

    Millions of dollars have been dumped into various products centred around the idea of synthesizing 'production-like' data and all have failed. Because synthesizing fake data destroys the signal that makes the original data useful in the first place. If the engineers don't have access to it, then they can't extract the value from it, then what the hell are you doing in the first place?

    You think if you give engineers a synthetic dataset and they build a blind solution around it, that the users will be able to extract value out of that data? That myth was dispelled a decade ago, and there hasn't been a single synthetic-data success story since.

    I've had clients coming to us with the notion that they can give us fake imaging data and we can generate diagnostic insights from it. This crap needs to stop. If you can't trust engineers with your data to extract value out of it, then go ahead and munge it in excel.

    • srndh 7 years ago

      I share the same view.

      When I had this same discussion, the response I got was that it's not a matter of trust. Apparently the terms & conditions (which consumers, myself included, don't read anyway) say the data may only be viewed by someone using it for the diagnosis or treatment of the patient. Basically, only someone with a medical degree who is part of the team treating the patient can access the data; nothing is said about an engineer using the data to build a better system for healthcare. So, ideally, they want a candidate with a double degree in software and medicine. The same kind of clause applies in other domains too.

      I do not accept the reason, but that is what it is.

    • MuffinFlavored 7 years ago

      > If developers don't have access to production data, then the solution is useless. How do people not understand this in 2019?

      ... what?

      • jldugger 7 years ago

        Think machine learning.

        • inetknght 7 years ago

          I am thinking machine learning. I am thinking I don't want a developer to run machine learning algorithms in an (unsanitized) development environment with access to production data.

          • kuzehanka 7 years ago

            How do you train, validate, and market-test an ML model without access to production data? Please teach us all; this is literally a billion-dollar industry question.

            • ggregoire 7 years ago

              You duplicate the data in dev but remove the names, SSNs, emails, addresses, phone numbers, etc?

              • jldugger 7 years ago

                Even that's not enough; in previous cases we've seen researchers augment datasets with other data to de-anonymize them. https://www.wired.com/2007/12/why-anonymous-data-sometimes-i...

                But I think in general the idea OP was dismissing was generating synthetic data rather than attempting to anonymize prod data. In that case you have a risk of modeling the generator rather than your users.

              • kuzehanka 7 years ago

                PII is only one type of sensitive information, there is a lot more.

                The bottom line is that if you want to draw insights from data, that particular data must make its way to the engineer unadulterated. If you de-risk some fields in your data via removal/masking/entropy, then you are excluding those fields from ML.

                You are not talking about blinded machine learning, you're talking about not doing machine learning on sensitive data in the first place. The whole discussion is moot.

            • inetknght 7 years ago

              If it's literally a billion dollar industry question, then hire me. I'm not interested in continuing to feed corporations for free.

    • nevir 7 years ago

      FWIW, as an example, Google developers don't have direct access to production PII data.

      The best they get are aggregations (sufficiently large that you can't identify a person). Specific user information is only available if the user explicitly opts in to share it, and access is scoped to the specific case under analysis.

      It makes training/diagnosing ML models a challenge, to be sure

      • kuzehanka 7 years ago

        Let's be really clear here. It doesn't just make ML solutions a challenge. It makes them impossible.

        Even with all the compute power and automation of google, you can't blindly create an ML model and say yes it works without a data scientist actually looking at the guts with row-wise access to the training/validation data.

        You can't do analytics without giving engineers access to prod data. Not even at Google.

        • nevir 7 years ago

          And yet, they do.

          Source: worked there, on a ML focused team.

          • kuzehanka 7 years ago

            Go on, tell us how it works so we can all start doing the same and save millions of dollars in risk/compliance crossing enablement.

            • nevir 7 years ago

              https://cloud.google.com/solutions/sensitive-data-and-ml-dat... has some high level guidance on different techniques that can be applied, depending on the type of data you're working with.

              • kuzehanka 7 years ago

                I actually know this article, so I'll summarise it for the readers.

                It covers de-risking data that isn't required for ML models or is of low impact, through removal, masking, or entropy injection. Every single one of these approaches is detrimental to an ML model and just makes it useless, unless the fields being controlled aren't correlated with the outcome in the first place.

                This article is an acknowledgement of the issue and of the fact that solutions don't exist, only risk management strategies that disable potential solutions.

                Their ultimate answer is 'protect the data and give it to only a small restricted set of people'. Right back where we started.

                • Fireflite 7 years ago

                  Injecting entropy into your data in order to provide differentially private access actually typically improves model performance, rather than degrading it.

                  This behaviour is counterintuitive, but effectively you're enforcing a degree of regularization, in a way similar to how data augmentation works. You'll get reduced training set performance and improved test set performance, as with other forms of regularization, until you make it too strong.
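
                  For reference, the canonical way to "inject entropy" here is the Laplace mechanism: a query f with sensitivity \Delta f and privacy budget \varepsilon is released as

                      M(x) = f(x) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)

                  Smaller \varepsilon means more noise and stronger privacy; the added noise ends up playing much the same role as the jitter introduced by data augmentation.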

  • hjk05 7 years ago

    If you only work with fictitious data, the conclusions of your analysis and models... will be fictitious.

    Even anonymised data (which most healthcare data is, btw) has strict requirements on access control and copying.

    As for “develop and test against”: there’s the entire field of data science and modeling that’s focused not on developing systems to expose the data, but on doing analysis of the data.

    • onion2k 7 years ago

      > If you only work with fictitious data, the conclusions of your analysis and models... will be fictitious.

      That doesn't have to mean it's worse though. Testing with 10,000,000 fictitious accounts generated to cover lots of permutations of user data is a great idea even if the real data won't ever have some of those permutations. It's a testing technique called "fuzzing", and it's quite common.

      • IanCal 7 years ago

        It entirely depends on what you're trying to deal with. You may solve some problems there but looking at real data is a must for solving others.

      • achompas 7 years ago

        How do you ensure your fuzzed data has statistical signal that approximates what's present in the true data?

      • hjk05 7 years ago

        That’s testing, not modeling. If someone asks you to do analysis of a drug trial and you come back with a report based on fictitious user data because “devs should not have access to real data”, you’ll be laughed out of the room.

    • Tenoke 7 years ago

      >If you only work with fictitious data, the conclusions of your analysis and models... will be fictitious.

      Not necessarily; there is a lot of work, both in academia and increasingly in industry, on synthetic data that closely matches the properties of the real data.

    • marcus_holmes 7 years ago

      sure, but are data scientists analysing vast amounts of actual customer data really using VSCode?

      • hjk05 7 years ago

        Yes... What do you think they’d use? Those who prefer Python are using PyCharm and VSCode, and then some use Jupyter, while others use RStudio, Matlab or Mathematica. Of course you also have some doing emacs or vim, but that’s a minority.

      • jldugger 7 years ago

        They tend to just use a browser IDE like Jupyter... =(

  • bmj 7 years ago

    > And, you know, if you had a decent set of anonymised or fictitious customer data to work with, you wouldn't need to run your IDE in docker, and there would be less surface area for attackers to get to the data.

    I see this particular development environment as an advantage for this particular situation. I work in health care, and in R&D we often have to help debug issues with a client implementation, which means having an anonymized snapshot of their data in our reporting system. Typically, we end up passing around DB backups and zip files of the client-specific code. It would be significantly easier to fire up a Docker container that was ready to go.

    • fs111 7 years ago

      so then everybody has a copy of the docker container and access to the data too. How does that improve things?

      • IanCal 7 years ago

        Unless I misunderstood, the idea is to connect to a single place running the docker container, not that you bundle it and share the image.

  • 0xDEFC0DE 7 years ago

    >um... please tell me devs don't have access to production data in a healthcare environment (of all places!).

    Have seen it quite a few times, but there needed to be a good reason and other options had to be exhausted. A few examples were due to zero-width characters that got stripped by whatever tool anonymized the data, which broke something in the program.

    There are brittle systems in healthcare.

  • FrozenTuna 7 years ago

    They do. In my experience, every query containing PII was logged. I trust it since I got a stern talking to when I emailed my private email some code I wrote for an unrelated side-project while at work. Lesson learned.

    • Kuraj 7 years ago

      > I got a stern talking to when I emailed my private email some code I wrote for an unrelated side-project while at work.

      I made the same mistake once. We should be glad we weren't fired.

  • munchbunny 7 years ago

    You don't want standing access to this data. That would be a very bad idea.

    On the other hand, you do need devs to be able to look at real data when it's absolutely necessary (generally with a code-enforced gatekeeper and an audit trail). And you need to make sure a dev can SSH into a node for repair/maintenance (again with an audit trail).

    That's the argument for using one of the many Identity and Access Management tools/providers out there, including the systems that come baked into the cloud providers.

  • binalpatelOP 7 years ago

    You certainly read a lot into that statement. Who said the data isn't anonymized and de-identified? It's not a matter of just stripping identifiers off data and sending it off left and right; even then it's better to control access in a central, secure area. Even if you've de-identified to an individual level there's inherent risk; that's why accessing even "public" datasets like MIMIC-III is still process heavy.

  • bazylion 7 years ago

    I work in healthcare and you are right, this article is bullshit. 90% of the time we work on data that comes from expensive medical simulators. When we need specific real cases (e.g. for feeding to machine learning algorithms), we use real data from patients, but this data goes through a process of full anonymisation done by a special team. Devs don't have access to production medical data; regulators would kill us otherwise.

  • johnmurray_io 7 years ago

    Not sure about VSCode, but I use CLion's remote-development workflow with a local docker container. Works well enough for my purposes.

laughingman2 7 years ago

People who want to work remotely with restricted setups: check out Emacs. TRAMP mode allows you to edit files over ssh, docker, adb etc. without worrying about anything.
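
For example (the host, container name and paths below are made up), opening a remote file is just a matter of handing Emacs a TRAMP path; the /docker: method may need the docker-tramp package on older Emacs versions:

    # edit a file on a remote host over ssh (TRAMP handles the transfer)
    emacs "/ssh:me@devbox:/etc/nginx/nginx.conf"

    # edit a file inside a running container
    emacs "/docker:mycontainer:/app/config.yml"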

And if you haven't tried Org Mode, it's no exaggeration to say it's life-changing. It can help you organize notes, todos, agendas, etc.

  • nine_k 7 years ago

    Emacs is great in many ways; I'm saying it as an avid user.

    For the ultimate multi-workplace setup, you can run Emacs in server mode on a cloud instance and either allow network connections to it or port-forward to its Unix socket via SSH.

    Now you can run Emacs in client mode from whatever machine you may have, several of them, or ssh to the cloud box and run Emacs in terminal mode in a crunch. All your sessions will share the same set of files, but workspace layout is per client, so you can work comfortably both from an 11" laptop screen and from a 27" 4K screen.
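
    For the simplest variants of that (the host name below is made up), something like:

        # on the cloud box: start Emacs as a daemon (server mode)
        emacs --daemon

        # from any machine: attach a terminal client over ssh
        ssh -t me@cloudbox emacsclient -t

        # or, with X forwarding, ask the daemon for a graphical frame locally
        ssh -X me@cloudbox emacsclient -c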

    As said above, you can use tramp to access whatever other remote files are accessible via ssh, and also run a decent (though a bit limited) terminal right from Emacs, to say nothing of running REPLs of all kinds directly, plus excellent git integration with Magit.

    This can even give you a sort of VPN-like access, when the cloud box where the Emacs server runs has access to machines that are not directly accessible to you from the machine you're connecting from.

    OTOH VS Code likely can be run in a similar setup.

    In general I very much like the modularization of IDEs: instead of a monolith from the 1990s, you can mix and match your favorite editor with language servers, REPLs, build servers, etc., all separate and in many cases running remotely.

    • jacobush 7 years ago

      For me (20 years of Emacs), Emacs + Keybase has been what made it into a bit of a productivity tool for me, instead of just an editor.

banana_giraffe 7 years ago

I was just playing with Coder's VS Code fork (what this solution uses for VS Code) the other day [1]

I want to love it. It makes a very specific use case I use much nicer. I can leave code on a remote server with all the compute power I need to build and run my project, and edit the file I'm working on with VS code's editor without having to sync files around. It does, however, have a few big caveats that killed it for me.

It doesn't block any of the browser things that would leave the webpage. Notably, if you hit Ctrl-W to close a file tab because of muscle memory, you'll close the browser tab. Also, if you hit back on accident like I apparently do all the time, you'll go back to the blank tab page. In both of these cases, you'll lose any unsaved state.

Also, the extension repo it's pointing at isn't MS's live repo. There are apparently reasons for this, but it means you don't get the latest version of extensions, which was annoying for a specific extension I've gotten used to.

I also had issues with VS Code getting confused about state when my connection to the remote box was less than ideal.

All in all, I really wanted to like it, but for truly remote cases, I'm back to using Mosh to interact with the remote box, and a simple tool I wrote ages ago to handle rsyncing the local files to the remote box to build and run them there.
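
The rsync half of that workflow is roughly this kind of one-liner (host, path and build command are placeholders):

    # push local edits to the build box, then build there
    rsync -az --delete ./ me@buildbox:~/project/ && ssh me@buildbox 'cd ~/project && make'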

[1] https://coder.com/

  • larrywright 7 years ago

    I played with this too, and was really excited about it... until I discovered that it doesn't work on the iPad Pro (known issue with one of VS Code's core components).

    It's promising, but it's got a little ways to go.

  • znpy 7 years ago

    Check Eclipse Che out; it’s so much better.

batmansmk 7 years ago

Interesting case for cloud based IDEs.

I really don't understand the localhost use case though. I'm on macOS. Why would I spawn a VM (Docker for Mac) with limited access to my system (the container promise) just to run an editor inside that VM?

I only end up with a resource- and disk-space-hungry, slow, and inconvenient editor?

alias_neo 7 years ago

The title of the article is ever so slightly misleading.

It would lead one to believe that VSCode, and thus by extension VSCodium, could be run in Docker and accessed from a web browser.

In fact, what you can run is "Coder" (https://coder.com/), a product which, according to their GitHub, had some non-trivial effort put into it to make it run this way.

Not least of all, looking through their issue list, is the fact that they compile extensions themselves, and these are therefore somewhat outdated (according to issue comments from their users).

It's nice, but it's not VSCode per se, and sadly it means no dice for Codium users.

cheesedoodle 7 years ago

I'd love to use the VSCode IDE launched from the host and compile C++ code within Docker. Is this possible? Currently, I write code in the IDE and compile in the container from the terminal. Imagine that: cross-compile from any host in a contained C++ environment. :)

Edit: I use CMAKE_TOOLCHAIN_FILE to describe the target env.
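
The compile-in-the-container part can at least be reduced to a single command the IDE could invoke as its build task (the image name, toolchain file and paths below are made up):

    # sources stay on the host; only the configure/build steps run inside the container
    docker run --rm -v "$PWD":/src -w /src my-cpp-toolchain:latest \
        sh -c 'cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake && cmake --build build'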

  • nacs 7 years ago

    Code-server, which OP is building on, should support any language.

    I've only used it briefly for some node.js work but as the whole thing is just sitting in a normal linux docker container, you should be able to do anything docker/linux can do.

    https://github.com/codercom/code-server
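
    A rough sketch of how it's started (image name, port and mount point are taken from memory of the README and may have changed, so check the repo):

        docker run -it -p 127.0.0.1:8443:8443 \
            -v "$PWD:/home/coder/project" \
            codercom/code-server
        # then open the URL and password it prints in a browser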

paulcarroty 7 years ago

Flatpak is much more interesting than Docker for GUI apps, especially for its sandbox features.

znpy 7 years ago

Did you know it is possible to run a full-blown Eclipse in the browser?

https://www.eclipse.org/che/

quaffapint 7 years ago

What's the 'best' way people are using to have a fully portable development environment that I can reach from home or work? Meaning having VS Code, Node, and all the various local cloud emulators/etc. that you would normally install for doing that kind of dev work.

  • laughingman2 7 years ago

    Emacs has TRAMP, which allows you to remotely edit files over ssh with your own setup. You have Emacs installed on your computer, and you can open any file on any system you can authenticate to over ssh.

    I am running emacs with spacemacs.org

    • irth 7 years ago

      I think they meant the other way - accessing the editor/IDE remotely?

      edit: or even not necessarily remotely - just a way to have the same setup everywhere you go

    • Tistel 7 years ago

      tramp is amazing. it even works with docker containers now!

jlu 7 years ago

Do the keyboard shortcuts for VSCode still work inside the browser?

herohamp 7 years ago

This seems very promising. With C9 being moved to AWS soon, I might look into building an internal version of it powered by launching VSCode instances.

black-tea 7 years ago

Why on earth would you do this? Docker is such a misunderstood technology.

  • reilly3000 7 years ago

    From Coder.com

      Code on your Chromebook, tablet, and laptop with a consistent dev environment.
    
      If you have a Windows or Mac workstation, more easily develop for Linux.
    
      Take advantage of large cloud servers to speed up tests, compilations, downloads, and more.
    
      Preserve battery life when you're on the go.
    
      All intensive computation runs on your server.
    
      You're no longer running excess instances of Chrome.
    
    
    I imagine not everybody is going to want to run this on some Kubernetes cluster. The ability to do this locally seems like it could be really productive, actually. And having it in Docker can provide snapshotting via `docker commit` as well as the ability to cap its cpu/ram resources.
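
    Something along these lines, for example (the image name and port are assumptions):

        # cap the IDE container's CPU and RAM
        docker run -d --name my-ide --cpus 2 --memory 4g \
            -p 127.0.0.1:8443:8443 codercom/code-server

        # later: snapshot the whole environment (tools, extensions, config) as an image
        docker commit my-ide my-ide:known-good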

    I might actually try this with a Docker registry to get some semblance of an editor per project. In some contexts I want to run many, many extensions, but for other work I'd rather not have that bloat to contend with. Also, I've been really feeling the pain of navigating a PC running Unraid (lots of bare metal VMs) and a Mac laptop, trying to do development on each. My desktop is beefy, but I need to work on the go sometimes, and at times I need to use a Windows box. Right now they all have different VSCode setups. I've been meaning to get around to some scheme for making my config portable, but with different paths across Ubuntu, macOS, and Windows it seems a bit daunting to get all of my dep paths straight, like eslint and phpcs.

    Okay, enough comment writing, I'm giving this a go.

    • reilly3000 7 years ago

      Some updates for those who are interested:

      1. I first tried to install this on my Win10 VM, which needed to have Docker installed. That was a terrible idea. I completely broke my VM as Docker tried to enable Hyper-V. Friends don't let friends attempt nested virtualization. I should have just run the container on the host instead, which it supports quite well.

      2. The repo worked as the blog post described on my Mac. It's quick and has been able to run some tricky extensions. I still need to experiment with running some external dependencies.

      3. Docker commit worked nicely, making a layer for the changes I made. Still playing with this, but wow that could be very productive if it enabled me to roll back to a tested base environment, or share a full IDE image with somebody on my team.

      • dalore 7 years ago

        Docker for Windows uses Hyper-V for its virtualization. It works.

        But if you're running Win10 virtually then it's not going to be able to run Hyper-V; that's a limitation of Windows and Hyper-V, not of Docker.

        But if you're running Win10 in a VM, why not just have the host start a Linux Docker VM instead of Hyper-V in Hyper-V?

      • viraptor 7 years ago

        > I completely broke my VM as Docker tried to enable Hyper-V. Friends don't let friends attempt nested virtualization.

        Is that not supported by Hyper-V? It's not common, but as long as your hardware and kernel support it, Linux/KVM works with nested virtualisation out of the box.

      • binalpatelOP 7 years ago

        Glad it worked for you! Hadn't even thought about Docker commit, but the latter was one of the ideas I had in mind for this (not just share code with teammates, but also a fully working environment, IDE included, that they can jump right into).

    • flukus 7 years ago

      So it's basically installing an IDE on a server and remoting into it? Are docker and VSCode bringing anything to the table?

      • reilly3000 7 years ago

        Yes and there are a few self-hosted web based IDE options out there. I just happen to like VSCode, mostly for its flexibility and extensions.

        Docker just makes it convenient to deploy with a run command vs making a local server on multiple machines. I haven’t tried it yet, but being able to snapshot a running machine also seems incredibly useful.

  • coldtea 7 years ago

    To have the exact environment you are building / targeting in.

    E.g. you can't (easily) have autocomplete in C++ on VS Code on your Mac if your project doesn't target Mac and can't build there (or doesn't have the dependencies etc).

    But you can do it inside a Docker image.

    • binalpatelOP 7 years ago

      100% right - part of what I wanted to experiment with was being able to share not just code but everything related to the environment (IDE included), and be able to check that into version control as well, so there's little to no friction in mundane setup tasks like linters/installing packages/messing with config and so on.

      That being said it's still an idea, but here's hoping it works!

    • black-tea 7 years ago

      So you're going to run a separate copy of an IDE inside a container just so you can target that platform? I think you need better tools.

  • kkarakk 7 years ago

    >Why is this useful?

        You can develop all your code in a fully specified environment, which makes it much easier to reproduce and deploy models and analysis.
        You can (after enabling security) move your IDE to the data. Instead of transferring data back and forth you can develop where your data is stored.
    
        Last - and most important for me - in industries like my own (healthcare), you work with highly regulated data that has to be stored securely, where having multiple copies of data on multiple laptops can pose an unacceptably large risk.
    
        Running containers like this within a secure environment with access to the data helps us to have an ideal development environment, while ensuring the protected data remains in a secure, single location with no unnecessary duplication.
    
    The article says it right there, whereas you haven't explained why this would be a bad use case. Maybe it's wasteful, but if a person wants additional security via ephemerality then it seems fine.

  • oaiey 7 years ago

    I also do not see the purpose in running the frontend in a docker container ... but once you consider the backends I would recommend a read of the Eclipse Che Architecture.

  • colechristensen 7 years ago

    Linux cgroups is such a misunderstood technology.
