Webinars
7 Oct 2021

Live demo: Cloud-based dev environments for data science teams

Overview

Cloud-based development environments aren't just for software developers! In this demo, learn how Coder can simplify and streamline data science workflows by allowing data scientists and data engineers to create fully configured workspaces with the click of a button.

We'll cover some of the reasons data science teams love Coder, including:

  • Easy-to-access, fully-configured, shareable workspaces hosted on your cloud
  • Access to a variety of IDEs, including VS Code, PyCharm, RStudio, and Jupyter
  • GPU support for intensive compute power when crunching large data sets
  • Workspaces co-located with your data for faster access
  • All intellectual property and customer data hosted on your infrastructure

Transcript

I'm pretty excited to show you again how Coder is helping data scientists simplify their environments and run their experiments and training in a way that's more efficient, a little easier to use, and, you know, a lot less complicated to set up.

So I'm Thomas Hughes, Senior DevOps Engineer at Coder. I wear a few hats over here, I guess, but that's one of the ones that I have for sure. On camera, I'll be walking you through the demo today, and I have on the line with me as well Ben Potter, our Developer Advocate, and we're happy to take any questions in the chat during the entire process here. But we also have a Q&A at the end. We're going to send out these slides afterwards, we're going to send out a link to the recording, and we're actually also going to include some information about getting a free virtual pass to KubeCon, which is going on next week, if you're interested in that. So stick around, and we'll make sure we send it out to all the attendees afterwards.

So we're just going to do a quick overview of what Coder is, just to give everyone a baseline of what our product is and how we see it fitting into this data science world. I'm going to do a live demo -- that's the point of this live demo, right -- showing a Jupyter Notebook and some data analysis around IMDB movie ratings, something kind of fun that we can relate to. We'll have a little bit of wrap-up discussion and the Q&A section, like I said. And then there's a resource slide at the end that has some information, like a blog post that we did and things like that.

So, you know, at a high level, what is Coder? Again, I want to level set everyone here. What Coder is doing is taking the developer experience -- the data science experiment experience -- off of your workstation and moving it into the cloud. So: cloud-powered development environments that are built from a standardized container image that gives you all the tools that you need, and it doesn't matter what you connect from, because everything is running in the cloud on your Kubernetes infrastructure.

So what this means is that all of my development is happening on the same network, very close to the data warehouse where my models are, for example, or my data is that's being passed through my neural networks and things like that. All that compute power is available too, so I don't need to worry about a really beefy machine locally and having a whole GPU farm that I'm running locally here. But instead I can have like, let's say, a Chromebook and connect to my Kubernetes cluster that has 128 cores and 256 gigs of RAM and a whole bunch of hard drive space and maybe five or six GPUs attached to it or something like that. I can actually do all my computational modelling that way and do my training and stuff like that. So when it comes to speed and efficiency it's great for that.

And because everything is built from a container image -- and I'll show you a template feature that we have to build out these workspaces -- all of your teammates are going to be using the exact same base as well. So you eliminate variance in your experiments. One common problem in data science is tracking experiments and collaborating across teams, and having this repeatable workspace that we can all build from, plus the power of cloud compute to speed the whole process up, helps normalize our experiments.

So, other things to call out: we support VS Code, of course, but there's also Jupyter Notebook, RStudio, IntelliJ, PyCharm -- any JetBrains IDE, including an early access preview of DataSpell, the new data science IDE that JetBrains put out there. So lots of cool stuff.

We support all of those in the browser, like you'll see here, and you can also connect local editors and things like that. I did something using the Spyder IDE, as an example. So there's a lot you can do within Coder and a lot of cool ways you can be creative with this.

So with that said, let's jump over to the fun part, get out of the slides, and move over to our demo. Basically, I wanted to show the idea of getting started on a data science project, and like I said, this one is IMDB movie ratings. I actually took this notebook from a data science article that I found on a blog. Basically, it's a CSV of IMDB movies, and it's taking a look at the rating each movie had based on its budget. So we're going to go through and run some Matplotlib stuff and things like that to see how this works.

So typically, if I was getting started on this type of project, I might need to install Jupyter Notebook locally. I'd have to clone down this repository and get my whole data set here -- so in this case the CSV would have to be local -- make any changes, and stuff like that.

But with Coder, it's really easy. I can go down to this getting started section and click Open in Coder. This is going to redirect me over to Coder itself, and it's going to prompt me for just a couple of things to create my workspace. So I'll go in here and call this IMDB-dataset -- I can call it anything I'd like. It's going to ask me which organization I want it to go into. This is Coder's way to logically separate teams, so you can see I have quite a few available in here: we have a data science team, for example, an Enterprise team, DevSecOps, etcetera. I'll just leave it in our default Coder organization here. Which provider? We do have this idea of being able to have multiple Kubernetes clusters available, and I can talk about that a little bit more later. But one of the key things is maybe I have some specific hardware that this model set needs to run on, so I want to make sure I pick the right provider that has that hardware underneath. For now I'll use our default built-in provider, and I'll click create workspace.

So what this is doing is reading from a template file that was in the repository. It is building out -- here we go -- it is building out my workspace right now by pulling a container image that I wrote that has all the editors that I'm going to need. It's going to have all the tooling I need, the Python modules, things like that. And because it's a container, everyone else that builds this from the template is going to have the exact same thing. The template tells it how many cores to add to this workspace, how much RAM to allocate, how much hard drive space to allocate. And I could even allocate GPUs as well, like I mentioned. So, really cool. Ben actually helped me out with part of the Dockerfile, too.

So I'll go over that and kind of what that looks like behind the scenes in a moment here.

But the last step of this is assigning a persistent volume claim, which will basically preserve all of my in-flight work. So if that image updates, I can rebuild this and not lose anything that I've been working on. And then it also clones down that repository that has all my data here. So again, all of this is happening behind the scenes -- I haven't installed anything; I'm connecting over a browser.

And because of this, I can do this from an iPad; I can do it from, again, a Chromebook or a thin client. I don't need to have, you know, a $2,000 MacBook Pro local to me. Instead of going through the whole "let's issue you a laptop, let's issue you some really expensive hardware and get you all set up so we can get your data science lab going," all of this can happen in the cloud instead. So it saves us a lot of setup time and all of that.

So while I was speaking there, just, you know, about a minute or two, I guess at most, we were able to get this full workspace set up and we're actually ready to start working on our project.

So we've got a few applications that I put into this image here. We've got VS Code, PyCharm, Jupyter Notebook, and then we actually have a terminal as well. Coder does have a command line, and you can connect all of that locally to your remote workspace and issue commands that way. But you can also just click on this terminal here and get all the same stuff you'd be used to. One of the things with data scientists is that you probably aren't always using the terminal; you're used to doing things like notebooks and Excel sheets and stuff like that. So, one of the other cool things: again, without having to install or set up anything extra, I can just click on Jupyter here and we'll actually see Jupyter Notebook, which will look very familiar to anyone that's used it before.

And you can see I have a few files in here, including that repository that got cloned down. You can see my CSV files in here, some other information, but more importantly, that specific Jupyter Notebook file itself -- and let me make that just a tad bit bigger. There we go. So here's my Jupyter Notebook. This is the file that's been source controlled, that, you know, my team's been using on this repository here. You can see that this data set came from Kaggle, which, if you're into data science, you've probably seen before. But we're basically looking at top-ranked movies from IMDB specifically. And again, we're going to be doing a comparison of budget versus rating and doing that type of work here.

So basically, I wanted to show that you can run the same stuff you're used to running normally. I can run that import statement, run this head command, and check out the top values of my data set. I can do some normalization in here as well: check out my types, drop values as needed, and replace things to normalize the data. Do some extra stuff here with string replacements, and, if I keep running through here, even do a scatter matrix.
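To make those steps concrete, here's a minimal sketch of what those notebook cells typically look like -- the file name, column names, and replacement values are hypothetical stand-ins, not the exact ones from the demo:

```python
import pandas as pd

# Load the IMDB data set from the cloned repository (hypothetical file name).
df = pd.read_csv("imdb_movies.csv")
df.head()  # inspect the top values

# Normalization: check types, drop rows missing the key columns,
# and strip formatting so budget parses as a number.
df.dtypes
df = df.dropna(subset=["budget", "rating"])
df["budget"] = (
    df["budget"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
```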

And I have Matplotlib installed on this container inside of this workspace, so I can go ahead and do my scatter plot here as well, do some baseline modeling, and check that out -- it generates it for me right in the notebook. Any tweaks that I make to this notebook would obviously also show up in here, which is great. These outputs are here because this notebook's been run before, so obviously we have the same versions here, but I can make my tweaks to this data set and rerun this stuff if I needed to. So if I need to update my CSV file, I could do that -- basically do the whole analysis here all the way through. And you can see, as we keep scrolling down, I'm doing trend analysis here through another scatter plot, and can do some secondary modeling in there as well, that type of thing. So it's kind of a cool, practical use: is there a relation between a movie's budget and the rating it receives on IMDB? Which is kind of neat.
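A sketch of those plotting cells as well, again assuming the hypothetical df and column names from above:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Scatter matrix across the numeric columns, as shown in the notebook.
scatter_matrix(df[["budget", "rating"]], figsize=(8, 8))
plt.show()

# The core question of the analysis: budget vs. IMDB rating.
df.plot.scatter(x="budget", y="rating")
plt.show()
```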

So with that said, I talked about being able to edit your CSV, and you can see that we're just reading that file directly here. But this could be pulling from, you know, a data warehouse or something like that instead; this could be a huge data set. If I go back to my workspace, I'll open VS Code for this, just for quickness and ease, and edit that CSV file -- we'll see it's not too large. It's a pretty small data set, about 119 lines, so not too much in this case. But in a general sense, maybe I'm working with thousands and thousands of rows, right? Most data science isn't going to be just 100 lines; that's not necessarily the best data to use as an input. So when you have that huge data set, you want it located close to your workspace, so the time it takes to do all this computation is a lot quicker: you're not doing something locally, you're not bringing it over a VPN and then back over the VPN as you're sending data back and forth.
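As an illustration, swapping the local CSV read for a warehouse query is a one-line change in the notebook -- the connection string and table here are hypothetical, assuming a Postgres-style warehouse reachable from the workspace (and a driver like psycopg2 in the image):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse on the same network as the workspace, so the
# query results never cross a VPN.
engine = create_engine("postgresql://analyst@warehouse.internal:5432/movies")
df = pd.read_sql("SELECT title, budget, rating FROM imdb_top_ranked", engine)
```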

So if we take a look at the architecture diagram for Coder, just to give an idea of how this is all working: the users' workspaces are represented in this top block here, and, again, you can connect your command line or just do the browser connection like I'm doing. Everything happens over HTTPS, so it's all secured with SSL, within the Coder daemon that's running on your cluster. That's actually going to connect to all of your data: your PostgreSQL database for Coder, of course, but also any other database that your workspace needs. You can have your own container registry -- I happen to be using Docker Hub, but it could be a private container registry as well that has all my images on it. And then of course I can have those additional providers like I talked about, that say this is the cluster I want to deploy into, and maybe that cluster has, again, specific hardware configurations or something like that. And then finally we have our data plane, where all our workspaces are. So this is all in a secure, behind-your-firewall environment, where the interconnections between those databases -- or the data warehouse in general -- and your workspace are going to be a lot quicker. And this should help improve model training time and things like that, as well as giving you that cloud compute.

So my workspace specifically here has -- that's the wrong one -- this one here has four cores and 8 gigs of RAM. But I could have very easily built this out to have 128 cores and 256 gigs of RAM, basically whatever the highest allotment is. So I can do all of this and train my models and really have that beefy system. And the great thing is I can rebuild these as many times as I want. It's obviously expensive to have a GPU farm in the cloud and to have all those VMs spun up with so much compute power, so I can just hit the stop workspace button and it basically puts it in a standby mode, where I can rebuild it and jump right back in where I was afterwards -- to save on infrastructure costs and everything there.

So, to kind of wrap this up, I do want to show, like I said, how this template worked specifically -- the slide does say "built from a template." Now I'm going to jump over to my GitHub repository again. And within my repository we had this coder.yaml file that specifies our workspace as code. So it's a workspace template. It's going to give us information in here on how we want these developer -- or data science -- workspaces to look. In this case we have the image specified; again, this one's coming from Docker Hub, and it's a data science sample that I did specifically. I have my CPU, memory, disk, and then, again, GPUs I could add in there as well. So if I said I wanted those 128 cores instead, I would basically just change this value in source control, and when I build from this template, everyone will get 128 cores instead.

The neat thing about this is that it's taking that DevSecOps practice -- or just the DevOps practice -- of infrastructure as code, and applying it to data science workspaces.

So again, every data scientist that builds a workspace here is getting the exact same base image and then we're getting the exact same base hardware to run that image on. So it's a really neat way to be able to kind of keep track and normalize that piece and eliminate some of the variables in our experiments.

We can do some labelling, too -- Kubernetes allows that -- so we can do things like chargeback groups, or label specific Python versions, things like that. Maybe I need to add a specific label for this workspace that has to do with a special project or something like that as well.

And then of course you can run a bunch of different commands. So my Dockerfile has a bunch of packages and stuff in it, which I'll show you in just a second, but just to show some of the use cases here, I'm also doing a lot of pip installs so we can get, again, that Matplotlib, and then I added some other stuff: Flask, Django, things like that. So it's pretty neat, and the nice thing is that this is fully auditable. Our ops team can handle updating this, so the data scientist doesn't really have to worry about any of this piece. As long as this image is up to date and has everything we need for a project, we're constantly able to build from it, like you saw, just by clicking that "Open in Coder" button.
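To give a feel for the shape of that file, here's a rough sketch of a coder.yaml along the lines of what's being described -- the schema has evolved over time, so treat these field names as illustrative rather than the exact spec:

```yaml
# Illustrative sketch only -- check Coder's workspaces-as-code docs for the real schema.
version: 0.2
workspace:
  type: kubernetes
  spec:
    image: docker.io/example/datascience:latest  # hypothetical image name
    cpu: 4       # change this to 128 in source control and everyone builds with 128 cores
    memory: 8    # GB of RAM
    disk: 40     # GB of persistent storage
    labels:      # e.g. chargeback groups or a Python version tag
      com.example/python-version: "3.9"
    configure:
      start:
        - name: install extra packages
          command: pip3 install matplotlib flask django
```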

So let's jump over to that Dockerfile real quick, just to show what's in there. On this side we can see a very standard Dockerfile, if you've ever worked with one before. Again, your ops team might handle this rather than the data scientists themselves -- though, as Ben was saying at the beginning, sometimes it's the other way around: software developers are being told they now need to do some data science or something like that. So I want to make sure we cover this, too. This Dockerfile specifies all the tooling that we're going to use -- things like Python, obviously, are in there, and stuff like that. I do some user management for Coder's user, do things like installing Jupyter, installing PyCharm, and then you can see I actually had DataSpell in here as well, commented out right now. It is an early access preview, so I figured, for live demos, maybe let's comment it out and not risk a preview, I guess. But in a general sense, it's that easy.
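And a stripped-down sketch of what a Dockerfile like that might contain -- the base image and package list are assumptions for illustration, not the exact file from the demo:

```dockerfile
# Illustrative only -- not the exact Dockerfile from the demo.
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive

# Core tooling: Python plus the packages the notebook needs.
RUN apt-get update && apt-get install -y python3 python3-pip git curl
RUN pip3 install jupyter pandas matplotlib

# User management for Coder's non-root user.
RUN useradd coder --create-home --shell /bin/bash
USER coder

# PyCharm would be unpacked here; the DataSpell early-access preview
# would sit alongside it, commented out for the live demo.
```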

So we have a Dockerfile that specifies everything I need for my workspace, and a workspace template that specifies everything I need to run that container. Everything that's built from this template is going to use the exact same base. So onboarding a new data scientist is really just as easy as going, like you saw, and clicking "Open in Coder," and they get the exact same base as everyone else. So, pretty cool.

I think we've got -- let's see -- about nine minutes left here. So timing-wise, it's probably a good time for me to jump back to these wonderful slides, go over to my Q&A piece, and just leave this up here.

So what questions do you have for us? Or Ben, if you have questions for me, too. Feel free to reach out in the chat. I don't see anything just yet in there.

[Ben] Yeah, I'll -- go ahead.

[Thomas] I was just going to say: we're also, of course, going to be sending out these slides, which have some resource links on the next page down, and we'll send out this recording and all of that too. So you'll definitely be able to reach back out to us about all of that if you have questions later.

[Ben] Well, while some questions are coming in, I just wanted to mention that for a data scientist or a developer, it's really just that "Open in Coder" button, and then you have all the dependencies you need, such as Matplotlib, which is a requirement for your data science example. That's something that normally each person would have to install, and maybe the instructions differ across Mac, Windows, and Linux -- I always run into weirdness trying to install stuff like that -- so having that same environment and hardware is great. As well as when the project scales: maybe you're working with a larger data set or something like that. I know data science teams, when they're working locally, often work with a small subset of the data, where the large data set is in a data farm or something. With Coder you could quite literally just connect to that in the cloud. Or maybe your data set even lives in the same cloud as your Coder workspaces, so you can interact with that full data set with very low latency, just because of the resources that you have in the cloud. It's pretty exciting to see, and your presentation really helped me get those ideas together.

[Thomas] Yeah. I mean, again, there are literally data scientists still using Excel and everything, and dealing with those kinds of troubles. So, like you said, any local configuration where we can remove some of the puzzle pieces, I guess, and make things a little easier and more standardized is great. It's not uncommon in the data science world for someone to have written an experiment and landed on some model that they decided to go with.

Let's just use the Uber example -- I don't want to say that this is Uber, but let's use the Uber example. Maybe they have some model that worked really well for them for getting the next available driver to a person's location. Maybe that data scientist left or is on a different team now, so then someone else has to go in there and say, okay, this model was built like a year ago, and I need to figure out how they even set up their local environment to run this model so I can get like-for-like data -- and then, once I do, how can I improve this experiment and make my tweaks? So, as you said, there's a lot of setup process locally to even get to that point. Hopefully there's some documentation, if you're lucky, on how to set up the experiments again so you can start doing your own new experiments and making those tweaks. There's just a lot more involved with it. So having the standardized template that builds out this base image means that, in theory, a year from now, someone could build out this project, and if I haven't made any tweaks to it, it would be the exact same thing we just saw today. So, pretty cool.

[Ben] One other thing that I saw you mention, but wasn't sure I fully understood, was that you're doing this based off of a source-controlled repository. I know source control might often be something new to data science, so what would that experience be like in Coder?

[Thomas] Yeah, so the good thing there is Coder integrates directly with the three major Git providers: GitHub, GitLab, and Bitbucket. Whether it's on premises or in the cloud, we're tightly integrated with all of those, so your Coder administrator would set up the initial bit, you link your account, and then you're able to do all of your cloning and pushing and things like that too. So you don't really have to worry too much there.

The nice thing about some of the editors like PyCharm, VS Code, and stuff like that is they have an upload-to-VCS button that you can usually just click, and it will push to your repository. So you don't necessarily need to hop into the terminal and learn all the Git commands and stuff like that, which is pretty cool. So I think that would be great there. And then, if you're using a tool like Comet ML or MLflow, or other MLOps tooling like that for tracking experiments, you can still use the exact same tooling that you're using now. Those all integrate with the Git providers as well, so we would basically have that same integration, and it would just work.
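For instance, a minimal MLflow-style tracking snippet runs unchanged inside a Coder workspace -- the tracking server URL and logged values here are placeholders:

```python
import mlflow

# Hypothetical tracking server reachable from the workspace.
mlflow.set_tracking_uri("http://mlflow.internal:5000")

with mlflow.start_run(run_name="budget-vs-rating-baseline"):
    mlflow.log_param("model", "linear-baseline")
    mlflow.log_metric("r_squared", 0.42)  # placeholder metric value
```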

Yeah, so I do see a question that came in here from Fernando: does it work well for web development, for example, building a complex Django app? So yeah, actually, that was probably the original use case for Coder, I would say: taking our developer environment in general and moving it into the cloud, and having this repeatable environment. In last month's demo, I believe we talked about securing remote development -- I think that's on YouTube, right, Ben?

[Ben] Yes.

[Thomas] Yeah. So we have another 30-minute demo that we did last month that covers more of a web development use case. I happened to be doing a React app, I believe, in that one, but it could very well have been, you know, Django, Flask, or whatever -- it totally works for that. It's one of those things where you just specify in your Docker image what tools you need, do whatever pip installs and things like that you'd want for Django and its dependencies, and then you launch your workspace. We can actually expose a port internally to listen to the web server and see the front-end changes happen, have the whole back end set up as well -- you know, spin up a database and all that fun stuff.
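As a quick sketch of that last point: the main wrinkle for a web framework in a remote workspace is binding the dev server to all interfaces so the forwarded port can reach it -- for example, with Django:

```
# Bind to 0.0.0.0 rather than localhost so Coder's forwarded port can reach it.
python manage.py runserver 0.0.0.0:8000
```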

[Ben] Great. Yeah, I just linked that video in the chat. One final thing I wanted to call out is that we have the blog post -- you can go to the next slide, Thomas -- on Coder with data science. There's a link there as well. So if there's more information that you'd like, we have links there to the Dockerfile Thomas used, or one of the example Dockerfiles for data science, how it works across teams, stuff like that.