
How to build Anaconda Python Data Science Docker container

In this article, we’ll build a Docker container for a Machine Learning (ML) development environment. This image is useful if you’re developing ML models or need a pre-configured Jupyter notebook with some of the most useful libraries.

Recently we published the article Quick And Simple Introduction to Kubernetes Helm Charts in 10 minutes, where you can find instructions on how to use Helm to deploy this container to your Kubernetes cluster.

Update for 2020

  • Upgraded to Python 3.6.
  • Fixed lots of build issues.

Last time we created a Docker container with Jupyter, Keras, Tensorflow, Pandas, Sklearn, and Matplotlib. Then I realized that I had missed OpenCV for image and video manipulation in that Docker image. Well, I spent the whole day preparing a new image build. And in this article, I’ll show you how to create Docker containers much faster using Anaconda’s official Docker Image.

There are two ways to do that.

Simple way

Before you start, make sure you have Docker installed so you can follow along. This approach takes ~7 minutes to build an image of ~3.11 GB.
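If you’re not sure whether Docker is installed correctly, a quick check like this confirms that both the CLI and the daemon are available (the exact version output will differ on your machine):

# verify the Docker CLI is installed
docker --version

# verify the Docker daemon is running and reachable
docker info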

Anaconda way

When I set up my ML workspace back in 2018, Anaconda was the fastest and easiest way to create Docker containers for ML experiments, since it ships an official Docker image. It was much faster than compiling OpenCV 3 for Ubuntu 16.04 or any other operating system. Today it’s the other way around.

I’m using the same sources but changing the Docker commands in the Dockerfile (no file extension). This file is used to build the Docker image (the blueprint for creating containers).

Here’s how it looks:

FROM continuumio/anaconda3
MAINTAINER "Andrei Maksimov"
# libgtk2.0 is required for OpenCV's GUI (highgui) support
RUN apt-get update && apt-get install -y libgtk2.0-dev && \
    rm -rf /var/lib/apt/lists/*
# Install Python 3.6 and the data science stack with conda, plus TensorFlow via pip
RUN /opt/conda/bin/conda update -n base -c defaults conda && \
    /opt/conda/bin/conda install python=3.6 && \
    /opt/conda/bin/conda install anaconda-client && \
    /opt/conda/bin/conda install jupyter -y && \
    /opt/conda/bin/conda install --channel https://conda.anaconda.org/menpo opencv3 -y && \
    /opt/conda/bin/conda install numpy pandas scikit-learn matplotlib seaborn pyyaml h5py keras -y && \
    /opt/conda/bin/conda upgrade dask && \
    pip install tensorflow imutils
RUN ["mkdir", "notebooks"]
COPY conf/.jupyter /root/.jupyter
COPY run_jupyter.sh /
# Jupyter and Tensorboard ports
EXPOSE 8888 6006
# Store notebooks in this mounted directory
VOLUME /notebooks
CMD ["/run_jupyter.sh"]

As you can see, the Dockerfile installs only libgtk2.0 with apt for OpenCV support; all the other components like Pandas, Scikit-learn, Matplotlib, Seaborn, and Keras are installed with the Conda package manager, and TensorFlow comes from pip.
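To turn this Dockerfile into an image, run docker build from the directory containing it. A minimal example follows; the tag python_data_science_container:anaconda matches the run command used below, so adjust it if you prefer a different name:

# build the image from the Dockerfile in the current directory
docker build -t python_data_science_container:anaconda .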

Running the container

Now that the data science Docker image is built, it’s time to start a container from it. First, create a folder inside your project’s directory where we’ll store all the Jupyter notebooks with our projects’ source code:

mkdir notebooks

And start the container with the following run command:

docker run -it -p 8888:8888 -p 6006:6006 \
    -d -v $(pwd)/notebooks:/notebooks \
    python_data_science_container:anaconda

The run command starts the container and exposes Jupyter on port 8888 and TensorBoard on port 6006 on your local system or server, depending on where you executed the command. Once the image is built, you can push it to Docker Hub whenever you like.
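A push might look like the sketch below, assuming you’re logged in to Docker Hub; <your-dockerhub-user> is a placeholder for your own account name:

# tag the local image with your Docker Hub repository name
docker tag python_data_science_container:anaconda <your-dockerhub-user>/python_data_science:anaconda

# authenticate and push the tagged image
docker login
docker push <your-dockerhub-user>/python_data_science:anaconda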

If you don’t want to create and maintain your own container, feel free to use mine:

docker run -it -p 8888:8888 -p 6006:6006 -d -v \
    $(pwd)/notebooks:/notebooks amaksimov/python_data_science:anaconda
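Because the container runs detached (-d), Jupyter’s startup output isn’t printed to your terminal. If your Jupyter configuration requires an access token, you can recover it from the container logs; use docker ps to find the container’s ID or name first:

# list running containers to find the ID or name
docker ps

# print the Jupyter startup logs (including the access token, if enabled)
docker logs <container-id>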

Installing Additional Packages

Once you’ve launched Jupyter, you may find that some packages are missing, and that’s OK. Run a command like the following in a cell of your Jupyter notebook to install what you need:

!pip install requests

Or, for conda:

!conda install scipy
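Keep in mind that packages installed this way live only inside the running container and will be gone once the container is recreated. To make them permanent, one option is to add them to the Dockerfile and rebuild the image; scipy and requests below are just examples:

# bake extra packages into the image so they survive container re-creation
RUN /opt/conda/bin/conda install scipy -y && \
    pip install requests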

I hope this article was helpful for you. If so, please like or repost it. See you soon!

Summary

Using Anaconda as a base image makes your data science projects heavy. I mean REALLY heavy.

For example:

docker images
REPOSITORY                          TAG                 IMAGE ID            CREATED             SIZE
amaksimov/python_data_science       anaconda            7021f28dfba1        29 minutes ago      6.36GB
amaksimov/python_data_science       latest              3330c8eaec1c        2 hours ago         3.11GB

Installing all the components inside the Ubuntu 20.04 LTS container image, including OpenCV 3, takes ~7 minutes, and the final image is ~3.11 GB.

At the same time, the Anaconda3 image build takes about twice as long and gives you an image roughly twice as big (~6.36 GB). The build process is much more complicated than it was in 2018, but Docker still makes the job a lot easier, although it took me a while to bring the configuration back to a working state.

We hope you found this article helpful. If so, please help us spread it to the world!
