
Nvidia GPUs, Kubernetes and Tensorflow – the (not so) final AI Frontiers

So you want to use your infrastructure intelligently, don’t you?  And you want to use it cross-platform?  How about cross-data-center?  We all do.  I have been around the tech industry for over 20 years and have seen many paradigms come and go, along with a series of DevOps tools written in almost every language imaginable.  I used to write my own and keep them up to date (mostly in Bash and Python), and then came the age of Big Data, Apache, Netflix and Google, and with it came their tools.

Some of the best around are Mesos / Marathon / DCOS (Mesos being the Apache project) and Kubernetes (which originated at Google).  Although they are not mutually exclusive, and there is a case for running Kubernetes on DCOS, let’s hold off for a second or two and compare them apples to apples (the section that follows was borrowed from Platform9, as I do not want to rewrite what has already been written well there).

Overview of infrastructure orchestration options

Kubernetes (K8s)

Kubernetes was built by Google based on their experience running containers in production over the last decade.


The major components in a Kubernetes cluster are:

  • Pods – Kubernetes deploys and schedules containers in groups called pods. A pod will typically include 1 to 5 containers that collaborate to provide a service.

  • Flat Networking Space – The default network model in Kubernetes is flat and permits all pods to talk to each other. Containers in the same pod share an IP and can communicate using ports on the localhost address.

  • Labels – Labels are key-value pairs attached to objects and can be used to search and update multiple objects as a single set.

  • Services – Services are endpoints that can be addressed by name and can be connected to pods using label selectors. The service will automatically round-robin requests between the pods. Kubernetes will set up a DNS server for the cluster that watches for new services and allows them to be addressed by name.

  • Replication Controllers – Replication controllers are the way to instantiate pods in Kubernetes. They control and monitor the number of running pods for a service, improving fault tolerance.

Mesos and Marathon (DCOS)

Apache Mesos is an open-source cluster manager designed to scale to very large clusters, from hundreds to thousands of hosts. Mesos supports diverse kinds of workloads, such as Hadoop tasks and cloud-native applications. The architecture of Mesos is designed around high availability and resilience.

The major components in a Mesos cluster are:

  • Mesos Agent Nodes – Responsible for actually running tasks. All agents submit a list of their available resources to the master.

  • Mesos Master – The master is responsible for sending tasks to the agents. It maintains a list of available resources and makes “offers” of them to frameworks e.g. Hadoop. The master decides how many resources to offer based on an allocation strategy. There will typically be stand-by master instances to take over in case of a failure.

  • ZooKeeper – Used for leader elections and for looking up the address of the current master. Multiple instances of ZooKeeper are run to ensure availability and handle failures.

  • Frameworks – Frameworks co-ordinate with the master to schedule tasks onto agent nodes. Frameworks are composed of two parts:

    • the executor process runs on the agents and takes care of running the tasks and

    • the scheduler registers with the master and selects which resources to use based on offers from the master.

As you can likely guess from the title of this blog post, we went with Kubernetes for our needs, since we are (mostly) all more comfortable with Docker containers and wanted to use those anyway. Further, we didn’t require the built-in Hadoop’iness that comes with Mesos, so we decided against it (oh, and we don’t particularly love ZooKeeper, so we use it sparingly where possible). Let’s examine exactly what we did and how we did it.

*** Note: the next bit of this blog post will include a bit of code, in multiple languages ***

Deep(er) (Learning) of Kubernetes

The beauty of Kubernetes is its ease of deployment.  Quite literally, when you are ready to launch something on a cluster, it really can be just as easy as:

$ kubectl create -f some_config_file.yaml

where kubectl is the driver / command-line program for working with the Kubernetes API. Now let’s introduce some of the key concepts.

The Cluster

As the name may suggest, a cluster is a logical grouping of nodes.  Whether those nodes are physical or virtual machines is irrelevant, as Kubernetes can utilize either.  As a very simple example, see below:


A simple cluster includes the following components, each of which is explained below:

  • Pods

  • Containers

  • Label(s)

  • Replication Controllers

  • Service

  • Nodes

  • Kubernetes Master

Pods

Pods (green boxes) are scheduled to Nodes and contain a group of co-located Containers and Volumes. Containers in the same Pod share the same network namespace and can communicate with each other using localhost. Pods are considered to be ephemeral rather than durable entities.

How do you persist data across restarts?

Kubernetes supports the concept of Volumes so you can use a Volume type that is persistent.
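
To make this concrete, here is a minimal, hypothetical Pod manifest (the names, images and claim name are illustrative, not from our deployment) with two co-located containers that talk over localhost and a Volume backed by a PersistentVolumeClaim so the data survives Pod restarts:

apiVersion: v1
kind: Pod
metadata:
  name: web-with-cache            # illustrative name
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: web-data-claim   # assumes a PVC of this name already exists
  containers:
  - name: webui
    image: nginx:1.11             # illustrative image
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html   # pages served from the persistent Volume
  - name: cache
    image: redis:3.2              # reachable from 'webui' at localhost:6379
    ports:
    - containerPort: 6379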

Do I have to create each Pod individually?

Pods can be created manually and you can also use a Replication Controller to deploy multiple copies using a Pod template.

So how can I reliably reference my backend container from a frontend container?

Just use a Service as explained below.

Labels

Pods can have Labels. A Label is analogous to a user-defined attribute and is simply a key-value pair. For example, you might attach a ‘microservice’ and an ‘application’ label to the containers of your analytics Pods (i.e. microservice=analytics, application=webui) and different values to your view/display Pods (i.e. microservice=dashboard, application=webui). You can then use Selectors to select Pods with particular Labels and apply Services or Replication Controllers to them.
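
As a small sketch, the metadata block of one of those analytics Pods could carry the labels from the example above (the Pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: analytics-pod-1           # placeholder name
  labels:
    microservice: analytics
    application: webui
spec:
  containers:
  - name: analytics
    image: example/analytics:1.0  # placeholder image

A Selector such as (microservice=analytics, application=webui) would then match this Pod and any other Pod carrying the same two labels.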

Replication Controllers

Replication Controllers are the key concept that allows a distributed system to be created from Pod “replicas” (i.e. creating a Replication Controller for a Pod with 3 replicas will yield 3 Pods and continuously monitor them). The controller will ensure that there are always 3 replicas available and will replace any missing ones to maintain the total count.

Kubernetes Replication Controller (diagram)

If the Pod that died comes back, you now have 4 Pods; consequently, the Replication Controller will terminate one so the total count stays at 3. If you change the number of replicas to 5 on the fly, the Replication Controller will immediately start 2 new Pods so the total count is 5. You can also scale Pods down this way, a handy feature when performing rolling updates.

When creating a Replication Controller you need to specify two things (a minimal sketch follows this list):

  1. Pod Template: the template that will be used to create the Pod replicas.

  2. Labels: the labels for the Pods that this Replication Controller should monitor.
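
Here is a minimal sketch of such a Replication Controller, keeping 3 replicas of the hypothetical analytics Pod from the Labels example (image and port are placeholders):

apiVersion: v1
kind: ReplicationController
metadata:
  name: analytics-rc
spec:
  replicas: 3                     # the controller keeps exactly 3 Pods running
  selector:                       # labels of the Pods this controller monitors
    microservice: analytics
    application: webui
  template:                       # Pod template used to create the replicas
    metadata:
      labels:
        microservice: analytics
        application: webui
    spec:
      containers:
      - name: analytics
        image: example/analytics:1.0   # placeholder image
        ports:
        - containerPort: 8080          # placeholder port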

Services

A Service defines a set of Pods and a policy to access them. Services find their group of Pods using Labels. Imagine you have 2 analytics Pods and you define an analytics Service named ‘analytics-service’ with the label selector (microservice=analytics, application=webui). The analytics-service will facilitate two key things (a sketch of the Service manifest follows this list):

  • A cluster-local DNS entry will be created for the Service, so your display Pod only needs to do a DNS lookup for the hostname ‘analytics-service’; this resolves to a stable IP address that your display application can use.

  • So now your display has an IP address for analytics-service, but which of the 2 analytics Pods will it access? The Service provides transparent load balancing between the 2 analytics Pods and forwards the request to either of them (see below). This is done using a built-in proxy (kube-proxy) that runs on each Node.
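
A minimal sketch of that analytics-service might look like this (the port numbers are illustrative and assume the analytics containers listen on 8080):

apiVersion: v1
kind: Service
metadata:
  name: analytics-service
spec:
  selector:                       # requests are forwarded to Pods carrying both labels
    microservice: analytics
    application: webui
  ports:
  - port: 80                      # the stable port clients address
    targetPort: 8080              # the container port on the analytics Pods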


Kubernetes Service

This animated diagram illustrates the function of Services. Note that the diagram is overly simplified. The underlying networking and routing involved in achieving this transparent load balancing is relatively advanced if you are not into network configurations. Have a peek here if you are interested in a deep dive.


There is a special type of Kubernetes Service called ‘LoadBalancer’, which is used as an external load balancer to balance traffic between a number of Pods. This is handy for load balancing Web traffic, for example.
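
On a supported cloud provider, turning a Service into an external load balancer is, in principle, just a matter of setting its type; here is a hedged sketch (names and ports are again illustrative):

apiVersion: v1
kind: Service
metadata:
  name: webui-public
spec:
  type: LoadBalancer              # asks the cloud provider for an external load balancer
  selector:
    application: webui
  ports:
  - port: 80
    targetPort: 8080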

Nodes

A Node (the orange box) is a physical or virtual machine that acts as a Kubernetes worker (it used to be called a Minion). Each Node runs the following key Kubernetes components:

  • Kubelet: the primary node agent.

  • kube-proxy: used by Services to proxy connections to Pods as explained above.

  • Docker (or Rocket): the container technology that Kubernetes uses to create containers.

Kubernetes Master

The cluster has a Kubernetes Master (the box in purple). The Kubernetes Master provides a unified view into the cluster and has a number of components, such as the Kubernetes API Server. The API Server provides a REST endpoint that can be used to interact with the cluster. The Master also includes the Replication Controllers used to create and replicate Pods.

Kubernetes Dashboard

Finally, as an optional ease-of-deployment add-on, you can also install the Kubernetes Dashboard, available here. The Dashboard gives you all the power you have with kubectl while letting you control the entire cluster from a GUI.

Code Snippets, Deployment and GPUs

So what else did we need?  Well, DeepLearni.ng is, by its nature, a machine learning / neural network / predictive insight company that deploys anywhere (any data centre, cloud or bare metal, or a combination thereof) using our AnyStax(TM) technology.  As you may have guessed, much of this technology is born of the use of containers and orchestration software such as Kubernetes and Mesos; however, this is not the place or time to speak about it in depth.

The real problem remained: Mesos had no issues using GPUs, but Kubernetes did not yet support them.

So what to do?

Dive into the code, GitHub, issues, pull requests, etc… and make sure that it all works super easily.

But what was missing?

First and foremost, Kubernetes still does not officially support GPUs, so this is a bit of a hack and an early pull of a bunch of experimental code.  It works fine, but I would warn against using it in production unless you really understand what it’s doing.  Here is what we did…

Deploy (anywhere) using Ansible

Yes, yet another technology.  Most of you have likely heard of Ansible and if you have not — well, get out from underneath your rock — just kidding.  Here is a light primer.

Ansible

Ansible is one of several configuration management tools, alongside Puppet, Chef and Salt.  So how does Ansible differ?  Well, shell scripts have ruled the world of server configuration for quite some time, whereas Ansible uses SSH and Python modules to achieve what the others do. The benefits of Ansible compared to those tools are listed below, followed by a small example playbook:

  • no need for agent software on the configured server;

  • very easy to learn and work with;

  • written in Python 2, which is installed on almost every Linux server by default.
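
To give a feel for the syntax, here is a tiny, hypothetical playbook (the inventory group, package name and file paths are placeholders, not part of our actual setup) that installs a package and pushes a config file over SSH:

---
- hosts: gpu-nodes                # placeholder inventory group
  become: true                    # escalate privileges with sudo on the remote host
  tasks:
    - name: Install the NVIDIA driver package
      apt:
        name: nvidia-367          # placeholder package name
        state: present

    - name: Push a configuration file to the node
      copy:
        src: files/example.conf   # placeholder local file
        dest: /etc/example/example.conf
        mode: "0644"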

In the end, what we used was a hodge-podge of Ansible, Kubernetes, Docker and a ton of secret sauce.  However, the part that will interest you most is what we did to make it all work with GPUs, right?

Journey out of the CENTR(al processing unit)

To the GPU and beyond!

The state of GPU support in Kubernetes can be found here. There is definitely a proposal, but uptake has been slow, and that just didn’t meet our needs.  What we didn’t bother with (since we’ve done it umpteen times) is instructing you on how to install CUDA and ensure that the OS is ready for you to use the GPUs.  There are several good tutorials on that, so just Google them; you’ll find what you need.

So, by default, K8s will not activate GPUs when starting the API server and the Kubelet on the workers; we need to do that manually.  If you are using Ansible, you’ll have to alter your startup scripts (which you’ll push to your nodes/containers) to do the following (assuming you are strictly on Ubuntu Xenial – 16.04 LTS)…

On the master

On the master node, update /etc/systemd/system/kubelet.service.d/XX-kubeadm.conf to add:

# Allow some security stuff
Environment="KUBELET_SYSTEM_PODS_ARGS=--allow-privileged=true"

# GPU Stuff
Environment="KUBELET_GPU_ARGS=--experimental-nvidia-gpus=1"

Then restart the kubelet service via

$ sudo systemctl restart kubelet

So now the Kube API will accept requests to run privileged containers, which are required for GPU workloads.

Worker nodes

On every worker, update /etc/systemd/system/kubelet.service.d/XX-kubeadm.conf in the same way to add the GPU flag, so it looks like:

# Allow some security stuff
Environment="KUBELET_SYSTEM_PODS_ARGS=--allow-privileged=true"

# GPU Stuff
Environment="KUBELET_GPU_ARGS=--experimental-nvidia-gpus=1"

Then restart the service via

$ sudo systemctl restart kubelet
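
If you are pushing these changes with Ansible, as mentioned earlier, the drop-in and the kubelet restart can be expressed roughly as follows. This is only a sketch under assumptions: the inventory group, template name and handler wiring are placeholders, and the template is assumed to contain the Environment= lines shown above.

---
- hosts: kube-nodes               # placeholder group covering master and workers
  become: true
  tasks:
    - name: Install the kubelet drop-in with privileged and GPU flags
      template:
        src: templates/XX-kubeadm.conf.j2    # placeholder template with the Environment= lines
        dest: /etc/systemd/system/kubelet.service.d/XX-kubeadm.conf
      notify: restart kubelet

  handlers:
    - name: restart kubelet
      systemd:
        name: kubelet
        state: restarted
        daemon_reload: true       # pick up the changed unit drop-in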

Testing the setup

Now that we have CUDA GPUs enabled in k8s, let us test that everything works. We take a very simple job that will just run nvidia-smi from a pod and exit on success.

The job definition is

---
apiVersion: v1
kind: Pod
metadata:
  name: tf-0
spec:
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.57
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:0.11.0rc0-gpu
    ports:
    - containerPort: 8888
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia/
      readOnly: true
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-1
spec:
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.57
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:0.11.0rc0-gpu
    ports:
    - containerPort: 8888
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia/
      readOnly: true

What is interesting here:

  • We do not have the abstraction provided by nvidia-docker, so we have to manually specify the mount points for the char devices

  • We also need to share the drivers and libs folders

  • In the resources, we have to both request and limit the resources with 1 GPU

  • The container has to run privileged (a sketch of the stanza is shown after this list)
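
The manifest above does not show the privileged flag explicitly; if your setup requires it, it is set per container via a securityContext stanza, roughly like this (a sketch, to be merged under each container entry):

    securityContext:
      privileged: true            # allows access to host devices such as /dev/nvidia0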

Now if we run this:

$ kubectl create -f gpu-test.yaml
$ # Wait for a few seconds so the cluster can download and run the container
$ kubectl get pods -a -o wide

NAME   READY   STATUS      RESTARTS   AGE   IP          NODE
...
tf-0   0/1     Completed   0          5m    10.1.14.3   gpu-node01
tf-1   0/1     Pending     0          5m    10.1.14.3   gpu-node01

We see that tf-0 has run and completed (tf-1 is still pending). Let us see the output of the run:

$ kubectl logs tf-0

Perfect, we have the same result as if we had run nvidia-smi from the host, which means we are all good to operate GPUs!

Finally

Obviously, the war is not over; this is merely one battle.  Next we will want to enable more GPUs per machine, which means that the hacky way in which this has been coded in Go needs to change from a hardcoded model to a discovery model:

// makeDevices determines the devices for the given container.
// Experimental. For now, we hardcode /dev/nvidia0 no matter what the user asks for
// (we only support one device per node).
// TODO: add support for more than 1 GPU after #28216.
func makeDevices(container *v1.Container) []kubecontainer.DeviceInfo {
    nvidiaGPULimit := container.Resources.Limits.NvidiaGPU()
    if nvidiaGPULimit.Value() != 0 {
        return []kubecontainer.DeviceInfo{
            {PathOnHost: "/dev/nvidia0", PathInContainer: "/dev/nvidia0", Permissions: "mrw"},
            {PathOnHost: "/dev/nvidiactl", PathInContainer: "/dev/nvidiactl", Permissions: "mrw"},
            {PathOnHost: "/dev/nvidia-uvm", PathInContainer: "/dev/nvidia-uvm", Permissions: "mrw"},
        }
    }

    return nil
}

We will likely be contributing code to this, although it’s well underway.  Stay tuned!


Overview of DeepLearni.ng

Who we are
A blend of ML experts, developers and industry experts

  • Team of 10+ people, including 3 PhDs and a NEXT alumnus, with several years of deep learning experience.
  • Strong leadership team including a former CEO of a national bank, McKinsey alumni, and seasoned entrepreneurs.
  • Enterprise Big Data expertise, including our CTO, Canada’s only two-time DataStax MVP and a co-founder of Data for Good, a national Big Data and Predictive Insights not-for-profit.

What we do
Inject ML capabilities into large enterprises and democratize AI

  • Build custom AIs and deep learning models (e.g. propensity models, offer engines).  
  • Use our platform technology to rapidly build, test and deploy AI into production.
  • Empower your current staff to make these builds as rapid as possible based on their domain expertise.
  • Deployments and security managed by our enterprise-class, deploy-anywhere AnyStax(TM) technology