Personal Blog

Nvidia GPUs, Kubernetes and Tensorflow – the (not so) final AI Frontiers

So you want to intelligently use your infrastructure, don’t you?  And you want to use it cross-platform?  How about cross data-center?  We all do.  I have been around the tech industry for over 20 years and have seen many paradigms come and go as well as a series of devops tools written in almost every language imaginable.  I used to personally write my own and keep it up to date (mostly in bash and python) and then came the age of Big Data, Apache, Netflix and Google, and so did their tools.

Some of the best around are Mesos / Marathon / DC/OS (built on the Apache Mesos project) and Kubernetes (a Google project).  Although they are not mutually exclusive, and there is a case for running Kubernetes on DC/OS, let’s hold off for a second or two and compare them apples to apples (the section that follows was borrowed from Platform9, as I do not want to rewrite what has already been well written there).

Overview of infrastructure orchestration options

Kubernetes (K8s)

Kubernetes was built by Google based on their experience running containers in production over the last decade. See below for a Kubernetes architecture diagram and the following explanation.

The major components in a Kubernetes cluster are:

  • Pods – Kubernetes deploys and schedules containers in groups called pods. A pod will typically include 1 to 5 containers that collaborate to provide a service.

  • Flat Networking Space – The default network model in Kubernetes is flat and permits all pods to talk to each other. Containers in the same pod share an IP and can communicate using ports on the localhost address.

  • Labels – Labels are key-value pairs attached to objects and can be used to search and update multiple objects as a single set.

  • Services – Services are endpoints that can be addressed by name and can be connected to pods using label selectors. The service will automatically round-robin requests between the pods. Kubernetes will set up a DNS server for the cluster that watches for new services and allows them to be addressed by name.

  • Replication Controllers – Replication controllers are the way to instantiate pods in Kubernetes. They control and monitor the number of running pods for a service, improving fault tolerance.

Mesos and Marathon (DCOS)

Apache Mesos is an open-source cluster manager designed to scale to very large clusters, from hundreds to thousands of hosts. Mesos supports diverse kinds of workloads such as Hadoop tasks, cloud native applications, etc. The architecture of Mesos is designed around high availability and resilience.

The major components in a Mesos cluster are:

  • Mesos Agent Nodes – Responsible for actually running tasks. All agents submit a list of their available resources to the master.

  • Mesos Master – The master is responsible for sending tasks to the agents. It maintains a list of available resources and makes “offers” of them to frameworks e.g. Hadoop. The master decides how many resources to offer based on an allocation strategy. There will typically be stand-by master instances to take over in case of a failure.

  • ZooKeeper – Used in elections and for looking up the address of the current master. Multiple instances of ZooKeeper are run to ensure availability and handle failures.

  • Frameworks – Frameworks co-ordinate with the master to schedule tasks onto agent nodes. Frameworks are composed of two parts:

    • the executor process, which runs on the agents and takes care of running the tasks, and

    • the scheduler, which registers with the master and selects which resources to use based on offers from the master.

As you can likely guess from the title of this blog post, we obviously went with Kubernetes for our needs, since we are (mostly) all more comfortable with Docker containers and wanted to use those anyway. Further, we didn’t require the built-in Hadoop’iness that is provided with Mesos, so we decided against it (oh, and we don’t particularly love Zookeeper, so we use it sparingly where possible). Let’s examine exactly what we did and how we did it.

*** Note: the next bit of this blog post will include a bit of code, in multiple languages ***

Deep(er) (Learning) of Kubernetes

The beauty of Kubernetes is its ease of deployment.  Quite literally, when you are ready to launch a cluster, it really can be just as easy as:

$ kubectl create -f some_config_file.yaml

where kubectl is the driver / command-line program for working with the Kubernetes API. Now let’s introduce some of the key concepts.
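
As a purely illustrative example of such a config file, a minimal single-container Pod manifest might look like the sketch below; the names and image are placeholders, not from our actual deployment:

```yaml
# some_config_file.yaml (illustrative sketch)
apiVersion: v1
kind: Pod
metadata:
  name: hello-pod          # placeholder name
  labels:
    app: hello             # Labels are covered below
spec:
  containers:
  - name: web
    image: nginx:1.11      # any container image will do
    ports:
    - containerPort: 80
```

Running kubectl create -f against a file like this asks the master to schedule one Pod onto a node in the cluster.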

The Cluster

As the name may suggest, a cluster is a logical grouping of nodes.  Whether that group of nodes consists of physical or virtual machines is irrelevant, as Kubernetes can utilize either.  As a very simple example, see below:


Within the diagram you’ll notice the following components, which are further explained below:

  • Pods

  • Containers

  • Label(s)

  • Replication Controllers

  • Service

  • Nodes

  • Kubernetes Master


Pods

Pods (green boxes) are scheduled to Nodes and contain a group of co-located Containers and Volumes. Containers in the same Pod share the same network namespace and can communicate with each other using localhost. Pods are considered to be ephemeral rather than durable entities.

How do you persist data across restarts?

Kubernetes supports the concept of Volumes so you can use a Volume type that is persistent.

Do I have to create each Pod individually?

Pods can be created manually and you can also use a Replication Controller to deploy multiple copies using a Pod template.

So how can I reliably reference my backend container from a frontend container?

Just use a Service as explained below.


Labels

Pods can have Labels. A Label is analogous to a user-defined attribute and is simply a key-value pair. For example, you might attach the labels microservice=analytics, application=webui to your analytics cluster of Pods, and microservice=dashboard, application=webui to your view/display Pods. You can then use Selectors to select Pods with particular Labels and apply Services or Replication Controllers to them.

Replication Controllers

Replication Controllers are the key concept that allows a distributed system to be created as Pod “replicas” (i.e. creating a Replication Controller for a Pod with 3 replicas will yield 3 Pods and continuously monitor them). It will ensure that there are always 3 replicas available and will replace any missing ones to maintain the total count.

Kubernetes Replication Controller


If the Pod that died comes back, you briefly have 4 Pods; consequently the Replication Controller will terminate one so the total count is 3. If you change the number of replicas to 5 on the fly, the Replication Controller will immediately start 2 new Pods so the total count is 5. You can also scale down Pods this way, a handy feature when performing rolling updates.

When creating a Replication Controller you need to specify two things:

  1. Pod Template: the template that will be used to create the Pod replicas.

  2. Labels: the labels for the Pods that this Replication Controller should monitor.
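
Putting those two pieces together, here is a hedged sketch of a Replication Controller manifest; the names, labels and image are illustrative only:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: analytics-rc             # hypothetical name
spec:
  replicas: 3                    # desired number of Pod replicas
  selector:
    microservice: analytics      # 2. the Labels this controller monitors
  template:                      # 1. the Pod template used to create replicas
    metadata:
      labels:
        microservice: analytics
        application: webui
    spec:
      containers:
      - name: analytics
        image: example/analytics:1.0   # placeholder image
```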


Services

A Service defines a set of Pods and a policy to access them. Services find their group of Pods using Labels. Imagine you have 2 analytics Pods and you define an analytics Service named ‘analytics-service’ with the label selector (microservice=analytics, application=webui). The analytics-service Service will facilitate two key things:

  • A cluster-local DNS entry will be created for the Service, so your display Pods only need to do a DNS lookup for the hostname ‘analytics-service’. This will resolve to a stable IP address that your display application can use.

  • So now your display has an IP address for analytics-service, but which one of the 2 analytics Pods will it access? The Service will provide transparent load balancing between the 2 analytics Pods and forward each request to one of them (see below). This is done by a built-in proxy (kube-proxy) that runs on each Node.
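
Both behaviours fall out of a single, small Service manifest. A hedged sketch, reusing the example labels (the port numbers are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: analytics-service    # becomes the cluster-local DNS name
spec:
  selector:                  # round-robins across Pods carrying these labels
    microservice: analytics
    application: webui
  ports:
  - port: 80                 # stable port on the Service IP
    targetPort: 8080         # placeholder container port
```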

More technical details here

Kubernetes Service

This animated diagram illustrates the function of Services. Note that it is overly simplified. The underlying networking and routing involved in achieving this transparent load balancing is relatively advanced if you are not into network configurations. Have a peek here if you are interested in a deep dive.


There is a special type of Kubernetes Service called ‘LoadBalancer’, which is used as an external load balancer to balance traffic between a number of Pods. Handy for load balancing Web traffic, for example.
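
A hedged sketch of such a Service (on a supported cloud provider this provisions an external load balancer; the name and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webui-lb             # hypothetical name
spec:
  type: LoadBalancer         # ask the cloud for an external load balancer
  selector:
    application: webui
  ports:
  - port: 80
    targetPort: 8080         # placeholder container port
```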


Nodes

A Node (the orange box) is a physical or virtual machine that acts as a Kubernetes worker; it used to be called a Minion. Each Node runs the following key Kubernetes components:

  • Kubelet: the primary node agent.

  • kube-proxy: used by Services to proxy connections to Pods, as explained above.

  • Docker (or rkt): the container technology that Kubernetes uses to create containers.

Kubernetes Master

The cluster has a Kubernetes Master (the box in purple). The Kubernetes Master provides a unified view into the cluster and has a number of components, such as the Kubernetes API Server. The API Server provides a REST endpoint that can be used to interact with the cluster. The master also includes the Replication Controllers used to create and replicate Pods.

Kubernetes Dashboard

Finally, as an optional ease-of-deployment add-on, you can also install the Kubernetes Dashboard, available here. The dashboard gives you all the power you have with kubectl, while letting you control the entire cluster from a GUI.

Code Snippets, Deployment and GPUs

So what else did we need?  Well, DeepLearni.ng is, by its nature, a machine learning / neural network / predictive insight company that deploys anywhere (any data centre, cloud or bare metal, or combination thereof) using our AnyStax™ technology.  As you may have guessed, a lot of this technology is borne of the use of containers and orchestration software such as Kubernetes and Mesos; however, this is not the place or time to speak in depth about it.

The real problem remained: Mesos had no issues using GPUs, but Kubernetes did not yet support them.

So what to do?

Dive into the code, GitHub, issues, pull requests, etc., and make sure that it all works super easily.

But what was missing?

First and foremost, Kubernetes still does not officially support GPUs, so this is a bit of a hack and an early pull of a bunch of experimental code.  It works fine, but I would warn you against using it in production unless you really understand what it’s doing.  Here is what we did…

Deploy (anywhere) using Ansible

Yes, yet another technology.  Most of you have likely heard of Ansible, and if you have not, well, get out from underneath your rock (just kidding).  Here is a light primer.


Ansible is one of several configuration management tools, alongside Puppet, Chef and Salt.  So how does Ansible differ?  Well, shell scripts have ruled the world of server configuration for quite some time, whereas Ansible uses ssh and Python (modules) to achieve what the others do. The benefits of Ansible, compared to the tools mentioned, are:

  • no need for agent software on the configured server;

  • very easy to learn and work with;

  • written in Python 2 which is installed on almost every Linux server by default.
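
To make that concrete, here is a minimal, hypothetical playbook; the inventory group and package names are assumptions, not our real configuration:

```yaml
# site.yml -- sketch: prepare Ubuntu nodes for Kubernetes over plain ssh
- hosts: k8s-nodes              # hypothetical inventory group
  become: yes
  tasks:
    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes
    - name: Ensure kubelet is running and enabled
      service:
        name: kubelet
        state: started
        enabled: yes
```

No agent is needed on the target machines; Ansible connects over ssh and pushes small Python modules to run each task.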

In the end, what we ended up using was a hodge-podge of Ansible, Kubernetes, Docker and a ton of secret sauce.  However, the part that will interest you most is what we did to make it all work with GPUs, right?

Journey out of the CENTR(al processing unit)

To the GPU and beyond!

The state of GPU support in Kubernetes can be found here. There is definitely a proposal, but it has been slow on the uptake, and that just didn’t meet our needs.  What we won’t bother with (since we’ve done it umpteen times) is instructing you on how to install CUDA and ensure that the OS is ready for you to use the GPUs.  There are several good tutorials on that, so just Google for them.

So, by default, K8s will not activate GPUs when starting the API server and the Kubelet on workers; we need to do that manually.  If you are using Ansible, you’ll have to alter your startup scripts (which you’ll push to your nodes/containers) to do the following (assuming you are strictly on Ubuntu Xenial, 16.04 LTS)…

On the master

On the master node, update /etc/systemd/system/kubelet.service.d/XX-kubeadm.conf to add:

# Allow some security stuff


# GPU Stuff

Then restart the API service via

$ sudo systemctl restart kubelet

So now the Kube API will accept requests to run privileged containers, which are required for GPU workloads.

Worker nodes

On every worker, update /etc/systemd/system/kubelet.service.d/XX-kubeadm.conf to add the GPU tag, so it looks like:

# Allow some security stuff


# GPU Stuff
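
As on the master, a hedged sketch of the worker drop-in, with the same caveat about era-specific flag names being an assumption:

```ini
[Service]
# Allow some security stuff / GPU Stuff: same flags as on the master,
# so the kubelet advertises the node's GPU to the scheduler
Environment="KUBELET_EXTRA_ARGS=--allow-privileged=true --feature-gates=Accelerators=true"
```

Again, run sudo systemctl daemon-reload before the restart so the change takes effect.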

Then restart the service via

$ sudo systemctl restart kubelet

Testing the setup

Now that we have CUDA GPUs enabled in k8s, let us test that everything works. We take a very simple job that will just run nvidia-smi from a pod and exit on success.

The job definition is:

apiVersion: v1
kind: Pod
metadata:
  name: tf-0
spec:
  restartPolicy: Never            # run once, exit on success
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.57
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:0.11.0rc0-gpu
    command: ["nvidia-smi"]       # just print the GPU status and exit
    securityContext:
      privileged: true
    ports:
    - containerPort: 8888
    resources:
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia/
      readOnly: true
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-1
spec:
  restartPolicy: Never
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.57
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:0.11.0rc0-gpu
    command: ["nvidia-smi"]
    securityContext:
      privileged: true
    ports:
    - containerPort: 8888
    resources:
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia/
      readOnly: true

What is interesting here:

  • We do not have the abstraction provided by nvidia-docker, so we have to manually specify the mount points for the char devices.

  • We also need to share the driver and library folders.

  • In the resources, we have to both request and limit the resources at 1 GPU.

  • The container has to run privileged.

Now if we run this:

$ kubectl create -f gpu-test.yaml
$ # Wait for a few seconds so the cluster can download and run the container
$ kubectl get pods -a -o wide


tf-0   0/1   Completed   0   5m   gpu-node01
tf-1   0/1   Pending     0   5m   gpu-node01

We see that tf-0 has run and completed, while tf-1 is still Pending. Let us see the output of the run:

$ kubectl logs tf-0

Perfect, we have the same result as if we had run nvidia-smi from the host, which means we are all good to operate GPUs!


Obviously, the war is not over; this is merely one battle.  Next we will want to enable more GPUs per machine, which means that the hack’y way in which this has been coded in Go needs to change to a discovery model rather than a hardcoded one:

// makeDevices determines the devices for the given container.
// Experimental. For now, we hardcode /dev/nvidia0 no matter what the user asks for
// (we only support one device per node).
// TODO: add support for more than 1 GPU after #28216.
func makeDevices(container *v1.Container) []kubecontainer.DeviceInfo {
	nvidiaGPULimit := container.Resources.Limits.NvidiaGPU()
	if nvidiaGPULimit.Value() != 0 {
		return []kubecontainer.DeviceInfo{
			{PathOnHost: "/dev/nvidia0", PathInContainer: "/dev/nvidia0", Permissions: "mrw"},
			{PathOnHost: "/dev/nvidiactl", PathInContainer: "/dev/nvidiactl", Permissions: "mrw"},
			{PathOnHost: "/dev/nvidia-uvm", PathInContainer: "/dev/nvidia-uvm", Permissions: "mrw"},
		}
	}
	return nil
}

We will likely be contributing code to this, although it’s well underway.  Stay tuned!

Overview of DeepLearni.ng

Who we are
Blend of ML experts, developers and industry experts

  • Team of 10+ people, including 3 PhDs and a NEXT alumnus, with several years of deep learning experience.
  • Strong leadership team, including a former CEO of a national bank, McKinsey alumni, and seasoned entrepreneurs.
  • Enterprise Big Data expertise, including our CTO, Canada’s only 2-time DataStax MVP and co-founder of Data for Good, a national Big Data and Predictive Insights not-for-profit.

What we do
Inject ML capabilities into large enterprises, and democratize AI

  • Build custom AIs and deep learning models (e.g. propensity models, offer engines).  
  • Use our platform technology to rapidly build, test and deploy AI into production.
  • Empower your current staff to make these builds as rapid as possible based on their domain expertise.
  • Deployments and security managed by our enterprise-class, deploy-anywhere AnyStax™ technology.

Viafoura has been trending toward a heightened awareness of its team members, their needs, goals and career growth.  All of this is wonderful and much of the groundwork required is borne of lessons learned while at other companies and organizations that did some things right, but most things wrong.

I don’t want to start off by blaming those organizations; rather, much of the blame goes to the lack of training, foresight and knowledge that many leaders possess. I really am not very different, in that I have many shortcomings; however, one thing I’m good at is realizing when something has gone awry and that it should be fixed.  This has led to a pleasant experience at Viafoura for our Engineering teams and an overall team satisfaction that is VERY respectable.

How did we get here then?  What are some of the methods we use that differentiate us from other companies? Why are people happy?

All great questions indeed, let’s explore them a little.

How did we get here?

I was tasked with taking over Viafoura Engineering and growing the team from 4 to its current size.  There have been stumbling blocks in hiring, managing and delivering on deadlines, but we are in a pretty good place these days.  So what were some of the mistakes I (and we) made?


Prioritization

We have had issues with prioritization in the past.  This led to some of our team members, the brightest of the bright, being upset about not being able to “do the right thing” (in Engineering terms) and/or not having the time to think about these things.

How did we solve it?

It’s taken some time, but we have attacked this problem by making it increasingly difficult to NOT do the right thing. It’s become ingrained in our philosophy and culture that taking the time to do the right thing is crucially important.  Further, we’ve instituted weekly architecture meetings and tea parties for the frontend team that allow us all to meet and discuss ideas, issues and possible improvements that are available.

We often find these things not only through our own knowledge of the platform, but also through exploratory looks at what other companies are doing, what technologies are gaining adoption, and by challenging all the paradigms we assume to be correct.  After all, we’ve gone from a standard LAMP (well, LEMP) stack to a very well-tuned event-based architecture (for which I will have a write-up in a few weeks).

Technical debt

This one is related to the above.  Not only do priorities matter, but once they are out of whack, you start incurring technical debt.  This isn’t the kind of debt that will cause the bank to foreclose on your home; however, like a financial debt, technical debt incurs interest payments. These take the form of all the extra effort we have to put into future development because of quick and dirty design choices. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into a better one. Although it costs to pay down the principal, we gain through reduced interest payments in the future.

How did we solve it?

This issue made us totally re-think how we were doing things at Viafoura.  We knew that we would eventually end up with a system that would fall flat on its face and no longer be able to handle the requests we threw at it (for example, one story today caused a spike of well over 1,500 requests/second, which we handled without batting an eyelash).

We took the opportunity to re-factor absolutely everything about our system, slightly delaying product delivery and small features that had been requested. Yes, that does suck for our clients, however they’ll be so much better served going forward and we’ll be able to iterate against our series of micro-services (read Service Oriented Architecture) that they’ll (eventually) be happy we went down that road.  (Well, at least that’s the hope – right?)

Lack of expectations

One of our major issues was that no one had set out expectations for projects, people, product, etc… This year at Viafoura was the year of Accountability. All our departments were tasked with figuring out ways in which we would measure our success and minimize our failures.  During the year, we also hit large growth spurts, which led us to rethink much of what we knew and reassess all the things we didn’t know (including those we didn’t know that we didn’t know).

How did we solve it?

This took some work!  It has been ongoing and will never cease.  We attacked it from many angles to set expectations across the board and to communicate those to our team(s). The first change we made was to set expectations at a departmental level.  From there, we further reduced these to goals and objectives that led to alignment on the part of everyone within those departments.

Everyone was now on the hook for something, but they still didn’t quite understand how they were being measured.

I then set up a performance review that runs quarterly, measuring all kinds of aspects of our team beyond just the typical: completes their work, plays well with others, does not wet the carpet!

We work with all our team members to figure out what matters to them and how that will align well toward their career growth, personal growth and Viafoura’s growth and bottom line.  We then have them set some stretch goals that will take longer than one quarter to fulfill and measure their progress against it quarterly.  Once these have been achieved, we add more to it.  It’s really an ever evolving process.

We have a great range of talent, which lends itself well to cross-team, cross-functional, multi-disciplinary work.  As such, no one is actually bound to any one job title or description.  This is yet another one of those ever evolving, living documents that we take a look at with our team during review periods.

With all that being set, we now have the groundwork for expectations at Viafoura, which has led to better timelines, better delivery of work and happier people.

The lessons learned:





What’s in the Viafoura stack anyway?

Viafoura prides itself on its engineering practice and as such is happy to constantly share back to the community.  Much of our work can be found on github and I personally am an avid speaker, presenter and teacher of everything I know.


  • Java
  • Node.JS
  • JavaScript
  • PHP
  • Python
  • Erlang
  • Clojure
  • Scala


  • Competitive Salary
  • Viafoura University
  • Education reimbursement
  • Lots of vacation
  • Free food, snacks, beer
  • Learn from the best in Toronto
  • Collaborative space where many meetups are held
  • Monthly hack days where you “scratch an itch”
  • Have a voice – employee feedback genuinely taken into account




  • AWS/EC2
  • NetflixOSS
    • Karyon
    • Priam
    • Archaius
    • Eureka
    • ICE
    • Buri
    • Aminator
    • Hystrix
    • Turbine
    • Zuul
  • Cassandra
  • HAProxy
  • Pagerduty
  • Loggly
  • Percona XtraDB
  • Nginx
  • Kafka
  • Storm
  • Spark
  • Spark streaming
  • Lucene