In July 2017, Streamroot migrated its infrastructure to Kubernetes running on Google Cloud. We also took this opportunity to move from a VM-based to a container-based architecture. This change was motivated by our growth, availability requirements and need for faster time-to-production. After a bit more than 6 months of running Kubernetes in production, we wanted to share our experience.
Streamroot provides services for the video streaming industry, where users tend to watch their videos at the same time of day, for instance when they come back from work. We have an acute traffic pattern: every day we need 4 times more resources during rush hours than during lulls, so autoscaling is a must in order to control infrastructure costs and manage traffic spikes.
Our microservices (~10) receive an aggregated ~40.000 HTTP requests per second and over a million persistent websocket connections at peak time. Interestingly, a bit less than half of our overall CPU count in our previous VM-based setup was dedicated to SSL processing (we used Nginx as a reverse proxy).
Things we wish we’d known earlier: what’s not in the doc
When we started with Google, the Google Cloud Load Balancer (GCLB for short) didn’t support — at least not publicly — SNI for SSL, which meant that we had to deal with a lot of load balancers (we don’t always have the luxury of wildcard certificates). Luckily SNI has been made generally available since. Unfortunately, it is not (yet?) possible to configure a Kubernetes Ingress that creates a GCLB with support for multiple SSL certs for different hostnames. In other words, we can’t manage certificates like Kubernetes Secrets and we have to create GCLBs manually, which is a pain.
Another limitation of the GCLB is that it’s not flexible in terms of configuring “stickiness”, that is, how you can route an HTTP request to a specific server. The only options are based on the client’s IP or some HTTP cookie that the GCLB manages for you. It’s nowhere near what Nginx is capable of, where you can use any Nginx variable ($request_uri, a capturing group in a rewrite regex, etc.) to be the basis on which an upstream server is chosen. Because our services have some in-memory states, we have to do some advanced sharding that the GCLB doesn’t support. Basically we have to run Nginx behind the GCLB, the latter being used only for SSL offloading.
On the bright side, the GCLB performs SSL offloading for free, in the sense that one only pays for the processed traffic regardless of whether it is encrypted or not.
The canonical way to autoscale with K8s seems to use the Horizontal Pod Autoscaler (HPA) in conjunction with the GKE node autoscaler. We find that it’s cumbersome. Besides the HPA and the NodeAutoscaler being quite new, this requires us to precisely set the pods’ resource requests and to run Heapster to monitor actual resource usage. Also, it’s not very dynamic in the event of traffic surges because the pod and Node autoscalers run at independent periods and both have to kick in to really scale out.
Instead, we run high-load services such as Kubernetes daemon sets and we rely on the managed instance groups’ (the ones created under the hood by GKE) autoscaler. When the CPU consumption target is crossed:
- A new VM is added to the managed instance group.
- That VM automatically joins the K8s cluster as a K8s node.
- That node gets scheduled some Daemon Set pods which are automatically registered in the load balancer that effectively spreads the load and reduces the average load per managed instance.
Note: for some reason, Google doesn’t recommend running the instance group autoscaler with a GKE cluster.
Because our workload is network intensive, we had to really understand how network packets would flow in and out of pods. We wanted to minimize the number of network hops, which is easy to lose count of with the stacking of load balancers (GCLB, Nginx, Kube-proxy). In the end, in addition to running some services as daemon sets, we also configured high-load pods to be running with “Host Network”. We found it an effective way to limit the amount of Network Address Translation and IPTables rules.
While speaking of network, we rather painfully found out that we need to tweak the K8s nodes kernel via sysctl commands in order to accommodate the high number of websocket connections. For us, the most important parameter is “net.ipv4.netfilter.ip_conntrack_max”. Ideally the appropriate value for this parameter should either be hard-coded in a custom Linux image or set while the instance is starting up via cloud-init or a startup script. But, as GKE provides managed Kubernetes upgrades (which is a relief!) we don’t want to deviate too much from the “standard” setup with the default supported “Container-Optimized OS” and default startup script for fear that the automatic upgrades would fail. So we decided to go with a new Daemon Set with a privileged pod that run the “sysctl” commands. The main downside of this approach is that we really depend on that privileged pod to be spawned quickly, so that the sysctl tuning applies before other pods start to accept traffic. Otherwise we would see the usual ‘’nf_conntrack: table full, dropping packet’ in the host kernel logs, and that’s not good…
On a side note, we would like to thank Manuel Alejandro de Brito Fontes (@aledbf on Github) for the excellent work he is doing on the Kubernetes Nginx Ingress Controller. We found this software to be reliable across updates and flexible with a lot of Kubernetes annotations. Additionally, the source code is pleasant to read and the maintainer very responsive.
While we may use Kubernetes in unusual ways (lots of DaemonSet and HostNetwork), we’ve been happy users for now. We really appreciate how easy it is to deploy new versions of our microservices and how auto-scaling allows us to pay for exactly — and only — what we need.
If you are interested in hearing more about Kubernetes from us, we regularly attend Cloud Native Computing Foundation meetups in Paris or stay tuned for our upcoming blog post about multi-region deployments and operation.