Introduction

I’ve been using Rancher for one year now with Cattle orchestrator.

From ci to review env to production I everything with Rancher.

It provides great value and is really stable.

Though we experienced some issues that I’d like to share here.

Gotchas

Containers orchestration

It doesn’t enforce best practicies such as applying constraints on containers. Why is it important? well what if you had a server with 4G of ram, then you deploy multiple containers on this server.

Everything looks fine at first until one container start consuming more and more memory until the host has no more memory available. This is the begining of a bad moment.

The host will then send kill signals to your docker containers, eventually killing ipsec containers or rancher-agent container. So in the end you will see a Disconnected host. Most of the time you will need to manually evacuate the host.

Beware the cascade effect.

Be evacuating containers from the disconnected server they will be scheduled on some other hosts that can experience the same thing as the Disconnected server.

Rancher database pressure

After the evacuation of a host or updating some stack you can find some host in the Reconnecting state. This is bad.

Your servers are known by Rancher but Rancher cannot display their state, then you’ll see your stacks and containers becoming Unhealthy.

But these are false status. Check your services before taking actions in Rancher usually they are in Good state.

The Reconnecting state appeared on our Rancher and was due to too much pressure on the Rancher database. Someone had a really hard night try to bring everything back up to only see all new hosts and stacks go to Reconnecting state.

If your Rancher database is 100% CPU then Rancher will display wrong stack/host/services status.

Containers logs

By default there is no logging limits on containers, what this means is that you can see the container log history, and all the logs are stored in one file on the server.

What can go wrong is that one container could generate a lot of logs until the disk is full.

So always apply logging limits such as: max-size, max-file on the container

Always use a service to aggregate logs, such as: Graylog

Conclusion

Monitor your Rancher servers and Rancher Database.

Apply memory restriction

Apply logging limits