ETCD quota errors: troubleshooting

Last updated April 22nd, 2021.

Objective

At some point during the life of your Managed Kubernetes cluster, you may encounter one of the following errors, which prevent you from creating or altering resources:

rpc error: code = Unknown desc = ETCD storage quota exceeded
rpc error: code = Unknown desc = quota computation: etcdserver: not capable
rpc error: code = Unknown desc = The OVHcloud storage quota has been reached

This guide will show you how to troubleshoot and resolve this situation.

Requirements

  • An OVHcloud Managed Kubernetes cluster
  • The kubectl command-line tool installed

Instructions

Background

Each Kubernetes cluster has a dedicated quota on ETCD storage usage, calculated with the following formula:

Quota = 10MB + (25MB per node), capped at 200MB

For example, a cluster with 3 b2-7 servers has a quota of 85MB.
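The formula above can be sketched as a small shell helper (a hypothetical convenience function, not an official OVHcloud tool):

```shell
# Hypothetical helper: compute the ETCD quota (in MB) from the node count,
# following the formula Quota = 10MB + (25MB per node), capped at 200MB.
etcd_quota_mb() {
  quota=$(( 10 + 25 * $1 ))
  if [ "$quota" -gt 200 ]; then
    quota=200   # the quota is capped at 200MB
  fi
  echo "$quota"
}

etcd_quota_mb 3    # prints 85, matching the 3-node example above
etcd_quota_mb 10   # prints 200 (capped)
```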

The quota can thus be increased by adding nodes, but will never be decreased (even if all nodes are removed) to prevent data loss.

The errors mentioned above indicate that the cluster's ETCD storage usage has exceeded the quota.

To resolve the situation, resources created in excess need to be deleted.

Most common case: misconfigured cert-manager

Most users install cert-manager through Helm, and then move on a bit hastily.

The most common ETCD quota issues come from a misconfigured cert-manager that continuously creates certificaterequest resources.

This behaviour will fill the ETCD with resources until the quota is reached.

To verify whether you are in this situation, count the certificaterequest and order.acme resources:

kubectl get certificaterequest.cert-manager.io -A --no-headers | wc -l
kubectl get order.acme.cert-manager.io -A --no-headers | wc -l

If you have a huge number (hundreds or more) of these resources, you have found the root cause.
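The check above can be wrapped in a small script; note that the 300 threshold below is an arbitrary example value, not an official limit:

```shell
# Hypothetical helper: decide whether a resource count looks like runaway
# creation. The 300 threshold is an arbitrary example, not an official limit.
looks_like_runaway() {
  [ "$1" -gt 300 ]
}

# Count certificaterequests across all namespaces (reports 0 if the
# command fails, e.g. when the CRD is absent).
cr_count=$(kubectl get certificaterequest.cert-manager.io -A --no-headers 2>/dev/null | wc -l)
if looks_like_runaway "$cr_count"; then
  echo "cert-manager looks like the root cause"
fi
```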

To resolve the situation, we propose the following method:

  • Stopping cert-manager
kubectl -n <your_cert_manager_namespace> scale deployment --replicas 0 cert-manager
  • Flushing all certificaterequest and order.acme resources
kubectl delete certificaterequest.cert-manager.io -A --all
kubectl delete order.acme.cert-manager.io -A --all
  • Updating cert-manager

There is no generic way to do this, but if you installed cert-manager with Helm, we recommend using Helm for the update as well; see the official cert-manager documentation.

  • Fixing the issue

We recommend troubleshooting your cert-manager installation to ensure that everything is correctly configured; see the cert-manager ACME troubleshooting guide.

  • Starting cert-manager
kubectl -n <your_cert_manager_namespace> scale deployment --replicas 1 cert-manager
(adjust --replicas to the value your deployment used before it was stopped)

Other cases

If cert-manager is not the root cause, you should turn to the other running operators that create Kubernetes resources.

We have found that the following resources can sometimes be generated continuously by existing operators:

  • backups.velero.io
  • ingress.networking.k8s.io
  • ingress.extensions
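A small loop can count each of these kinds in turn; this is a sketch, and kinds whose CRD is not installed on the cluster simply report 0:

```shell
# Sketch: count instances of resource kinds that operators sometimes
# generate continuously. Kinds that do not exist on the cluster report 0.
count_kind() {
  kubectl get "$1" -A --no-headers 2>/dev/null | wc -l
}

for kind in backups.velero.io ingress.networking.k8s.io ingress.extensions; do
  echo "$kind: $(count_kind "$kind")"
done
```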

If that still does not cover your case, you can use a tool like ketall to easily list and count resources in your cluster.

Then you should delete the resources in excess and fix the process responsible for their creation.

Go further

To learn more about using your Kubernetes cluster the practical way, we invite you to look at our OVHcloud Managed Kubernetes doc site.

Join our community of users.

