Running at Scale
Introduction
Kubernetes has emerged as a powerful foundation for deploying and managing containerized workloads at scale. Even so, running Cloud Development Environments (CDEs) at scale, particularly in the range of thousands of concurrent workspaces, presents significant challenges.
Important
CDE workloads are hard to scale mainly because the underlying IDE solutions, such as Visual Studio Code - Open Source ("Code - OSS") or JetBrains Gateway, are designed as single-user applications rather than multitenant services.
Such a scale imposes substantial infrastructure demands and introduces potential bottlenecks that can impact the performance and stability of the entire system. Addressing these challenges requires meticulous planning, strategic architectural choices, monitoring, and continuous optimization to ensure a seamless and efficient development experience for a large number of users. In this article, you can find a few important takeaways worth considering when running Eclipse Che at scale on Kubernetes.
Planning your environment according to object maximums
While there is no strict limit on the number of resources in a Kubernetes cluster, there are certain considerations for large clusters to keep in mind.
Tip
More insights about Kubernetes scalability can be found in the "Scalability, with Wojciech Tyczynski" episode of the Kubernetes Podcast.
OpenShift Container Platform, which is a certified distribution of Kubernetes, also provides a set of tested maximums for various resources, which can serve as an initial guideline for planning your environment:
| Resource type | Tested maximum |
|---|---|
| Number of nodes | 2,000 |
| Number of pods | 150,000 |
| Number of pods per node | 2,500 |
| Number of namespaces | 10,000 |
| Number of services | 10,000 |
| Number of secrets | 80,000 |
| Number of config maps | 90,000 |
Table 1: OpenShift Container Platform tested cluster maximums for various resources.
Tip
You can find more details on the OpenShift Container Platform tested object maximums in the official documentation.
For example, it is generally not recommended to have more than 10,000 namespaces due to potential performance and management overhead. In Eclipse Che, for instance, each user is allocated a namespace. If you expect the user base to be large, consider spreading workloads across multiple "fit-for-purpose" clusters and potentially leveraging solutions for multi-cluster orchestration.
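For reference, these per-user namespaces are provisioned by Eclipse Che based on the defaultNamespace settings in the CheCluster Custom Resource, so the namespace count grows one-to-one with the active user base. A minimal sketch of the relevant configuration, assuming the default <username>-che naming template:
spec:
  devEnvironments:
    defaultNamespace:
      autoProvision: true       # create the namespace automatically on first workspace start
      template: <username>-che  # one namespace per user, derived from the username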
Calculating Resource Requirements
When deploying Eclipse Che on Kubernetes, it is crucial to accurately calculate the resource requirements and determine the memory and CPU/GPU needs of each CDE in order to size the cluster correctly. In general, a CDE is limited by the worker node size and cannot be bigger than the node it runs on. The resource requirements for CDEs can vary significantly based on the specific workloads and configurations. For example, a simple CDE may require only a few hundred megabytes of memory, while a more complex one may need several gigabytes of memory and multiple CPU cores.
Tip
You can find more details about calculating resource requirements in the official documentation.
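As a rough sizing exercise, you can start from the per-CDE requests and limits declared in the devfile, multiply them by the expected number of concurrent workspaces, and add headroom for the operating system and Kubernetes system components on each worker node. A minimal devfile sketch, where the image and values are illustrative assumptions rather than recommendations:
schemaVersion: 2.2.0
metadata:
  name: sample-cde
components:
  - name: tools
    container:
      image: quay.io/devfile/universal-developer-image:ubi8-latest  # illustrative developer image
      memoryRequest: 512Mi
      memoryLimit: 4Gi
      cpuRequest: 500m
      cpuLimit: "2"
With these example values, 1,000 concurrent CDEs would request roughly 500 GiB of memory and 500 CPU cores as a baseline, while the limits define the worst case; the worker nodes must also be large enough to host the biggest single CDE you expect to run.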
etcd
etcd is the primary datastore for Kubernetes cluster configuration and state, holding information about nodes, pods, services, and custom resources. As a distributed key-value store, etcd does not scale well past a certain threshold, and as its size grows, so does the load on the cluster, risking its stability.
Important
The default etcd size is 2 GB, and the recommended maximum is 8 GB. Exceeding the maximum limit can make the Kubernetes cluster unstable and unresponsive.

Figure 1: Metrics showing the progression of etcd storage growth that leads to cluster instability.
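To catch this kind of growth before it destabilizes the cluster, you can alert on the database size metric exposed by etcd itself. A minimal PrometheusRule sketch, assuming the Prometheus Operator is installed; the threshold, namespace, and alert name are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-db-size
  namespace: openshift-monitoring
spec:
  groups:
    - name: etcd-capacity
      rules:
        - alert: EtcdDatabaseApproachingLimit
          # Fire when the etcd database grows past 6 GiB, well before the 8 GB recommended maximum
          expr: etcd_mvcc_db_total_size_in_bytes > 6 * 1024 * 1024 * 1024
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "etcd database size on {{ $labels.instance }} is approaching the recommended maximum"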
The size of the objects matters
Not only the overall number, but also the size of the objects stored in etcd is a critical factor that can significantly impact its performance and stability. Each object stored in etcd consumes space, and as the number of objects increases, the overall size of etcd grows too. The larger the object, the more space it takes in etcd. For example, etcd can be overloaded with just a few thousand relatively big Kubernetes objects.
Important
Even though the data stored in a ConfigMap cannot exceed 1 MiB by design, a few thousand relatively big ConfigMap objects can overload etcd storage.
In the context of Eclipse Che, by default the operator creates and manages the 'ca-certs-merged' ConfigMap, which contains the Certificate Authorities (CAs) bundle, in every user namespace. With a large number of TLS certificates in the cluster, this results in additional etcd usage.
In order to disable mounting the CA bundle using the ConfigMap under the /etc/pki/ca-trust/extracted/pem path, configure the CheCluster Custom Resource by setting the disableWorkspaceCaBundleMount property to true. With this configuration, only custom certificates will be mounted under the /public-certs path:
spec:
  devEnvironments:
    trustedCerts:
      disableWorkspaceCaBundleMount: true
Tip
You can find more details about importing untrusted TLS certificates in the official documentation.
DevWorkspace objects
For large Kubernetes deployments, particularly those
involving a high number of custom resources such as
DevWorkspace
objects, which represent CDEs,
etcd can become a significant performance bottleneck.
Important
Based on load testing with 6,000 DevWorkspace objects, etcd storage consumption was around 2.5 GB.
Starting from DevWorkspace Operator version 0.34.0, you can configure a pruner that automatically cleans up DevWorkspace objects that have not been in use for a certain period of time. To set up the pruner, configure the DevWorkspaceOperatorConfig object as follows:
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: crw
config:
  workspace:
    cleanupCronJob:
      enabled: true
      dryRun: false
      retainTime: 2592000 # By default, if a workspace was not started for more than 30 days it will be marked for deletion
      schedule: "0 0 1 * *" # By default, the pruner will run once per month
Tip
You can find more details about DevWorkspace Operator configuration in the official documentation.
OLMConfig
When an operator is installed by the Operator Lifecycle Manager (OLM), a stripped-down copy of its ClusterServiceVersion (CSV) is created in every namespace the operator is configured to watch. These stripped-down CSVs are known as "Copied CSVs" and communicate to users which controllers are actively reconciling resource events in a given namespace. On especially large clusters, with namespaces and installed operators numbering in the hundreds or thousands, Copied CSVs consume an untenable amount of resources: OLM memory usage, cluster etcd limits, networking, and so on. To eliminate the CSVs copied to every namespace, configure the OLMConfig object accordingly:
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    disableCopiedCSVs: true
Tip
Additional information about the disableCopiedCSVs feature can be found in its original enhancement proposal.
The primary impact of the disableCopiedCSVs property on etcd is related to resource consumption. In clusters with a large number of namespaces and many cluster-wide Operators, the creation and maintenance of numerous Copied CSVs can lead to increased etcd storage usage and memory consumption. By disabling Copied CSVs, the amount of data stored in etcd is significantly reduced, which can help improve overall cluster performance and stability.
This is particularly important for large clusters, where the number of namespaces and operators can quickly add up to a significant amount of data. Disabling Copied CSVs also reduces the memory footprint of OLM, as it no longer needs to maintain and manage these additional resources.
Tip
You can find more details about "Disabling Copied CSVs" in the official documentation.
Cluster Autoscaling
Although cluster autoscaling is a powerful Kubernetes feature, you cannot always fall back on it. You should always consider predictive scaling by analyzing load data on your environment to detect daily or weekly usage patterns. If your workloads follow a pattern and there are dramatic peaks throughout the day, you should consider provisioning worker nodes accordingly. For example, if you have a predictable load pattern where the number of workspaces increases during business hours and decreases during off-hours, you can use predictive scaling to adjust the number of worker nodes based on the expected load. This can help ensure that you have enough resources available to handle the peak load while minimizing costs during off-peak hours.
Tip
Consider leveraging open-source solutions like Karpenter for configuration and lifecycle management of the worker nodes. Karpenter can dynamically provision and optimize worker nodes based on the specific requirements of the workloads, helping to improve resource utilization and reduce costs.
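As an illustration, the following is a minimal Karpenter NodePool sketch, assuming Karpenter v1 on AWS with an EC2NodeClass named default; the instance types, limits, and consolidation settings are assumptions to be adjusted to your own CDE sizes and usage patterns:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cde-workers
spec:
  template:
    spec:
      requirements:
        # Only provision instance types large enough to host the biggest expected CDE
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.2xlarge", "m5.4xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    # Upper bound on the total capacity this pool may provision
    cpu: "1000"
    memory: 4000Gi
  disruption:
    # Remove empty or underutilized nodes once the peak has passed
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m
The consolidation settings determine how aggressively underutilized worker nodes are removed once the daily peak has passed.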
Multi-cluster
By design, Eclipse Che is not multi-cluster aware, and you can only have one instance per cluster. However, you can run Eclipse Che in a multi-cluster environment by deploying Eclipse Che in each different cluster and using a load balancer or DNS-based routing to direct traffic to the appropriate instance based on the user’s location or other criteria. This approach can help improve performance and reliability by distributing the workload across multiple clusters and providing redundancy in case of cluster failures.
One example of a multi-cluster architecture is workspaces.openshift.com, which is part of the Developer Sandbox.
From the infrastructure perspective, the Developer Sandbox consists of multiple ROSA clusters. On each cluster, the productized version of Eclipse Che is installed and configured using Argo CD. Since the user base is spread across multiple clusters, workspaces.openshift.com is used as a single entry point to the productized Eclipse Che instances. You can find implementation details about the multicluster redirector in the following GitHub repository.

Figure 2: Multi-cluster solution for running the productized version of Eclipse Che on the Developer Sandbox.
Tip
The solution for workspaces.openshift.com is a Developer Sandbox-specific solution that cannot be reused as-is in other environments. However, you can use it as a reference for implementing a similar solution tailored to your specific multicluster needs.
Conclusion
Running Eclipse Che at scale on Kubernetes presents unique challenges that require careful planning and consideration of various factors. By understanding the limitations of Kubernetes, accurately calculating resource requirements, and implementing best practices for managing etcd and OLM, you can build a robust and scalable CDE platform that meets the needs of your users. Leveraging predictive scaling and considering multicluster architectures can further enhance the performance and reliability of Eclipse Che deployments. By following these guidelines and continuously monitoring and optimizing the environment, you can provide a reliable and efficient development experience for a large number of users while keeping the cluster responsive and preventing performance degradation.