Tuesday, March 22, 2022

Kubernetes scalability bottlenecks: can Kubernetes scale to manage one million objects?

While Kubernetes was originally designed as a platform for orchestrating containers, it is increasingly becoming a platform for orchestrators, or for applications that manage other entities. Kubernetes can run applications that manage VMs, multiple Kubernetes clusters, and edge devices and their applications, and it can also orchestrate provisioning of bare-metal machines and of OpenShift clusters, among other things.

In this blog post I describe common characteristics and a generic architecture of management applications that use the Kubernetes API to represent application state. I classify management applications that use the Kubernetes API as either Kubernetes-native or fully-Kubernetes-native (I invented the latter term). Lastly, I examine scalability bottlenecks of Kubernetes that can hinder applications managing a large number of objects through the Kubernetes API, and I outline possible workarounds for some of the bottlenecks.

Kubernetes-native applications

Some of the management applications mentioned above describe themselves as Kubernetes-native or Kubernetes-style: they are implemented and operated using the Kubernetes API, the CLI and tools from the Kubernetes ecosystem. I did not find a precise definition of what Kubernetes-native means. Sometimes the broader term cloud-native is used. In the following list I provide common characteristics of Kubernetes-native applications. The list is not meant to be exhaustive, and not all Kubernetes-native applications have all the characteristics.

In my opinion, a Kubernetes-native application does not have to run on a Kubernetes cluster. It can run elsewhere, communicate with the Kubernetes API of some Kubernetes cluster and have all of the characteristics above or a subset of them.

Fully-Kubernetes-native applications

Some of the applications even store all the configuration data for the objects they manage using the Kubernetes API. I call such applications fully Kubernetes-native.

There are multiple advantages to fully-Kubernetes-native applications:

  • since the applications store all their data in the same database in which Kubernetes stores its data (etcd), there is no need to operate an additional database, define its schema and program access to it. The applications get schema validation and versioning of their data for free. The application developers only need to learn the Kubernetes API and Kubernetes Custom Resource Definitions; there is no need to learn SQL or another database query and data-definition language.
  • the applications get an API server (the Kubernetes API server) and a CLI (kubectl) for their data at no cost. kubectl can be extended for the needs of the application using kubectl plugins.
  • the applications use the Kubernetes authentication and authorization mechanisms. In particular, application admins can define Kubernetes Role-Based Access Control (RBAC) rules to specify the operations users can perform on managed objects, without having to use external authorization systems.

Kubernetes-native applications that are not fully Kubernetes-native store some of their state in a database other than etcd. They have to implement data access to that database, authorization for the data access, and some API server on top of the database.

Implementing applications in the fully-Kubernetes-native way might spare development and operational effort and reduce skill requirements. The question, though, is: can such applications scale to manage a large number of objects? If one wants to manage 5G network infrastructure or edge devices, one might need to manage tens of thousands, hundreds of thousands or, in the future, maybe even millions of objects. In the remainder of this blog post I describe Kubernetes scalability bottlenecks that can hamper fully-Kubernetes-native applications in managing large numbers of objects. These are the bottlenecks I encountered when working with fully-Kubernetes-native applications; readers are welcome to point out more Kubernetes bottlenecks in the comments.

Fully-Kubernetes-native management applications

To facilitate the explanation of the scalability requirements of the management applications, I present a generic architecture for fully-Kubernetes-native management applications in the following diagram:

[Diagram: a generic architecture of a fully-Kubernetes-native management application]

In the diagram above, a management application manages some managed entities (sorry for too many "management" words). Examples of managed entities are VMs, bare-metal machines, edge devices and other Kubernetes clusters. Each managed entity might have managed subentities. In the case of a VM, examples of managed subentities are network interfaces and disks. In the case of an edge device, examples of managed subentities are the mobile applications that run on the device. The management agents run on the managed entities and watch for the desired configuration of their managed entities. The desired configuration resides in the application CRs (Kubernetes Custom Resources). The management agents act according to the desired configuration, for example, add a disk or a network interface, deploy a mobile application or configure an existing application. The management agents report the status of the managed entities and of the managed subentities by updating the status field of the CRs or by using dedicated CRs. The management agents get and update the application CRs through the Kubernetes API.
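
As an illustration, here is a minimal sketch, in Go with the dynamic client from client-go, of how a management agent might report actual state by updating the status subresource of its CR. The API group, resource and field names (edge.example.com, devices, lastSeen, phase) are hypothetical, and the CRD is assumed to have the status subresource enabled.

    package main

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // An agent at the edge would typically use a kubeconfig bound to its own service account.
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/agent/kubeconfig") // hypothetical path
        if err != nil {
            panic(err)
        }
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Hypothetical CRD representing a managed entity.
        gvr := schema.GroupVersionResource{Group: "edge.example.com", Version: "v1alpha1", Resource: "devices"}
        ctx := context.Background()

        // Pull the CR of this agent's managed entity (the desired configuration lives in its spec).
        dev, err := client.Resource(gvr).Namespace("fleet").Get(ctx, "device-42", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        // Report the actual state back through the status subresource.
        if err := unstructured.SetNestedField(dev.Object, time.Now().Format(time.RFC3339), "status", "lastSeen"); err != nil {
            panic(err)
        }
        if err := unstructured.SetNestedField(dev.Object, "Ready", "status", "phase"); err != nil {
            panic(err)
        }
        if _, err := client.Resource(gvr).Namespace("fleet").UpdateStatus(ctx, dev, metav1.UpdateOptions{}); err != nil {
            panic(err)
        }
    }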

The users of the application can provide the desired configuration for the managed entities and subentities by creating or updating application CRs using the kubectl create, apply and edit commands. The users can get the status of the managed entities and subentities by reading application CRs using kubectl get. GitOps and other tools such as Web consoles and dashboards, from the Kubernetes ecosystem or custom ones, can operate the managed entities and subentities through the Kubernetes API. Note that while the Kubernetes API for management and the Kubernetes API for agents are represented in the diagram as different objects, they are served by the same Kubernetes API server and may be identical.

The users can also use a custom Web console to perform management operations through a GUI instead of the CLI. (I omitted the Web console from the diagram above for brevity.) The management application may be implemented as a set of controllers/operators that watch the application CRs. The management application reconciles the desired state (of managed entities or of managed subentities) with the actual state reported by the management agents. Note that while the management application and the external tools are depicted in the diagram outside of the Kubernetes cluster, they may run inside the cluster; they consume the same Kubernetes API in both cases. The management agents, the management application and the external tools may be implemented using standard Kubernetes libraries and may operate according to the controller pattern.

One of the important aspects of management is security. A compromised agent on one managed entity might access or modify the data of other managed entities in the application CRs. In Kubernetes such attacks can be prevented by restricting the permissions of the agent to allow access only to the CRs of the agent's entity. Restricting the permissions of an agent can be accomplished declaratively in Kubernetes by allocating a service account to represent the agent and by specifying Kubernetes access control rules for that service account. Alternatively, authorization webhooks or admission controllers (for control of updates only) might be used, but those require implementing and maintaining additional components.
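
As an illustration of this declarative approach, here is a minimal sketch in Go, not taken from any particular application, that allocates a ServiceAccount for an agent and creates a Role and RoleBinding restricting that agent, via resourceNames, to the single CR of its own managed entity. The API group and resource names are hypothetical.

    package fleet

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        rbacv1 "k8s.io/api/rbac/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // createAgentRBAC creates a per-agent ServiceAccount plus a Role/RoleBinding
    // that allow access only to the CR named after the agent's managed entity.
    func createAgentRBAC(ctx context.Context, cs kubernetes.Interface, ns, device string) error {
        sa := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: "agent-" + device, Namespace: ns}}
        if _, err := cs.CoreV1().ServiceAccounts(ns).Create(ctx, sa, metav1.CreateOptions{}); err != nil {
            return err
        }

        role := &rbacv1.Role{
            ObjectMeta: metav1.ObjectMeta{Name: "agent-" + device, Namespace: ns},
            Rules: []rbacv1.PolicyRule{{
                APIGroups:     []string{"edge.example.com"}, // hypothetical API group
                Resources:     []string{"devices", "devices/status"},
                ResourceNames: []string{device}, // only this agent's CR
                Verbs:         []string{"get", "update", "patch"},
            }},
        }
        if _, err := cs.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{}); err != nil {
            return err
        }

        rb := &rbacv1.RoleBinding{
            ObjectMeta: metav1.ObjectMeta{Name: "agent-" + device, Namespace: ns},
            Subjects:   []rbacv1.Subject{{Kind: "ServiceAccount", Name: sa.Name, Namespace: ns}},
            RoleRef:    rbacv1.RoleRef{APIGroup: "rbac.authorization.k8s.io", Kind: "Role", Name: role.Name},
        }
        _, err := cs.RbacV1().RoleBindings(ns).Create(ctx, rb, metav1.CreateOptions{})
        return err
    }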

Notice an additional important point in this architecture: the management agents initiate the network connections to the Kubernetes cluster that hosts the application CRs; the management application does not initiate network connections to the managed entities. The management agents pull the configuration data rather than the management application pushing the configuration data to the managed entities directly. Such a design facilitates network communication in the case where the managed entities are deployed at the edge:

  1. The managed entities may be only intermittently connected to the network. When a managed entity becomes connected, its management agent connects to the Kubernetes cluster, pulls the configuration data for its entity and reports the status back. If the management application initiated connections to the managed entities, it would have to handle network disconnections in an environment with limited network connectivity.
  2. At the edge, there could be a firewall that prevents network connections from being initiated from the outside and allows network connections to be initiated by the managed entities only.

If you have read this far, you know what I mean by fully-Kubernetes-native management applications. In the following paragraphs I describe Kubernetes scalability bottlenecks and their relevance to fully-Kubernetes-native management applications.

Kubernetes scalability bottlenecks

The most obvious scalability bottleneck is storage. The etcd documentation recommends 8GB as the maximum storage size. This means that an application that manages one million objects can store only up to 8KB of data per object. In practice, the limit is even lower, since etcd also stores other cluster data as well as a limited history of the objects.

If authorization is handled by creating a service account per management agent, as described above, then the management application needs to allocate one million service accounts to be able to manage one million managed entities securely. The size of a service account in Kubernetes can be on the order of 10KB, which means the management application cannot even create one million service accounts within the recommended etcd storage limit of 8GB (10KB * 1,000,000 = 10GB). Compare this storage limit with the storage limits of a leading SQL database, PostgreSQL: the database size is virtually unlimited, a single relation can be 32TB and a single field can be 1GB (!).

A possible solution to the storage limitations of etcd is Kine, an etcd shim that adapts the etcd API to other databases, such as MySQL and PostgreSQL. However, even if the storage problem of Kubernetes is solved, there is another bottleneck for Kubernetes scalability, namely the single-resource API. The mutating API verbs, such as CREATE, UPDATE and DELETE, support single resources only. A client that wishes to apply such a verb to many resources must make a separate request for each of those resources.

Consider the case where the management application needs to rotate one million certificates or update common configuration for one million managed entities. Another case is when a management agent has to update the status of multiple subentities of its managed entity. If the management application or a management agent needs to update a large number of application CRs at once, tough luck: the updates must be performed one by one. Performing updates one by one involves a network roundtrip per application CR, including handling network failures and retries for each CR, as in the sketch below. The issue can be especially acute when connectivity between a management agent and the Kubernetes cluster is limited or has high latency, for example in the case of managing edge devices.
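
To make the cost concrete, here is a sketch of what such a "bulk" update looks like with the single-resource API: one roundtrip, plus a conflict-retry loop, per CR. The group, resource and field names are hypothetical.

    package fleet

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/util/retry"
    )

    // rotateCertificates points every listed device CR at a new certificate secret,
    // issuing one update request (and one conflict-retry loop) per CR.
    func rotateCertificates(ctx context.Context, c dynamic.Interface, ns string, names []string, newCertRef string) error {
        gvr := schema.GroupVersionResource{Group: "edge.example.com", Version: "v1alpha1", Resource: "devices"}
        for _, name := range names { // potentially one million iterations
            err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
                obj, err := c.Resource(gvr).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
                if err != nil {
                    return err
                }
                if err := unstructured.SetNestedField(obj.Object, newCertRef, "spec", "certificateSecretRef"); err != nil {
                    return err
                }
                _, err = c.Resource(gvr).Namespace(ns).Update(ctx, obj, metav1.UpdateOptions{})
                return err
            })
            if err != nil {
                return err // network failures must also be handled per CR
            }
        }
        return nil
    }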

Compare the Kubernetes API with a SQL database: in SQL you can insert multiple rows in a single INSERT statement, and you can update multiple rows with an UPDATE ... WHERE statement, selecting the rows to be updated by the WHERE clause. There are also batch operations for sending multiple INSERT and UPDATE commands in one batch. Some SQL databases even have binary bulk operations, where multiple inserts are encoded into a binary blob on the client side and sent to the server as binary. See, for example, batch queries and COPY protocol support for faster bulk data loads in the pgx Go driver for PostgreSQL.
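
For comparison, here is a minimal sketch using the pgx driver mentioned above, queueing many inserts and sending them as a single batch in one network roundtrip. The connection string, table and column names are made up for the example.

    package main

    import (
        "context"
        "fmt"

        "github.com/jackc/pgx/v5"
    )

    func main() {
        ctx := context.Background()
        conn, err := pgx.Connect(ctx, "postgres://user:password@localhost:5432/mgmt") // hypothetical DSN
        if err != nil {
            panic(err)
        }
        defer conn.Close(ctx)

        // Queue many statements locally and send them to the server in one batch.
        batch := &pgx.Batch{}
        for i := 0; i < 1000; i++ {
            batch.Queue("INSERT INTO devices(name, cert_ref) VALUES($1, $2)",
                fmt.Sprintf("device-%d", i), "cert-2022-03")
        }
        results := conn.SendBatch(ctx, batch)
        if err := results.Close(); err != nil {
            panic(err)
        }
    }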

Yet another Kubernetes API problem related to scalability is the inability to specify the sort order of returned lists of objects. The users of kubectl can specify a sort order using the --sort-by flag, but the sorting is performed on the client side by kubectl. In a similar way, other tools may implement client-side sorting for various sort orders. Fetching one million objects and sorting them on the client side may strain the computational resources (memory and CPU) of the clients and waste bandwidth, especially when sorting is combined with pagination. Consider the case where a user wants to see the top ten objects out of one million, according to some criterion. In this case the client (for example a Web browser) must fetch the whole million and find the first ten objects in the requested sort order. Alternatively, the clients could use some proxy on top of the Kubernetes API and let this proxy component perform caching, sorting and pagination for them. This would require additional development and maintenance effort for the proxy component and would consume the computational resources required to run it.
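
The following sketch, with hypothetical resource names, shows what a client has to do today to obtain a "top ten" view: list the whole collection and then sort and truncate it locally.

    package main

    import (
        "context"
        "fmt"
        "sort"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
        if err != nil {
            panic(err)
        }
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        gvr := schema.GroupVersionResource{Group: "edge.example.com", Version: "v1alpha1", Resource: "devices"}

        // The client must fetch the whole collection (in practice page by page,
        // using the continue token) before it can sort and take the top ten.
        list, err := client.Resource(gvr).Namespace("fleet").List(context.Background(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        sort.Slice(list.Items, func(i, j int) bool {
            ti, tj := list.Items[i].GetCreationTimestamp(), list.Items[j].GetCreationTimestamp()
            return ti.Before(&tj)
        })
        top := list.Items
        if len(top) > 10 {
            top = top[:10]
        }
        for _, item := range top {
            fmt.Println(item.GetName())
        }
    }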

Another possible performance bottleneck of the Kubernetes API is the lack of protocol-buffers support for CRDs. Protocol buffers with gRPC may provide 7 to 10 times faster message transmission compared to a REST API.

One more scalability issue in the Kubernetes ecosystem is the lack of a built-in mechanism for load balancing between replicas of controllers: there is no built-in way in Kubernetes to distribute the processing of CR changes among multiple replicas of the same controller. The controller replicas usually perform leader election, and the elected leader performs all the reconciliation.

A side note on controller concurrency: by default, controllers that use controller-runtime do not perform reconciliation concurrently (the MaxConcurrentReconciles option is 1). Also, the work queue retry rate limit (in the case of repeated errors) is only 10 QPS by default. To understand rate limiting in controller-runtime and in the Kubernetes Go client, check this great article.
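
For reference, here is a minimal controller-runtime sketch that raises the concurrency default; the Device CRD type and its API package are hypothetical.

    package controllers

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/controller"

        edgev1alpha1 "example.com/edge/api/v1alpha1" // hypothetical generated API package
    )

    // DeviceReconciler reconciles the hypothetical Device CRD.
    type DeviceReconciler struct{}

    func (r *DeviceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... reconcile a single Device here ...
        return ctrl.Result{}, nil
    }

    func setupController(mgr ctrl.Manager) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&edgev1alpha1.Device{}).
            WithOptions(controller.Options{
                MaxConcurrentReconciles: 16, // the default is 1
            }).
            Complete(&DeviceReconciler{})
    }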

While the defaults mentioned above can be changed by controller developers to increase concurrency, reconciling one million application CRs can still strain the computational resources of a single controller. The developers of the management application must implement custom load-balancing solutions, spending development and maintenance effort, if they want to use the controller pattern with one million CRs. Note that various tools in the Kubernetes ecosystem are implemented as controllers, for example ArgoCD and Flux.

A general problem with Kubernetes clients, commonly but not exclusively appearing in controllers, is the local caches maintained by informers (which are commonly used for building controllers). An informer maintains a local cache of all the objects in its purview, which is problematic when that data volume is large, as reported here and here.
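
One partial mitigation, sketched below with hypothetical resource names, is to cache only object metadata rather than full objects by using the metadata-only informers from client-go; the local cache then holds names, labels and resource versions instead of entire CR bodies.

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/metadata/metadatainformer"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
        if err != nil {
            panic(err)
        }
        mc, err := metadata.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // The metadata informer caches only ObjectMeta, which shrinks the local
        // cache compared to a regular informer that stores full objects.
        factory := metadatainformer.NewSharedInformerFactory(mc, 10*time.Minute)
        gvr := schema.GroupVersionResource{Group: "edge.example.com", Version: "v1alpha1", Resource: "devices"}
        informer := factory.ForResource(gvr).Informer()
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) { fmt.Println("added:", obj) },
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        select {} // block; a real program would tie this to a context or signal handler
    }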

Another major scalability bottleneck in Kubernetes is authorization. Consider the following use case: a user of an application that manages one million entities must be allowed to access only ten entities out of the million. Such a requirement can be specified using Kubernetes RBAC (one of the Kubernetes authorization modes) by granting GET access to the ten CRs of the allowed managed entities and by withholding LIST access to the managed-entity CRD. (If LIST access is granted, the user is able to GET all the managed entities.) However, with such authorization the user cannot GET the ten CRs they are allowed to GET without specifying all ten CRs explicitly by name, because the LIST operation of the Kubernetes API does not filter based on authorization: clients can either get all the CRs or specific, named CRs. Moreover, there is no API to ask Kubernetes which objects a given user is allowed to access. Kubernetes clients can query the Kubernetes authorization API and ask whether a given user can access some specific object, or whether a given user can access all the objects of some kind. Practically, this means that if some client tool, such as a GitOps tool or a UI, needs to process objects on behalf of a user who has access to ten objects out of one million, the tool must query the Kubernetes authorization API one million times, once per object (as sketched below), to filter the objects the user is allowed to access. That would be a major performance bottleneck (one million network calls to the authorization API). An additional security issue in this scenario is that the tool itself must be authorized to LIST all the objects.
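
The per-object authorization check that such a tool would have to perform looks like the following sketch: one SubjectAccessReview call per object, with hypothetical group and resource names.

    package fleet

    import (
        "context"

        authorizationv1 "k8s.io/api/authorization/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // allowedDevices filters the device CRs a user may GET by issuing one
    // SubjectAccessReview per object -- for one million objects, one million calls.
    func allowedDevices(ctx context.Context, cs kubernetes.Interface, user string, names []string) ([]string, error) {
        var allowed []string
        for _, name := range names {
            sar := &authorizationv1.SubjectAccessReview{
                Spec: authorizationv1.SubjectAccessReviewSpec{
                    User: user,
                    ResourceAttributes: &authorizationv1.ResourceAttributes{
                        Group:     "edge.example.com", // hypothetical API group
                        Resource:  "devices",
                        Verb:      "get",
                        Namespace: "fleet",
                        Name:      name,
                    },
                },
            }
            resp, err := cs.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
            if err != nil {
                return nil, err
            }
            if resp.Status.Allowed {
                allowed = append(allowed, name)
            }
        }
        return allowed, nil
    }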

A solution to the problem above could be an authorization cache that continuously computes all the authorization decisions and stores them. Such a cache could be combined with the proxy mentioned above, the one used for sorting and pagination. Implementing and maintaining such a custom authorization/sorting/pagination cache and proxy would, however, require significant development effort.

Summary

To summarize, let me list the Kubernetes scalability bottlenecks examined above:

  • Storage
  • Single-resource API
  • No server-side sorting
  • No protocol-buffers support for CRDs
  • No load balancing between replicas of controllers
  • Extensive client-side caching
  • No filtering by authorization
  • Unknown unknowns

The last item in the list above refers to unknown bottlenecks that may be discovered once all the other problems in the list are solved. There is evidence of large-scale fully-Kubernetes-native management from Rancher Fleet, which managed to (sorry again for too many "management" words) import one million managed entities (Kubernetes clusters in that case). This blog post describes that experiment and, in particular, the use of Kine to overcome the storage problems. However, the authors of that blog post did not provide details on how effective the actual management tasks were after the import, the initial discovery of deployments and the reporting of status back. It would be interesting to know how much time it took to deploy a new application to one million managed clusters, whether the UI was able to perform sorting and pagination over one million managed clusters, and how effective kubectl was for working with one million managed clusters.

In my opinion, in order to support fully-Kubernetes-native management at large scale, without requiring custom solutions for storage, caching, sorting, load balancing and authorization, Kubernetes must be significantly changed.


I would like to thank Mike Spreitzer for reviewing this blog post, for providing enlightening comments and for the great discussions we had about Kubernetes. Thanks to Maroon Ayoub for his review.

Tuesday, March 26, 2019

Checklist: pros and cons of using multiple Kubernetes clusters, and how to distribute workloads between them


Here is a list of pros and cons I found for using multiple clusters vs. a single one.

Reasons to have multiple clusters

  • Scalability limits, for example a Kubernetes cluster has a limit of 150,000 pods. An OpenShift cluster has a limit of 10,000 services.
  • Separation of production/development/test
    especially for testing a new version of Kubernetes, of a service mesh, of other cluster software.
  • Compliance
    according to some regulations some applications must run in separate clusters/separate VPNs.
  • Multi-vendor
    to prevent vendor lock-in running clusters of multiple providers.
  • Cloud/on-prem
    to split the load between cloud and on-premise services.
  • Regionality for latency
    run clusters in different geographical regions to reduce latency in those regions.
  • Regionality for availability
    run clusters in different regions/availability zones to reduce the damage from a failing datacenter/region.
  • Better isolation for security
  • Isolation for easier billing/resource allocation

Reasons to have a single cluster

  • Reduce setup, maintenance and administration overhead
  • Improve utilization
  • Reduce latency between applications that would otherwise be spread across clusters
  • Cost reduction

How to allocate workloads to clusters

  • Compliance
    some applications must run on separate clusters.
  • Locality for latency
    allocate the applications according to the regions, to reduce latency.
  • Billing/Quotas
    allocate applications together per billing account, to facilitate billing/quota enforcement.
  • Maintainability
    put the applications in the same cluster when it makes sense to perform maintenance of the cluster for all them (upgrading Kubernetes version, etc.).
  • Hardware requirements
    allocate high-performance applications to clusters with hardware for high performance.
  • Dependencies
    reduce the need for inter-cluster service registries by allocating dependent applications together.
  • Identity and Access management
    allocate applications in such a way that in-cluster identity and access management suffices.
  • Monitoring, tracing, logging
    allocate applications to reduce the need for distributed monitoring, tracing, logging.

Saturday, October 29, 2011

On the difference between Linked Data and Semantic Web

After being confused for some time about the difference between Linked Data and the Semantic Web, and after reading some resources about both concepts, I would like to share my interpretation of what I read.

The Semantic Web is a vision of (among other things) creating a Web of Data. Linked Data is a concrete means to achieve (a lightweight version of) that vision. I will explain later what I mean by the lightweight version. For now, Linked Data can be seen as a reference implementation of the Semantic Web, one of several possible implementations. The Semantic Web is the What and Linked Data is the How.

According to the Linked Data book: "Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data."

According to the W3C Linked Data page: "The Semantic Web is a Web of Data... to make the Web of Data a reality, it is important to have the huge amount of data on the Web available in a standard format, reachable and manageable by Semantic Web tools. Furthermore, not only does the Semantic Web need access to data, but relationships among data should be made available, too, to create a Web of Data (as opposed to a sheer collection of datasets). This collection of interrelated datasets on the Web can also be referred to as Linked Data.
...
Linked Data lies at the heart of what Semantic Web is all about: large scale integration of, and reasoning on, data on the Web."

According to the article by Chris Bizer, Tom Heath and TimBL, Linked Data - The Story So Far: "... while the Semantic Web, or Web of Data, is the goal or the end result of this process, Linked Data provides the means to reach that goal. ... Over time, with Linked Data as a foundation, some of the more sophisticated proposals associated with the Semantic Web vision, such as intelligent agents, may become a reality."

So Linked Data constitutes a paradigm for publishing data sets on the Web in order to achieve the goal of creating a Web of Data - part of the vision of the Semantic Web. The published interrelated data sets themselves are also referred to as Linked Data.

There are, however, at least two differences between the original vision of the Semantic Web and the vision that the Linked Data principles help to achieve.

The first difference concerns the usage of URIs. According to the Linked Data principles, URIs have to be dereferenceable, while there is no such requirement in RDF. In the citations below, the bold emphasis is mine.

RDF Primer:
In addition, sometimes an organization will use a vocabulary's namespace URIref as the URL of a Web resource that provides further information about that vocabulary... Accessing ... namespace URIref in a Web browser will retrieve additional information about the ... vocabulary... However, this is also just a convention. RDF does not assume that a namespace URI identifies a retrievable Web resource

The Linked Data book:
The primary means of publishing Linked Data on the Web is by making URIs dereferenceable, thereby enabling the follow-your-nose style of data discovery. This should be considered the minimal requirements for Linked Data publishing.
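
To make "dereferenceable" concrete, the following Go sketch looks up a Linked Data URI with content negotiation, asking the server for an RDF serialization (Turtle) instead of an HTML page. DBpedia's URI for Berlin is used purely as an illustrative example of a dereferenceable URI.

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Dereferencing a Linked Data URI is an HTTP GET with content negotiation.
        // The server is expected to answer (possibly via a redirect) with an RDF
        // description of the resource rather than a human-oriented HTML page.
        req, err := http.NewRequest("GET", "http://dbpedia.org/resource/Berlin", nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Accept", "text/turtle") // ask for RDF (Turtle) instead of HTML

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        if len(body) > 500 {
            body = body[:500]
        }
        fmt.Println("Content-Type:", resp.Header.Get("Content-Type"))
        fmt.Println(string(body)) // the first few triples describing the resource
    }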

The second difference is about ontological axioms. According to the Linked Data book ontological axioms should be used sparingly:

"Only define things that matter – for example, defining domains and ranges helps clarify how properties should be used, but over-specifying a vocabulary can also produce unexpected inferences when the data is consumed. Thus you should not overload vocabularies with ontological axioms, but better define terms rather loosely (for instance, by using only the RDFS and OWL terms introduced above). "

The RDFS and OWL terms introduced in the Linked Data book are:
  • rdf:type
  • rdfs:Class
  • rdfs:Property
  • rdfs:subClassOf
  • rdfs:subPropertyOf
  • rdfs:domain
  • rdfs:range
  • rdfs:label
  • rdfs:comment
  • owl:Ontology
  • owl:ObjectProperty
  • owl:inverseOf
  • owl:equivalentClass
  • owl:equivalentProperty
  • owl:inverseFunctionalProperty

So this is what I meant by writing that Linked Data is a means to reach a lightweight version of the Semantic Web - the Web of Data, with limited use of ontologies and knowledge representation.

One might ask where the use of RDFS and OWL appears in the Linked Data principles. It is actually in principle 3: "When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)".

Once you use URIs for RDF properties, looking up those properties should provide information about them - information expressed in RDFS and OWL.

In this talk about Linked Data, TimBL mentions using Ontology bits for basic inference: "Inference - smarter query" and Ontology bits for Validation and Constraining input: "... mistakes to be spotted... user input menus to be constrained".

Note the words "basic" and "bits".

The seminal paper "The Semantic Web", published in Scientific American in 2001, talked about inference rules in ontologies, for example:
Inference rules in ontologies supply further power. An ontology may express the rule "If a city code is associated with a state code, and an address uses that city code, then that address has the associated state code." 

It seems that TimBL, too, is now in favor of achieving (first) a limited version of the Semantic Web through the Linked Data principles - fewer ontological axioms, less knowledge management, less semantics. Maybe this attitude is aligned with the Rule of Least Power:

Principle: Powerful languages inhibit information reuse.
...
Good Practice: Use the least powerful language suitable for expressing information, constraints or programs on the World Wide Web.

So, using ontological axioms other than those mentioned above is probably not required for creating the Web of Data. This is why their usage is discouraged by the Linked Data book - they provide more power than needed. A more powerful use of ontologies might be labeled the Web of Knowledge, Linked Ontologies or Linked Knowledge, as opposed to the Web of Data and Linked Data. The original Semantic Web vision was probably to create both the Web of Data and the Web of Knowledge (the Web of Ontologies). The goal of the Linked Data paradigm is to achieve the Web of Data only.