How NetEase Shufan uses Kubernetes "primitives" to get cloud-native middleware

At the recent ArchSummit Global Architect Summit 2021 in Shanghai, following the keynote "Building an Open Cloud-Native Operating System and Systems Software Architecture" by Wang Yuan (NetEase Vice President, Executive Dean of the Hangzhou Research Institute, Chairman of the Internet Technology Committee, and General Manager of NetEase Shufan), Zhang Xiaolong, a member of the NetEase Technical Committee and Director of NetEase Shufan Infrastructure, walked attendees through NetEase Shufan's thinking, implementation, and experience with cloud-native middleware. This article is a transcript of that talk.

Today I will share with you our practice of containerizing middleware for the production environment, in four main parts:

The first part starts from the operations challenges of basic middleware and introduces NetEase's technological evolution path for solving them, and why middleware needs to be containerized.

The second part introduces the requirements of middleware containerization and the overall platform architecture of NetEase Shufan.

The third part gives our thoughts and best practices for some common problems in the process of middleware containerization.

The last part is a summary of our middleware containerization work and future plans.

The challenge of basic middleware

Before container technology emerged, basic middleware such as MySQL, Redis, and Kafka had long been open source and had become standard components of server-side architecture design. For a typical Internet application, the three kinds of middleware, database, cache, and message queue, are indispensable.

It is very simple for architects to build application platforms with these middleware, but operations personnel face major problems in five areas:

  1. Middleware itself is a relatively complex distributed system. Operations staff need to understand how these distributed systems work and write operations scripts suited to them, which is very complex;
  2. Operations efficiency is low. Manually operating fewer than 50 MySQL instances may be fine, but with 500 or 1,000 database instances, or thousands of Redis instances as at NetEase Cloud Music, operating by manual script is inevitably very inefficient;
  3. Stability is insufficient. Operations staff run manual scripts and copy commands around online, and one wrongly copied command can bring the middleware down;
  4. Traditional middleware is deployed on physical machines, and physical machines cannot provide strong resource elasticity;
  5. Nearly all senior middleware operations engineers work at the large Internet companies. Because these operations are so complicated, it is hard for ordinary enterprises to recruit truly professional operators. We believe the best way to resolve this challenge is to turn middleware operations capability into a cloud service.

Making middleware into cloud services has several advantages. First, operations become simple and easy to use. Second, automated operations of a large number of instances can be carried out efficiently. Third, there is a strong SLA guarantee, since few manual commands need to be typed. Fourth, capacity can be expanded quickly with the elastic resources of IaaS. Finally, because operations as a whole become simple, a large team of specialists is no longer needed to maintain middleware for the business.

In fact, public cloud vendors have also seen this trend: the three major domestic public clouds have all turned open-source basic middleware into cloud services. I think there are two main reasons. First, competition at the IaaS resource level tends toward homogeneity, and turning PaaS middleware into cloud services consumes more resources and binds users more deeply. Second, middleware as a cloud service is a value-added service whose gross margin is much higher than that of cloud hosts and cloud disks, which is also why many public cloud users shun RDS and build MySQL themselves on cloud hosts.

To tackle the complexity of middleware operations, NetEase developed a cloud-based middleware platform six or seven years ago. This platform had two technical characteristics. First, it provided resource elasticity based on IaaS: the computing resources middleware ran on were cloud hosts, the storage resources were cloud disks, and the network resources might live in the tenant's VPC.

Second, it adopted the tenant isolation strategy of IaaS. When a tenant wanted a middleware instance, the platform automatically set it up on that tenant's cloud hosts and cloud disks, achieving good isolation between tenants.

At that time we developed six basic middleware cloud services. A business team that needed middleware for its product only had to access these cloud services rather than build its own. What we mainly built was the control and management part on the left, such as instance high availability, deployment and installation, and instance management. We achieved real results: the platform greatly improved the operations team's ability to run middleware.

Over time, the first generation of basic middleware exposed three major flaws that were difficult to fix. The first is the lack of extreme performance: because it uses KVM virtual machines as computing resources, there is a very large performance loss compared with running on physical machines, and it cannot meet the demanding performance and stability requirements of middleware under high business load and pressure.

The second is that resource costs are too high, because resource orchestration is based on OpenStack, and the strong isolation of KVM virtualization prevents memory from being shared among middleware instances. These two factors make the deployment density of middleware instances on virtual machines very low: even when a tenant's middleware load is not high, the memory cannot be released, because KVM isolates it strongly.

The third is that delivery is very inflexible. The platform is tied to NetEase's IaaS, so there was no way for it to support our later commercialization and delivery to companies outside NetEase, whose infrastructure may be on a public cloud or in their own IDC machine rooms.

Thinking about middleware containerization

In recent years, container technologies such as Docker and Kubernetes were born and developed rapidly, and the containerization of stateless applications has matured. As a new infrastructure technology that has been widely adopted, containers map neatly onto the flaws of the first generation of basic middleware: weak isolation helps resource sharing; lightweight virtualization eliminates the performance loss and can stand up to heavy business loads; standardized image-based packaging makes delivery efficient; scheduling is powerful and flexible; and, most critically, containers are the cornerstone of the entire cloud-native technology stack.

The most important property of Kubernetes orchestration is that it is loosely coupled with the infrastructure, allowing us to move applications anywhere; it was designed for hybrid clouds. It was also designed for large-scale production environments, inheriting Google's experience running production at scale, so container technology is a promising way to solve the problem of turning middleware into services.

NetEase internally built a cloud-native operating system based on Kubernetes, which adapts to various infrastructure resources below and serves as a unified provider for various application workloads above; this is also one of Kubernetes' design goals. Middleware is exactly one type of workload supported by this cloud-native operating system, so from this perspective middleware containerization follows logically.

Middleware containerization must solve middleware's operations problems; in particular, the following requirements must be considered.

First, life cycle management: we need a containerized middleware platform that helps operations staff perform all the usual operations at the middleware instance level. NetEase Shufan implements this on the Kubernetes Operator framework.

The second point is high-availability deployment. Middleware in pursuit of higher availability is often deployed across multiple machine rooms, and the instances of a middleware cluster should be spread across those machine rooms in specified proportions. The standard Kubernetes scheduler cannot do this, so we need to extend it to achieve such placement; a sketch of one way to do that follows.
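One common way to extend placement logic without forking the scheduler is a scheduler extender, an HTTP webhook the scheduler consults during node filtering. The sketch below shows the shape of such an extender with a deliberately simplified zone rule; the annotation name is hypothetical and this is not NetEase Shufan's actual implementation.

```go
// Minimal sketch of a scheduler extender that filters nodes so middleware
// replicas spread across machine rooms (zones). Real logic would look up the
// placement of the cluster's existing replicas; here a hypothetical Pod
// annotation simply names the zone this replica must land in.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	corev1 "k8s.io/api/core/v1"
)

// Minimal mirrors of the scheduler extender JSON protocol.
type ExtenderArgs struct {
	Pod   corev1.Pod       `json:"pod"`
	Nodes *corev1.NodeList `json:"nodes,omitempty"`
}

type ExtenderFilterResult struct {
	Nodes       *corev1.NodeList  `json:"nodes,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func filter(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Hypothetical annotation carrying the zone this replica still needs.
	wanted := args.Pod.Annotations["middleware.example.com/target-zone"]
	result := ExtenderFilterResult{Nodes: &corev1.NodeList{}, FailedNodes: map[string]string{}}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			// "topology.kubernetes.io/zone" is the standard well-known zone label.
			if wanted == "" || node.Labels["topology.kubernetes.io/zone"] == wanted {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "replica must be placed in zone " + wanted
			}
		}
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	// The scheduler is pointed at this endpoint via its extender config.
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```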

At the same time, monitoring and alerting metrics need to be improved; these metrics plug into the cloud-native Prometheus observability stack.

Performance was a pain point of the first generation. We need containerized middleware to essentially match the performance of physical-machine deployment so it can support core applications, which requires targeted performance optimization for each kind of middleware instance.

Another point is productization. We hope middleware containerization can be used not only inside NetEase but also as commercial output, so we take the product forms of RDS and Redis on the public clouds as references. For infrastructure, low cost, and flexible delivery, we must adopt a loosely coupled, highly reusable architecture.

NetEase Shufan chose the Kubernetes Operator mechanism. Viewed deeply, Kubernetes provides the "primitives" required to deploy and operate a distributed system: built-in objects such as Pod, Node, Deployment, and StatefulSet were all designed for a typical stateless distributed system. These built-in objects cooperate to make deploying and operating stateless applications very efficient.

But these built-in objects cannot directly solve the deployment and operation of middleware. First, middleware is stateful; its state lives in storage and possibly in network IPs. Second, a middleware instance differs from a stateless application instance: the replicas of the latter have no relationship with each other, while middleware instances and replicas are related, must reach each other, and form complex topologies among components, for example the master-slave relationship between two Redis replicas during failure recovery.

More than two years ago the community also began implementing middleware and other stateful applications this way, and proposed an Operator development framework. If we understand Kubernetes as an operating system, then Operator is the framework for developing native applications on that operating system, supporting more efficient, automated, and scalable development; a sketch of what such a custom resource might look like is shown below.
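To make this concrete, here is a minimal sketch of a custom resource type for a middleware cluster, written in the kubebuilder style; the RedisCluster type and its fields are hypothetical illustrations, not NetEase Shufan's actual definitions.

```go
// A minimal, hypothetical custom resource for a Redis cluster, written in
// the kubebuilder style. Field names are illustrative only.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RedisClusterSpec declares the desired state; the Operator's job is to
// drive the real cluster toward these values.
type RedisClusterSpec struct {
	// Number of Redis instances (master plus replicas).
	Replicas int32 `json:"replicas"`
	// Container image version, e.g. "6.2.6".
	Version string `json:"version"`
	// Persistent volume size requested per instance, e.g. "10Gi".
	StorageSize string `json:"storageSize"`
}

// RedisClusterStatus records the observed state, filled in by the controller.
type RedisClusterStatus struct {
	ReadyReplicas int32  `json:"readyReplicas"`
	Phase         string `json:"phase"`
}

// RedisCluster is the top-level object stored by the API server in etcd.
type RedisCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   RedisClusterSpec   `json:"spec,omitempty"`
	Status RedisClusterStatus `json:"status,omitempty"`
}
```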

Operators have four characteristics. The first is that an Operator has to be developed: it follows a declarative programming philosophy, with object definitions and a controller deployment. An Operator is really a controller, following the closed decision loop of observe, analyze, act. If the user declares four resources, the Operator analyzes the inconsistencies between the current state of those four resources and the target state.

As the picture shows, the current state has one Pod at version 0.0.1, while the declared state requires version 0.0.2 and one more Pod. Facing this inconsistency, the Operator takes Actions: add another Pod and upgrade both to 0.0.2. Implementing an Operator is really about writing how these Actions should be performed. This encapsulates the operations knowledge and experience of a specific domain and can be designed to manage complex stateful applications; a minimal sketch of such a reconcile loop follows.
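Here is a minimal sketch of the observe/analyze/act loop, using the community controller-runtime library and the hypothetical RedisCluster type above; error handling and most real operations logic are trimmed.

```go
// Sketch of an observe/analyze/act reconcile loop with controller-runtime.
// Names are illustrative; a production Operator does far more (failover,
// topology repair, backups, and so on).
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/middleware/api/v1alpha1" // hypothetical package from the sketch above
)

type RedisClusterReconciler struct {
	client.Client
}

func (r *RedisClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: fetch the declared (desired) state.
	var rc v1alpha1.RedisCluster
	if err := r.Get(ctx, req.NamespacedName, &rc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Observe: fetch the current state, the StatefulSet backing the cluster.
	var sts appsv1.StatefulSet
	if err := r.Get(ctx, req.NamespacedName, &sts); err != nil {
		if apierrors.IsNotFound(err) {
			// Act: nothing exists yet, create the backing StatefulSet.
			return ctrl.Result{}, r.Create(ctx, newStatefulSet(&rc))
		}
		return ctrl.Result{}, err
	}

	// Analyze: compare declared state with observed state, then act.
	wantImage := "redis:" + rc.Spec.Version
	if *sts.Spec.Replicas != rc.Spec.Replicas ||
		sts.Spec.Template.Spec.Containers[0].Image != wantImage {
		sts.Spec.Replicas = &rc.Spec.Replicas
		sts.Spec.Template.Spec.Containers[0].Image = wantImage
		return ctrl.Result{}, r.Update(ctx, &sts)
	}
	return ctrl.Result{}, nil
}

// newStatefulSet renders a minimal StatefulSet for the cluster; a real
// Operator would also set storage, probes, affinity, and so on.
func newStatefulSet(rc *v1alpha1.RedisCluster) *appsv1.StatefulSet {
	labels := map[string]string{"app": rc.Name}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: rc.Name, Namespace: rc.Namespace},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &rc.Spec.Replicas,
			ServiceName: rc.Name, // headless Service gives each Pod a stable DNS name
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{Containers: []corev1.Container{{
					Name:  "redis",
					Image: "redis:" + rc.Spec.Version,
				}}},
			},
		},
	}
}
```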

The Operator development framework consists of three parts. The first is operator-sdk, a development scaffold; the second is operator-lifecycle-manager, a lifecycle management component; the third is operatorhub.io. Since anyone can develop an application that can be deployed, installed, and operated this way, there should be a marketplace to publish it in, and operatorhub.io is that marketplace.

Operators developed by different organizations sit at different maturity levels from an operations perspective. The highest level is fully automated operations of the deployed application, which is the level operations teams want most. The most basic level is basic installation: reimplementing the original install and deployment scripts in the Operator engineering model.

This is the middleware platform architecture NetEase Shufan built on the Kubernetes Operator, comprising a control plane and a data plane. The control plane on the left provides the operations and management capabilities, including common components that are unrelated to any particular middleware but needed by everyone, such as auditing, authentication and permissions, and the console.

In the middle are the middleware Operators; we use the Operator mechanism to develop middleware such as Redis, Kafka, and MySQL.

These Operators implement the life cycle management of the middleware. The Operators themselves also run on Kubernetes, as stateless applications deployed in Deployment mode, because their state is stored in etcd.

Below that is the Kubernetes control plane, the components required by the Master nodes.

The bottom part is the logging, monitoring, and alerting components. A log management platform we developed handles everything from collection configuration, which can be updated dynamically, to the actual collection of logs.

On the right is the data plane of the middleware. I have drawn three Nodes. We implement a middleware cluster with a StatefulSet; each instance runs in a Pod, and each Pod may claim persistent volumes. There are topological relationships between Pods and Nodes, and the instances synchronize data and topology with each other for state changes and failure recovery. Each node runs two Kubernetes components, kubelet and kube-proxy, plus a collector for logs and monitoring.

We have also implemented volume attachment for Pods through StorageClass, whether the disk is local or remote, which is also the Kubernetes standard; a sketch of this StatefulSet-plus-claim shape follows.
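As an illustration of the per-instance storage claim, here is a sketch of the volumeClaimTemplates piece that would plug into a StatefulSet like the one built earlier; the StorageClass name "local-lvm" and the claim name are placeholders.

```go
// Sketch: volumeClaimTemplates stamp out one PVC per Pod, and each PVC
// follows its Pod's stable StatefulSet identity across restarts. The
// StorageClass name "local-lvm" is a placeholder; swapping it is how local
// versus remote provisioning is selected.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func dataVolumeClaims(size string) []corev1.PersistentVolumeClaim {
	storageClass := "local-lvm" // placeholder StorageClass
	return []corev1.PersistentVolumeClaim{{
		ObjectMeta: metav1.ObjectMeta{Name: "data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse(size),
				},
			},
		},
	}}
}

// Usage: sts.Spec.VolumeClaimTemplates = dataVolumeClaims("10Gi")
```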

Common problems and solutions of middleware containerization

Next, the solutions to some common problems in middleware containerization. The biggest characteristic of middleware is that it is stateful, while Kubernetes by itself orchestrates only compute. Middleware state can be stored in two ways: remote storage or local storage.

We believe remote storage is the best practice. If your private cloud environment has remote distributed storage similar to open-source Ceph, you should not hesitate to use it. If Ceph's performance is insufficient, find better distributed storage and use that directly. On a public cloud, you should not hesitate to use cloud disks as middleware storage.

In many cases local storage is a last resort, used when there is no sufficiently reliable distributed storage: perhaps the distributed storage performs poorly, far worse than a local disk, or its reliability is poor and data would be lost.

To this end we implemented local storage access, with two requirements. One is dynamic management when a Pod applies for a PVC: when local volumes are created or deleted, the corresponding configuration must follow. The other is that scheduling must strongly bind the Pod to its local disk: since the Pod's data lands on a local disk of a certain Node at creation, we must ensure the Pod still runs on that Node after failure recovery or rescheduling, so that the middleware data is not lost.

In the technical implementation we introduced LVM to dynamically manage the local disks on each node, and adopted Kubernetes Local PV. The drawback of the latter is that it requires operations staff to create PVs on the nodes in advance, which is not acceptable. So we did two things. One is extending the scheduler to prepare local storage resources: when a Pod is created, it declares the size of the local disk it needs, and the volume is created dynamically and mounted into the Pod, without operations staff preparing it manually beforehand.

In the Pod scheduling flow in the picture, the user creates a Pod that declares a PVC. Our local-storage scheduler extension first does a pre-scheduling pass, computing the local disk capacity on each node, and writes the chosen Node's information into the PVC. It then notifies the local storage resource preparer on that Node; on receiving the request, the preparer calls LVM to create the storage and creates the corresponding PV. The preparer binds the PV and PVC, then tells the scheduler that the Pod can be scheduled to this node, because the declared local storage is ready. Finally Kubernetes mounts the node's local disk into the Pod, completing the scheduling; a sketch of the preparer's PV-publishing step follows.
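Here is a sketch of that final preparer step under these assumptions: once LVM has created the logical volume, a local PersistentVolume is published whose node affinity pins any consuming Pod back to the same node. Names, paths, and sizes are illustrative.

```go
// Sketch of the "resource preparer" step: after an LVM logical volume has
// been created on this node, publish a matching local PersistentVolume whose
// node affinity pins consumers to this node. Names are illustrative.
package preparer

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func localPVForVolume(pvName, nodeName, devicePath, size string) *corev1.PersistentVolume {
	volumeMode := corev1.PersistentVolumeFilesystem
	storageClass := "local-lvm" // placeholder; must match the pending PVC

	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: pvName},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse(size),
			},
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: storageClass,
			VolumeMode:       &volumeMode,
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				Local: &corev1.LocalVolumeSource{Path: devicePath},
			},
			// This affinity is what makes rescheduling "sticky": a Pod bound
			// to this PV can only ever land back on the same node.
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{nodeName},
						}},
					}},
				},
			},
		},
	}
}

// The preparer would then create this object via the API server and bind it
// to the pending PVC before signalling the scheduler.
```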

For the network of containerized middleware, there are two scenarios. In the first, the middleware runs on different infrastructures with corresponding network configurations: on a physical network you can use solutions such as Calico or Flannel, using their CNI directly; on a public cloud you connect to the cloud's VPC network, and the advantage is that every public cloud provides a standard CNI for Kubernetes, so Kubernetes running on cloud hosts can use their networks.

In the second scenario we need to optimize network performance, so we introduced a container SR-IOV solution, whose advantage is latency even lower than a physical machine's. Built on NIC pass-through, it cuts latency by 50% and meets the needs of ultra-high-performance tasks with strict latency requirements, though PPS cannot be improved. Pass-through removes the virtualization overhead of network transmission, but the disadvantages are also obvious: the solution works only on physical networks, because it depends entirely on the hardware NIC and cannot be used for network acceleration on public clouds.

In a physical network environment we must also handle heterogeneous NICs: we may have Intel NICs alongside Mellanox NICs, and VFs (Virtual Functions, an SR-IOV concept) need fine-grained management. We treat VFs as an extended scheduling resource, discovering and registering each node's VF resources through the standard Kubernetes Device Plugin; combined with labels and taints, the native scheduler can manage and allocate them, as the sketch below shows.
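Once a device plugin registers VFs as an extended resource, consuming one is just another resource request in the Pod spec; the resource name and node label below are hypothetical stand-ins for whatever the device plugin actually registers.

```go
// Sketch: a Pod requesting one SR-IOV VF as an extended resource, so the
// native scheduler only places it on a node with a free VF. The resource
// name "example.com/sriov-vf" and the node label are hypothetical.
package examples

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func podWithVF() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "redis-0"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "redis",
				Image: "redis:6.2.6",
				Resources: corev1.ResourceRequirements{
					// Extended resources must appear in limits; requests, if
					// set at all, must equal limits.
					Limits: corev1.ResourceList{
						"example.com/sriov-vf": resource.MustParse("1"),
					},
				},
			}},
			// Steer onto SR-IOV capable nodes, e.g. labeled by NIC vendor.
			NodeSelector: map[string]string{"network/sriov": "true"},
		},
	}
}
```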

A Qingzhou middleware cluster is abstracted as a StatefulSet, each instance being one of its Pods. A StatefulSet can only keep the Pod's name unchanged: across updates, or crash and recovery, the name stays the same, but the Pod's IP does not. In the eyes of traditional middleware operators, however, an IP on a physical-machine deployment never changes, and a machine keeps its IP after a restart, so their operating habits favor IPs over domain names.

To let containerized middleware spread faster while accommodating existing applications, we built a sticky-IP feature for StatefulSets, implemented by introducing a global container address pool component that takes over Pod IP allocation. When a StatefulSet is created, the IPs assigned to it are recorded; even if a Pod is deleted during an update, its IP is not released, and when the Pod is rebuilt with the same name, the same IP is assigned back to it. A toy sketch of the idea follows.
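The idea can be shown with a toy in-memory allocator keyed by the Pod's stable StatefulSet name; a real implementation would sit behind a CNI IPAM plugin and persist its state (for example in etcd).

```go
// Toy sketch of "sticky IP" allocation keyed by the stable StatefulSet Pod
// name. In-memory only; a real IPAM plugin persists reservations.
package main

import (
	"fmt"
	"sync"
)

type StickyIPPool struct {
	mu       sync.Mutex
	free     []string          // available addresses
	assigned map[string]string // pod name -> IP, kept across Pod rebuilds
}

func NewStickyIPPool(addrs []string) *StickyIPPool {
	return &StickyIPPool{free: addrs, assigned: map[string]string{}}
}

// Allocate returns the previously reserved IP if this Pod name has one,
// otherwise takes a fresh address from the pool.
func (p *StickyIPPool) Allocate(podName string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ip, ok := p.assigned[podName]; ok {
		return ip, nil // Pod rebuilt with the same name: same IP back
	}
	if len(p.free) == 0 {
		return "", fmt.Errorf("address pool exhausted")
	}
	ip := p.free[0]
	p.free = p.free[1:]
	p.assigned[podName] = ip
	return ip, nil
}

// Release is only called when the StatefulSet itself is deleted; deleting a
// single Pod (e.g. during a rolling update) keeps its reservation.
func (p *StickyIPPool) Release(podName string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ip, ok := p.assigned[podName]; ok {
		delete(p.assigned, podName)
		p.free = append(p.free, ip)
	}
}

func main() {
	pool := NewStickyIPPool([]string{"10.0.0.10", "10.0.0.11"})
	ip1, _ := pool.Allocate("redis-0")
	// "redis-0" is deleted and recreated during an update...
	ip2, _ := pool.Allocate("redis-0")
	fmt.Println(ip1 == ip2) // true: the IP survived the rebuild
}
```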

In engineering terms, developing containerized middleware costs far less R&D than the first-generation virtualization-based middleware, because we reuse Kubernetes' built-in concepts and its operations and control mechanisms; for the same basic middleware, the code is much smaller than in the first generation. This reduction has a price, though: developers must understand the Kubernetes Operator development framework well, and only by understanding Kubernetes declarative programming can they write it.

For quality assurance we did two things. The first is chaos testing, i.e., fault testing: based on the open-source ChaosBlade, we simulate the impact of Kubernetes resource failures on middleware services. We also use the Kubernetes e2e testing framework so that operations staff can verify that the life cycle operations of each middleware behave correctly.

Another point: managing middleware instance life cycles requires monitoring and alerting, and in many cases the consoles have much in common and are used the same way. We designed a front-end page-rendering engine with a dynamic form mechanism, so consoles can be developed quickly and the backend can implement new console business simply through configuration, which keeps R&D costs down.

For performance optimization, we adopted several strategies so that containerized middleware performs essentially on par with running on a physical machine. On the CPU we enable performance mode to reduce wake-up latency. For memory we turn off swap and transparent huge pages, and tune the synchronous dirty-page write-back thresholds. These are all parameter-level tunings.

For I/O, we enable the kernel's blk-mq and increase the read-ahead cache. Another important item is NIC interrupts: we isolate the handling of the physical NIC's interrupts, and of the containers' veth virtual NIC interrupts, onto dedicated CPUs, to keep system performance free of jitter. A sketch of these node-level knobs follows.
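These knobs are normally set by writing the usual procfs/sysfs files; the sketch below shows that shape. The concrete values, device name, and IRQ number are illustrative, and blk-mq is typically enabled via a kernel boot parameter rather than at runtime.

```go
// Sketch of node-level tuning applied through procfs/sysfs. Values are
// illustrative; real settings depend on workload and hardware.
package main

import (
	"log"
	"os"
)

func set(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0644); err != nil {
		log.Printf("skip %s: %v", path, err)
	}
}

func main() {
	// CPU: the performance governor reduces frequency-scaling wake-up latency
	// (shown for cpu0 only; production code iterates over all CPUs).
	set("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "performance")

	// Memory: swap is disabled separately (swapoff -a); discourage it anyway,
	// and turn off transparent huge pages.
	set("/proc/sys/vm/swappiness", "0")
	set("/sys/kernel/mm/transparent_hugepage/enabled", "never")

	// Dirty-page write-back thresholds (illustrative): flush earlier and in
	// smaller batches to avoid long synchronous write-back stalls.
	set("/proc/sys/vm/dirty_background_ratio", "5")
	set("/proc/sys/vm/dirty_ratio", "10")

	// I/O: larger read-ahead for the data device ("sda" is a placeholder).
	set("/sys/block/sda/queue/read_ahead_kb", "4096")

	// IRQ affinity: pin a NIC interrupt (IRQ number is a placeholder) to a
	// dedicated CPU mask so interrupt handling does not jitter the workload.
	set("/proc/irq/120/smp_affinity", "1")
}
```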

NUMA is another optimization point, more visible under high load. We make container placement aware of the NUMA topology, allocate Pods to local NUMA nodes as much as possible, and try not to let a Pod span NUMA nodes, avoiding the relatively large cost of cross-NUMA CPU cache access.

One flaw of the first-generation middleware was that it could not be delivered externally. Last year we released a containerized middleware product called Qingzhou Middleware, with the standard capabilities of basic middleware. At the access layer we added extra capabilities: because we are based on Kubernetes, operations staff can even operate middleware with kubectl and YAML files. At the middleware service layer we implemented seven basic middleware services, which essentially have all the core operations capabilities described earlier.

Overall, the middleware is based on Operators and can run on any Kubernetes cluster; the underlying resources do not matter. A public cloud's virtual machines can serve as Kubernetes Nodes and its cloud disks as Kubernetes storage. We also allow middleware that the community develops with Operators to run on our platform.

Future outlook

Technology serves the business. The biggest pain point of middleware is operations, which must be solved with managed cloud services, and the advantages of container technology make containerization the best practice for delivering middleware as cloud services. Implementation needs Operators, and containerized middleware should be developed in a more cloud-native way, which of course also demands a lot from developers.

There are two points in our future plan. First, our containerized middleware platform can run on any Kubernetes, but we still want to run on Kubernetes distributions such as OpenShift and Rancher, and we hope these containerized middleware Operators can run there too, which requires some compatibility work. Second, we want to build the cloud-native operating system as a whole. Middleware is just one of its workloads, so why not co-locate middleware workloads with stateless application workloads? That would bring the company higher resource utilization and lower costs.

Thank you all!