Case analysis of key business jitter caused by memory reclaim: on memory QoS guarantees in the cloud-native OS

Jiang Biao, a senior engineer at Tencent Cloud, has focused on operating-system technologies for more than 10 years and is a veteran Linux kernel enthusiast. He is currently responsible for the R&D of Tencent Cloud's native OS and for OS/virtualization performance optimization.

Introduction

Compared with traditional IDC scenarios, workloads in cloud-native scenarios are more complex and diverse, and the vanilla Linux kernel often seems underpowered when facing them. Based on a real case from a Tencent Cloud native scenario, this article walks through a troubleshooting approach for this class of problem, and tries to reveal the underlying logic of the Linux kernel and possible directions for optimization.

Background

On a node hosting a key business container of a Tencent Cloud customer, CPU sys (kernel-mode CPU usage) rose occasionally, causing business jitter that recurred irregularly. The node runs an upstream 3.x kernel.

Phenomenon

Under normal business load, monitoring shows obvious CPU usage spikes of up to 100%; the node load soars, and the business jitters at those moments.

Capture data

Ideas

The fault manifests as high CPU sys, i.e. the CPU keeps running in kernel mode, so the analysis approach is simple: while sys is high, capture the concrete execution context, either stack traces or hotspot profiles.

Difficulty: the fault occurs randomly and lasts only seconds, and because kernel-mode CPU usage surges, conventional troubleshooting tools cannot get scheduled while it recurs and login terminals hang (normal scheduling is disrupted). As a result, routine monitoring (usually at minute granularity) and manual troubleshooting tools cannot capture the scene in time.

Specific operation

Second-level monitoring

By deploying second-level monitoring (based on atop), the system-level context at the moment of the fault can be captured.
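
A minimal sketch of how such always-on second-level collection might look (an assumed setup, not necessarily the exact one used in the case; the atop options below are standard):

```bash
# write raw atop samples at 1-second intervals so a seconds-long spike is not missed
mkdir -p /var/log/atop
atop -w /var/log/atop/atop_1s.raw 1 &

# replay the samples around the fault window afterwards (-b jumps to hh:mm)
atop -r /var/log/atop/atop_1s.raw -b 14:05
```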

From the capture, the following phenomena can be observed:

  1. sys is very high, usr is relatively low
  2. Page reclaim (the PAG line) is being triggered, and very frequently
  3. Processes such as ps show high kernel-mode but low user-mode CPU usage and are in the exit state

At this point we have system-level context: the processes that were running with high CPU usage at the time of the failure, their states, and some system-level statistics. But we still do not know where exactly sys was being spent, so other methods/tools are needed to capture the actual scene.

Fault scene

As mentioned earlier, the "scene" here can be the instantaneous stack information at the time of the fault, or hotspot information. For collecting stacks, the simple approaches that come to mind are:

  1. pstack
  2. cat /proc/<pid>/stack

Of course, both of these methods depend on:

  1. The pid of the process with high CPU usage at the time of the failure
  2. The collecting process being able to run when the fault occurs, i.e. getting scheduled and executed in time

Obviously, neither is practical for the problem at hand.

The most direct way to collect hotspots is the perf tool, which is simple, direct and easy to use. But there are also problems:

  1. The overhead is significant, making always-on (normalized) deployment difficult; and if it is deployed that way, the amount of collected data is huge and hard to analyze
  2. There is no guarantee that the execution can be triggered in time in the event of a failure

perf essentially does periodic sampling through the PMU hardware, using NMIs (on x86), so once sampling has been triggered it is not disturbed by scheduling, interrupts or softirqs. However, the perf command itself has to be launched in process context (from the command line, a program, etc.), so when the fault occurs and kernel-mode CPU usage is high, there is no guarantee that the perf process can be scheduled in time to start sampling.

Therefore, hotspot collection for this problem has to be deployed in advance (normalized deployment). There are two ways to solve (or alleviate) the overhead and data-analysis problems mentioned above:

  1. Lower the perf sampling frequency, typically to 99 samples/s; the impact on the real business is then controllable
  2. Slice the perf data by time period. Combined with the failure time point (window) from cloud monitoring, the relevant data slice can be located precisely and then analyzed in a targeted way.

Specific commands. Collection:

```bash
./perf record -F99 -g -a
```

Analysis:

```bash
# View the "captured on" time in the header, which indicates the end of the collection;
# "time of last sample" is the timestamp of the last sample, in seconds,
# from which the fault window can be traced back
./perf report --header-only

# Index into the data by timestamp
./perf report --time start_tsc,end_tsc
```
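
For normalized (always-on) deployment, one possible sketch is to rotate a fresh data file per fixed window so that the slice covering a fault can be located from the monitoring timestamp; the paths and the 10-minute window below are illustrative assumptions, not the exact setup used in the case:

```bash
mkdir -p /var/log/perf
while true; do
    ts=$(date +%Y%m%d_%H%M%S)
    # 99 Hz sampling, call graphs, all CPUs; each file covers a 10-minute window
    perf record -F 99 -g -a -o /var/log/perf/perf_${ts}.data -- sleep 600
done

# later, inspect only the slice that covers the fault window
# perf report -i /var/log/perf/perf_<ts>.data --header-only
# perf report -i /var/log/perf/perf_<ts>.data --time start_tsc,end_tsc
```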

Following this idea, a fault scene was captured with perf deployed in advance. The hotspot analysis is as follows:

As you can see, the main hot spot is a spinlock in shrink_dentry_list.

Analysis

On-site analysis

From the perf result we locate the hot function dentry_lru_del in the kernel; a quick look at the code:

```c
/* dentry_lru_del(): remove a dentry from the LRU list */
static void dentry_lru_del(struct dentry *dentry)
{
	if (!list_empty(&dentry->d_lru)) {
		spin_lock(&dcache_lru_lock);
		__dentry_lru_del(dentry);
		spin_unlock(&dcache_lru_lock);
	}
}
```

The spinlock used in this function is dcache_lru_lock, which in the 3.x kernel is an oversized (global) lock. All dentries of a single file system sit on the same LRU list (located in the superblock), and almost every operation on that list (dentry_lru_(add|del|prune|move_tail)) needs this lock; moreover, all file systems share the one global lock (3.x kernel code). See the add path:

```c
static void dentry_lru_add(struct dentry *dentry)
{
	if (list_empty(&dentry->d_lru)) {
		/* take the global lock */
		spin_lock(&dcache_lru_lock);
		/* put the dentry on the superblock's LRU list */
		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
		dentry->d_sb->s_nr_dentry_unused++;
		dentry_stat.nr_unused++;
		spin_unlock(&dcache_lru_lock);
	}
}
```

Since dcache_lru_lock is a global lock, the typical scenarios in which it is held include:

  1. File system umount process
  2. rmdir process
  3. Memory reclamation shrink_slab process
  4. Process exit, which cleans up the process's /proc directory entries (proc_flush_task), i.e. the scene captured earlier

Among these, unmounting a file system cleans up all dentries of the corresponding superblock and traverses the entire dentry LRU list. If the number of dentries is large, this alone drives sys up, and because the lock is a spinlock, every other context that contends for dcache_lru_lock also burns kernel-mode CPU, so their sys rises as well. Looking back at the second-level monitoring log, we find that at the time of the fault the system's slab usage was nearly 60 GB, which is huge:

The dentry cache (which lives in the slab) is the likely culprit. The easiest way to confirm how slab objects are distributed is slabtop. On another node of the same business cluster with a similar environment, dentries clearly account for the majority:
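
A minimal sketch of this kind of check (standard slabtop/procfs usage; no case-specific output is reproduced here):

```bash
# one snapshot, sorted by cache size; dentry is expected at the top on such nodes
slabtop -o -s c | head -15

# cross-check with the kernel's own accounting
grep dentry /proc/slabinfo
grep -E 'Slab|SReclaimable' /proc/meminfo
```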

Next, the crash tool can be used online to walk the dentry LRU list of the superblock of the affected file system. The number of unused dentries turns out to be over 200 million.
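
A hedged sketch of that online crash session (it assumes the matching kernel debuginfo is installed so crash can attach to the live kernel; <lru_addr> is a placeholder for the superblock's s_dentry_lru address, found beforehand with the crash commands "mount" and "struct super_block.s_dentry_lru <sb_addr>"):

```bash
cat > /tmp/lru.cmd <<'EOF'
list -H <lru_addr> -o dentry.d_lru
EOF
crash -i /tmp/lru.cmd > /tmp/lru.out     # dump every dentry address on the LRU list
wc -l < /tmp/lru.out                     # roughly the number of unused dentries
```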

On the other hand, the business context logs confirm that for one class of failures a pod had just been deleted; deleting the pod unmounts its overlayfs, which triggers the file system umount and then this phenomenon, matching the captured scene exactly. Furthermore, in an environment with 200 million+ dentries, manually dropping the slab and timing it takes close to 40s, which is also consistent with the observed blocking time.

```bash
time echo 2 > /proc/sys/vm/drop_caches
```

So far we can basically conclude that the direct cause of the sys spike is the excessive number of dentries.

Where did hundreds of millions of dentries come from?

The next question: why are there so many dentries? The direct approach is to find the absolute paths of these dentries and then work backwards from the paths to the business. But how do you analyze 200 million+ dentries?

Two ways:

Method 1: Online analysis

Online analysis with the crash tool (a manual drill). The basic idea (a shell sketch of steps 3-6 follows the list):

  1. Locate the dentry LRU list in the superblock
  2. Dump all node addresses on the list and archive the results
  3. Because there are far too many entries, slice them into separate files in batches so they can be parsed in batches later
  4. Use Vim column editing to prepend the crash command (files) to every address, saving each archive as a batch command file
  5. Run crash -i on each command file and archive the results
  6. Post-process the batch outputs with text tools to count file paths and their frequencies
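
A shell sketch of steps 3-6 (file names are illustrative; sed stands in for the Vim column editing, and crash's "files -d" command resolves a dentry address to its pathname):

```bash
# step 3: slice the archived dentry addresses into manageable chunks
split -l 200000 dentry_addrs.txt chunk_

for f in chunk_*; do
    # step 4: prepend the crash command to every address
    sed 's/^/files -d /' "$f" > "$f.cmd"
    # step 5: batch-execute against the live kernel and archive the output
    crash -i "$f.cmd" > "$f.out"
done

# step 6: count path prefixes to see where the dentries come from
grep -ho '/[^ ]*' chunk_*.out | awk -F/ '{print "/"$2"/"$3}' | sort | uniq -c | sort -rn | head
```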

Example result:

Where:

  1. db refers to the xxxxx_dOeSnotExist_.db files discussed later, which account for the majority
  2. session refers to temporary files created by systemd for each login session

The db file analysis is as follows:

The file name has several obvious characteristics:

  1. The names carry a uniform counter, suggesting they may be generated by a particular container
  2. The name contains the string "dOeSnotExist"
  3. All have the .db suffix

An example of the corresponding absolute path is as follows (used to identify the container):

From the overlayfs id, the corresponding container can then be found (docker inspect) to confirm which business it belongs to.

Method 2: Dynamic tracking

By writing a SystemTap script that traces dentry allocation requests, the offending process can be caught (provided the problem reproduces). An example script:

```
probe kernel.function("d_alloc")
{
    printf("[%d] %s(pid:%d ppid:%d) %s %s\n",
           gettimeofday_ms(), execname(), pid(), ppid(), ppfunc(),
           kernel_string_n($name->name, $name->len));
}
```

Statistics by process dimension:
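
A small post-processing sketch for that per-process statistic (it assumes the script above is saved as d_alloc.stp and that the kernel debuginfo needed by SystemTap is available):

```bash
# aggregate d_alloc events by the allocating process name
stap d_alloc.stp | awk '{split($2, a, "("); cnt[a[1]]++} END {for (p in cnt) print cnt[p], p}' \
    | sort -rn | head
```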

Xxx_dOeSnotExist_.db file analysis

From the paths captured earlier, it can be determined that the file is related to the NSS library (certificates/keys). HTTPS services rely on the underlying NSS cryptographic library, tools such as curl that access web services use it, and NSS has known bugs in this area: bugzilla.mozilla.org/show_bug.cg... bugzilla.redhat.com/show_bug.cg...

The mass access to non-existent paths is NSS probing whether its database sits on a network file system: if accessing a temporary directory is much faster than accessing the database directory, the cache is enabled. The probe loops stat'ing non-existent files within 33ms (up to 10,000 times), and this behavior produces a large number of negative dentries. The bug can be simulated with curl by running the following on a test machine:

```bash
strace -f -e trace=access curl 'https://baidu.com'
```

Workaround: set the environment variable NSS_SDB_USE_CACHE=yes. Fix: upgrade the NSS package inside the pod. At this point the analysis is nearly complete. It looks like a bloody case caused by a bug in an unremarkable user-space component, investigated with equally unremarkable methods, but the analysis that follows is where the real focus lies.
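
A quick way to verify the workaround, reusing the same probe as above (the expectation, not a guaranteed result, is that the lookups of the nonexistent *_dOeSnotExist_.db paths disappear once the cache flag is set):

```bash
NSS_SDB_USE_CACHE=yes strace -f -e trace=access curl 'https://baidu.com' 2>&1 | grep -c dOeSnotExist
```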

Another phenomenon

Recall the dcache_lru_lock contention scenario described earlier. Careful analysis of other second-level captures of sys spikes shows cases with no pod deletion (and therefore no umount), meaning no traversal of the dentry LRU list and, in principle, no repeated holding of dcache_lru_lock, yet sys still spikes.

It can be seen that around the fault the cache shrank by 2 GB+, yet free memory did not increase but actually decreased. This indicates that the business was allocating a large amount of new memory at the time, memory became tight, and reclaim kept running (the scan count rose sharply).

When memory is tight enough to enter direct reclaim, shrink_slab may be invoked, which requires holding dcache_lru_lock (the detailed logic and heuristics are not analyzed here :)). If the reclaim pressure persists, the system may enter direct reclaim repeatedly and concurrently, leading to dcache_lru_lock contention. Meanwhile, in the problematic business scenario a single pod process had 2400+ threads; when they exit in a batch, proc_flush_task is called to release each process's /proc directory entries, which also acquires dcache_lru_lock in batches and concurrently, intensifying the contention and driving sys up.

Both phenomena can now be explained; the second is more complicated than the first because it involves the concurrent reclaim logic under memory pressure.

Solutions and reflections

Direct solution / workaround

Based on the analysis above, the most direct fixes are to upgrade the NSS package in the pod, or to set the environment variable to avoid the probe. But think again: what if NSS had no bug, yet some other component did something similar that generates a large number of dentries, for example running a script like this:

```bash
#!/bin/bash
i=0
while (( i < 1000000 )); do
    if test -e ./$i; then
        echo $i > ./$i
    fi
    ((i++))
done
```

In essence, this keeps producing dentries (slab objects). What can be done about this kind of scenario? A simple mitigation/workaround is to drop the cache/slab periodically; it may cause occasional small performance fluctuations, but it basically contains the problem.
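
A coarse sketch of such a periodic mitigation (the schedule and path are illustrative; dropping reclaimable slab objects during off-peak hours trades a brief cache-refill dip for a bounded dentry count):

```bash
# /etc/cron.d/drop-slab  -- drop reclaimable dentries/inodes nightly at 03:00
0 3 * * * root /bin/sh -c 'echo 2 > /proc/sys/vm/drop_caches'
```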

Lock optimization

The earlier analysis showed that the direct cause of the sys spike is contention on dcache_lru_lock. Is there room to optimize this lock? The answer is yes. Look again at how the lock is used in the 3.x kernel code:

```c
static void dentry_lru_add(struct dentry *dentry)
{
	if (list_empty(&dentry->d_lru)) {
		/* global lock */
		spin_lock(&dcache_lru_lock);
		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
		dentry->d_sb->s_nr_dentry_unused++;
		dentry_stat.nr_unused++;
		spin_unlock(&dcache_lru_lock);
	}
}
```

It is clear that dcache_lru_lock is a global variable, i.e. a single lock shared by all file systems, while the dentry LRU it protects actually lives in each superblock; the scope of the lock does not match the scope of the LRU. In newer kernel versions the lock has indeed been pushed down into the superblock:

```c
static void d_lru_del(struct dentry *dentry)
{
	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
	dentry->d_flags &= ~DCACHE_LRU_LIST;
	this_cpu_dec(nr_dentry_unused);
	if (d_is_negative(dentry))
		this_cpu_dec(nr_dentry_negative);
	/* no separate global lock any more; the list_lru_del primitive
	   uses the per-list lock that comes with list_lru */
	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct list_lru_node *nlru = &lru->node[nid];
	struct mem_cgroup *memcg;
	struct list_lru_one *l;

	/* take the per-lru-list lock */
	spin_lock(&nlru->lock);
	if (list_empty(item)) {
		/* ... */
	}
	spin_unlock(&nlru->lock);
	return false;
}
```

In the new kernel the global lock is gone; the lock built into the list_lru primitive is used instead. Since the list_lru itself lives in the superblock, the lock becomes a per-list (per-superblock) lock. It is still somewhat coarse, but much finer-grained than before.

So the lock has been optimized in newer kernels, although that alone may not completely solve the problem.

Further thoughts 1

Why does accessing non-existent files/directories (the NSS cache probe and the script above) generate dentry cache entries at all? What use is a dentry cache entry for a file or directory that does not exist, and why keep it? On the surface it seems unnecessary, but such dentry cache entries (dcache for short) have a standard name in the kernel: negative dentries.

```
A special form of dcache entry gets created if a process attempts to access a
non-existent file. Such an entry is known as a negative dentry.
```

What are negative dentries actually good for? The main purpose of the dcache is to speed up file lookup in the file system. Imagine an application that always searches for a given file across a list of pre-configured paths (similar to the PATH environment variable), while the file exists in only one of them. With negative dentries, the lookups on the paths that miss are answered from the cache, which speeds up the overall search.
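
A small illustration of that effect (the command name is made up; a PATH lookup for a command that does not exist stats every $PATH directory, and each miss leaves behind a negative dentry that makes the next lookup cheap):

```bash
strace -e trace=file which no_such_cmd_xyz 2>&1 | grep ENOENT | head
```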

Further thoughts 2

Can the number of negative dentries be limited separately? The answer is yes.

The RHEL 7.8 kernel (3.10.0-1127.el7) incorporates a feature, negative-dentry-limit, specifically to cap the number of negative dentries. For a description of the feature, see: access.redhat.com/solutions/4...

For the implementation of the feature, see: lwn.net/Articles/81... The details are not explained here :)

The harsh reality: neither RHEL 8 nor the upstream kernel includes this feature. Why?

See Red Hat's official explanation (which in fact does not really explain it): access.redhat.com/solutions/5...

Take a look at the heated discussion in the community: lore.kernel.org/patchwork/c...

Linus personally stood up against it. The overall tone: the existing cache reclaim mechanism is sufficient (and complex enough), and combined with protections such as memcg's low watermark (cgroup v2 only) it can handle cache reclaim; adding a limit could bring in synchronous reclaim, new blocking, new problems and unnecessary complexity; negative dentries are nothing special compared with ordinary page cache and should not be treated differently (given preferential treatment); and negative dentries are quick to reclaim anyway, and so on.

As a result, the feature still cannot get upstream, however "practical" it appears.

Further thoughts 3

Are there other ways to limit the dcache? Yes. The file system layer provides an unused_dentry_hard_limit parameter that can cap the overall number of dcache entries, with broadly similar control logic. The code is not detailed here; interested readers are welcome to look at it. Unfortunately, the parameter depends on each file system implementing it: in the 3.x kernel only overlayfs does, other file systems do not, so its generality is limited and its actual effect unverified. At this point, does the analysis really seem complete?

Thinking further

Think about it once more: why were so many dentries not reclaimed in time? The case looks like kernel jitter caused by an application (NSS) bug, but on closer inspection it exposes the kernel's own shortcomings in such scenarios. The essential problems are:

  1. Reclaim is not timely
  2. Caches are unlimited

Reclaim is not timely

The kernel caches the dentry of every file (directory) it accesses in the slab (unless specially flagged) to speed up the next access. In the problem environment, slab usage is as high as 60 GB, most of it dentries. The kernel actively reclaims caches (mainly slab and page cache) only when memory becomes tight, that is, when a memory watermark is reached; in the problem environment memory is usually plentiful and actual usage low, with most memory held by caches (slab and page cache). When free memory falls below the low watermark, asynchronous reclaim (kswapd) is triggered; when it falls below the min watermark, synchronous (direct) reclaim is triggered. In other words, dentries are reclaimed only when free memory drops to a watermark, and because the watermarks are usually low, reclaim starts rather late; when the business then issues a burst of memory allocations, memory may end up being reclaimed repeatedly within a short window.

Note: the (global) watermarks are computed by the kernel from the memory size by default, and the upstream defaults are fairly low, which is indeed not always reasonable in container scenarios. Newer kernels add some tuning (the gap between min and low can be adjusted), but it is still not ideal.
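
The relevant knobs can be inspected with standard interfaces (shown as an illustration; vm.watermark_scale_factor is presumably the newer-kernel tunable referred to above and does not exist on 3.x kernels):

```bash
# the base value from which the kernel derives the global min/low/high watermarks
cat /proc/sys/vm/min_free_kbytes

# newer kernels (>= 4.6): a larger value widens the min/low gap so kswapd wakes up earlier
sysctl vm.watermark_scale_factor
```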

Memcg async reclaim. In the cloud-native (container) scenario, the kernel's standard mechanism for timely cache reclaim is asynchronous reclaim by kswapd once the low watermark is reached. But kswapd works at per-node (global) granularity; even after widening the gap between the min and low watermarks (supported in newer kernels), the following shortcomings remain:

  1. The distance parameter is difficult to generalize and control
  2. The global scan is expensive and cumbersome
  3. Single-threaded (per-node) reclaim may still be too slow and not timely

In practice it is also common for the watermarks to be breached because memory cannot be reclaimed in time, causing business jitter. For such scenarios, a memcg async reclaim idea with (fairly primitive) patches was submitted to the community years ago. The basic principle: create a kswapd-like asynchronous reclaim thread for each pod (memcg); once the pod-level async low watermark is reached, per-cgroup asynchronous reclaim is triggered. In theory this solves, or at least improves, such scenarios. After a long discussion, however, the community did not accept it, mainly because of container resource overhead and isolation concerns:

  1. Creating a kernel thread per cgroup means that with many containers the number of reclaim threads grows and the overhead is hard to control.
  2. A later, optimized version removed the per-cgroup reclaim thread and borrowed the kernel's own workqueue instead. Thanks to workqueue pooling, requests can be merged and fewer threads are created, keeping the overhead under control, but an isolation problem follows: the submitted workqueue work cannot be charged to a specific pod (cgroup), which breaks container isolation.

From the maintainers' point of view the rejection is well reasoned, but from a (cloud-native) user's point of view it is another loss, since the practical problem remains unsolved. Although memcg async reclaim was not accepted by the community, a few vendors have kept the corresponding feature in their own branches; typical representatives are Google, and also our TencentOS Server (formerly TLinux). We have not only merged and enhanced the original memcg async reclaim feature, but also integrated it into our cloud-native resource QoS framework as part of the low-level support for guaranteeing the memory quality of service of the business.

Caches are unlimited

Linux tends to use as much free memory as possible for caches (mainly page cache and slab) to improve performance (mostly of file access), which means the caches in a system can grow almost without limit (as long as free memory remains). Many real-world problems follow from this, and the problem in this case is a typical one. With a cache limit capability, such problems could in theory be largely avoided.
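
A simple way to observe this on any node (standard /proc/meminfo fields; the values are whatever the node shows, not case data):

```bash
grep -E 'MemFree|^Cached|^Slab|SReclaimable' /proc/meminfo
```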

Cache limit. The topic of a page cache limit was debated at length in the upstream kernel community many years ago, but it never made it upstream, mainly because it runs against the principle of making the fullest use of memory; although real problems do exist in some scenarios, the community recommends solving them by other means (in the business, or with other kernel mechanisms). Even so, a few vendors keep a page cache limit feature in their own branches; typical representatives are SUSE and our TencentOS Server (formerly TLinux). We have not only merged and enhanced the page cache limit feature with both synchronous and asynchronous reclaim, but also added a slab limit, so that both page cache and slab usage can be capped. This capability has played a key role in many scenarios.

Conclusion

  1. When the following conditions occur together, contention on the dentry-list lock can arise and drive sys high:
    • There is a huge number of dentry cache entries in the system (the containers access a large number of files/directories that keep accumulating)
    • A burst of memory allocation by the business pushes free memory below the watermark and triggers (repeated) memory reclaim
    • Business processes exit and their /proc entries must be cleaned up on exit, which takes the big lock protecting the dentry list and leads to a spinlock race
  2. The user-space NSS bug that produces an excessive dcache is the direct cause of the incident.
  3. Thinking deeper, we find that the upstream kernel has given up many practical features and designs for the sake of generality and architectural elegance, and in cloud-native scenarios it struggles to meet extreme needs; becoming the core base of a cloud-native OS requires deep hacking.
  4. TencentOS Server has done a great deal of deep customization and optimization for massive cloud-native scenarios and can comfortably handle the various challenges (including the problems in this case) posed by complex and extreme cloud-native workloads. In addition, TencentOS Server has designed and implemented a cloud-native resource QoS guarantee feature (RUE), which provides QoS guarantee capabilities for all kinds of key resources to containers of different priorities. Stay tuned for related sharing.

Concluding remarks

In cloud-native scenarios, the upstream kernel struggles to meet the extreme needs of extreme scenarios; becoming the base of a cloud-native OS requires deep hacking, and TencentOS Server is working hard on it!

[Note: Case material is taken from Tencent Cloud Virtualization Team and Cloud Technology Operation Team]

Tencent Kubernetes Engine (TKE) is Tencent Cloud's one-stop cloud-native PaaS platform based on Kubernetes. It provides enterprise-grade services that integrate container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a full suite of monitoring and operations capabilities.