In-depth explanation of the sending and receiving of linux network packets



Friendly reminder: this article is fairly long, so feel free to bookmark it and come back to it later!

Today we will take an in-depth look at how Linux sends and receives network packets. Let's start with a simple UDP socket example:

int main() {
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
}

The above code is the receive logic of a UDP server. From a development perspective, as long as a client sends the corresponding data, the server can receive it after executing recvfrom and print it out. What we want to know is: from the moment the network packet reaches the network card until our recvfrom returns the data, what happens in between? (For reference, a minimal client that could drive this server is sketched below.)
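
The following is a minimal sketch of such a client, not taken from the original article; the address 127.0.0.1 and port 8888 are arbitrary assumptions for illustration, and error handling is omitted.

// Hypothetical minimal UDP client for testing the server above.
// The server address/port (127.0.0.1:8888) are assumptions for illustration.
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(8888);                  /* assumed server port */
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    const char *msg = "hello";
    sendto(fd, msg, strlen(msg), 0,
           (struct sockaddr *)&server, sizeof(server));

    close(fd);
    return 0;
}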

Through this article, you will have an in-depth understanding of how the Linux network system is implemented internally and how the various parts interact. I believe this will be of great help to your work. This article is based on Linux 3.10, and the source code can be found in mirrors.edge.kernel.org/pub/linux/k...

1. Overview of Linux network packet collection

In the TCP/IP network layered model, the entire protocol stack is divided into the physical layer, link layer, network layer, transport layer and application layer. The physical layer corresponds to the network card and network cable, and the application layer corresponds to our common Nginx, FTP and other applications. Linux implements three layers: the link layer, the network layer, and the transport layer.

In the Linux kernel implementation, the link layer protocol is implemented by the network card driver, and the kernel protocol stack implements the network layer and the transport layer. The kernel provides a socket interface to the upper application layer for user processes to access. The layered model of the TCP/IP network we see from the perspective of Linux should look like this.

Figure 1 Network protocol stack from the Linux perspective

In the Linux source code, the logic for network device drivers lives in drivers/net/ethernet, and the drivers for Intel series network cards are in the drivers/net/ethernet/intel directory. The protocol stack module code is located in the kernel and net directories.

The kernel and network device drivers communicate by means of interrupts. When data arrives at a device, a voltage change is triggered on the relevant CPU pins to notify the CPU to process the data. For the network module, the processing is relatively complex and time-consuming; if all of it were done in the interrupt handler, that handler (which runs at high priority) would monopolize the CPU, and the CPU would be unable to respond to other devices, such as mouse and keyboard events. Therefore, Linux splits interrupt handling into an upper half and a lower half. The upper half does only the simplest necessary work, finishes quickly, and releases the CPU so that other interrupts can come in. Most of the remaining work is deferred to the lower half, which can be handled calmly later. Since kernel 2.4, the lower half has been implemented as soft interrupts, handled by the ksoftirqd kernel threads. Unlike a hard interrupt, which works by applying a voltage change to the CPU's physical pins, a soft interrupt notifies the soft-interrupt handler simply by setting a bit in a variable in memory.

Well, after roughly understanding the network card driver, hard interrupts, soft interrupts, and the ksoftirqd threads, we can sketch, on the basis of these concepts, the path a packet takes when the kernel receives it:

Figure 2 Overview of Linux kernel network packet receiving

When data arrives at the network card, the first working module in Linux is the network driver:

  1. The network driver writes the frames received by the NIC into memory via DMA, then raises an interrupt to notify the CPU that data has arrived.

  2. When the CPU receives the interrupt request, it calls the interrupt handler registered by the network driver. The NIC's interrupt handler does very little: it issues a soft interrupt request and then releases the CPU as soon as possible.

  3. ksoftirqd detects the soft interrupt request and calls poll to start polling for packets; received packets are then processed by the protocol stack layer by layer. UDP packets end up in the receive queue of the corresponding user socket.

From the above picture, we have grasped the process of processing data packets by Linux as a whole. But if we want to know more details about the work of the network module, we have to look down.

2. Linux boot

The Linux driver, kernel protocol stack, and other modules have to do a lot of preparation before they are ready to receive packets from the NIC. For example, the ksoftirqd kernel threads must be created in advance, the handler for each protocol must be registered, the network device subsystem must be initialized, and the network card must be started. Only after all of this is ready can we actually start receiving data packets. So let's now take a look at how these preparations are done.

2.1 Create ksoftirqd kernel thread

Linux soft interrupts are all carried out in a dedicated kernel thread (ksoftirqd), so it is very necessary for us to see how these processes are initialized, so that we can understand the packet receiving process more accurately later. The number of processes is not 1, but N, where N is equal to the number of cores of your machine.

When the system initializes, smpboot_register_percpu_thread is called in kernel/smpboot.c, and execution reaches spawn_ksoftirqd (located in kernel/softirq.c) to create the softirqd threads.

Figure 3 Create ksoftirqd kernel thread

The relevant code is as follows:

//file: kernel/softirq.c

static struct smp_hotplug_thread softirq_threads = {
    .store              = &ksoftirqd,
    .thread_should_run  = ksoftirqd_should_run,
    .thread_fn          = run_ksoftirqd,
    .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
    register_cpu_notifier(&cpu_nfb);
    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

When ksoftirqd is created, it enters its thread loop functions ksoftirqd_should_run and run_ksoftirqd, constantly checking whether there are soft interrupts that need to be processed. Note that soft interrupts are not only network soft interrupts; there are other types as well:

//file: include/linux/interrupt.h

enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
};

2.2 Initialization of the network subsystem

Figure 4 Network subsystem initialization

The Linux kernel initializes each subsystem by calling subsys_initcall; you can grep many invocations of this function in the source tree. Here we are concerned with the initialization of the network subsystem, which executes the net_dev_init function.

//file: net/core/dev.c

static int __init net_dev_init(void)
{
    ......
    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}
subsys_initcall(net_dev_init);

In this function, a softnet_data structure is allocated for each CPU. The poll_list in this structure is where drivers later register their poll functions; we will see that happen when the network card driver is initialized.
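
For reference, the per-CPU softnet_data structure, abridged here to the fields relevant to this article (see include/linux/netdevice.h in 3.10 for the full definition), looks roughly like this:

//file: include/linux/netdevice.h (abridged)
struct softnet_data {
    struct Qdisc        *output_queue;      /* qdiscs with packets to transmit */
    struct list_head    poll_list;          /* napi_structs waiting to be polled */
    struct sk_buff      *completion_queue;  /* skbs waiting to be freed */
    struct sk_buff_head process_queue;
    ......
    struct sk_buff_head input_pkt_queue;    /* used by the non-NAPI / RPS path */
    struct napi_struct  backlog;
};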

In addition, open_softirq registers a handler for each soft interrupt type: net_tx_action for NET_TX_SOFTIRQ and net_rx_action for NET_RX_SOFTIRQ. Tracking open_softirq further, we find that this registration is recorded in the softirq_vec variable. When the ksoftirqd thread later receives a soft interrupt, it also uses this variable to find the corresponding handler for each soft interrupt.

//file: kernel/softirq.c

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

2.3 Protocol stack registration

The kernel implements the ip protocol of the network layer, as well as the tcp protocol and udp protocol of the transport layer. The corresponding implementation functions of these protocols are ip_rcv(), tcp_v4_rcv() and udp_rcv(). Unlike the way we usually write code, the kernel is implemented through registration. The fs_initcall in the Linux kernel is similar to subsys_initcall, and it is also the entry point of the initialization module. After fs_initcall calls inet_init, the network protocol stack registration starts. Through inet_init, these functions are registered in the inet_protos and ptype_base data structures. As shown below:

Figure 5 AF_INET protocol stack registration

The relevant code is as follows:

//file: net/ipv4/af_inet.c

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static const struct net_protocol udp_protocol = {
    .handler     = udp_rcv,
    .err_handler = udp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static int __init inet_init(void)
{
    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);
}

In the above code, we can see that the handler in the udp_protocol structure is udp_rcv, and the handler in the tcp_protocol structure is tcp_v4_rcv, which is initialized through inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
{
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
               protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
                    NULL, prot) ? 0 : -1;
}

inet_add_protocol registers the handlers for tcp and udp in the inet_protos array. Now look at the dev_add_pack(&ip_packet_type) line: in the ip_packet_type structure, type is the protocol identifier (ETH_P_IP) and func is the ip_rcv function; dev_add_pack registers it into the ptype_base hash table.

//file: net/core/dev.c

void dev_add_pack(struct packet_type *pt)
{
    struct list_head *head = ptype_head(pt);
    ......
}

static inline struct list_head *ptype_head(const struct packet_type *pt)
{
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

Here we need to remember that inet_protos records the processing function address of udp and tcp, and ptype_base stores the processing address of the ip_rcv() function. Later we will see that in the soft interrupt, the ip_rcv function address will be found through ptype_base, and then the ip packet will be correctly sent to ip_rcv() for execution. In ip_rcv, the tcp or udp processing function will be found through inet_protos, and then the packet will be forwarded to the udp_rcv() or tcp_v4_rcv() function.

To expand, if you look at the code of functions such as ip_rcv and udp_rcv, you can see the processing of many protocols. For example, ip_rcv will handle netfilter and iptable filtering. If you have many or very complex netfilter or iptables rules, these rules are executed in the context of soft interrupts, which will increase network latency. For another example, udp_rcv will determine whether the socket receiving queue is full. The corresponding kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, I suggest you read the code of inet_init.

2.4 NIC driver initialization

Every driver (not just the network card driver) will use module_init to register an initialization function with the kernel. When the driver is loaded, the kernel will call this function. For example, the code for the igb network card driver is located in drivers/net/ethernet/intel/igb/igb_main.c

//file: drivers/net/ethernet/intel/igb/igb_main.c

static struct pci_driver igb_driver = {
    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......
};

static int __init igb_init_module(void)
{
    ......
    ret = pci_register_driver(&igb_driver);
    return ret;
}

After the driver's pci_register_driver call completes, the Linux kernel knows the relevant information about the driver, such as the igb driver's igb_driver_name and the address of its igb_probe function. When the network card device is recognized, the kernel calls the driver's probe method (for igb_driver this is igb_probe). The purpose of the probe method is to make the device ready. For the igb network card, igb_probe is located in drivers/net/ethernet/intel/igb/igb_main.c. The main operations are as follows:

Figure 6 NIC driver initialization

In step 5, we see that the NIC driver implements the interfaces required by ethtool and registers the corresponding function addresses here. When ethtool issues a system call, the kernel finds the callback function for the requested operation. For the igb network card, these implementation functions are all under drivers/net/ethernet/intel/igb/igb_ethtool.c. Now you can thoroughly understand how ethtool works: the reason this command can view the NIC's rx/tx packet statistics, modify the NIC's adaptive mode, and adjust the number and size of the RX queues is that the ethtool command ultimately calls the corresponding methods of the NIC driver, not that ethtool itself has some superpower.
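
As a rough, abridged sketch of what this registration looks like (the exact callback list lives in drivers/net/ethernet/intel/igb/igb_ethtool.c), the driver fills in a struct ethtool_ops with its callbacks and attaches it to the net_device during probe:

//file: drivers/net/ethernet/intel/igb/igb_ethtool.c (abridged sketch)
static const struct ethtool_ops igb_ethtool_ops = {
    .get_drvinfo       = igb_get_drvinfo,
    .get_ringparam     = igb_get_ringparam,     /* backs "ethtool -g" */
    .set_ringparam     = igb_set_ringparam,     /* backs "ethtool -G" */
    .get_ethtool_stats = igb_get_ethtool_stats, /* backs "ethtool -S" */
    .get_strings       = igb_get_strings,
    ......
};

void igb_set_ethtool_ops(struct net_device *netdev)
{
    SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops);
}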

The igb_netdev_ops registered in step 6 contains functions such as igb_open, which will be called when the network card is started.

//file: drivers/net/ethernet/intel/igb/igb_main.c

static const struct net_device_ops igb_netdev_ops = {
    .ndo_open            = igb_open,
    .ndo_stop            = igb_close,
    .ndo_start_xmit      = igb_xmit_frame,
    .ndo_get_stats64     = igb_get_stats64,
    .ndo_set_rx_mode     = igb_set_rx_mode,
    .ndo_set_mac_address = igb_set_mac,
    .ndo_change_mtu      = igb_change_mtu,
    .ndo_do_ioctl        = igb_ioctl,
    ......

In step 7, during the igb_probe initialization, igb_alloc_q_vector is also called. It registers the poll function required by the NAPI mechanism; for the igb driver this function is igb_poll, as shown in the following code.

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                              int v_count, int v_idx,
                              int txr_count, int txr_idx,
                              int rxr_count, int rxr_idx)
{
    ......
    /* initialize NAPI */
    netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
}

2.5 Start the network card

When the above initialization is complete, the network card can be started. Recall that during NIC driver initialization, the driver registered the net_device_ops structure with the kernel, which contains callbacks (function pointers) for enabling the NIC, sending packets, setting the MAC address, and so on. When a network card is enabled (for example, via ifconfig eth0 up), the igb_open method in net_device_ops is called. It usually does the following:

Figure 7 Start the network card

//file: drivers/net/ethernet/intel/igb/igb_main.c

static int __igb_open(struct net_device *netdev, bool resuming)
{
    /* allocate transmit descriptors */
    err = igb_setup_all_tx_resources(adapter);

    /* allocate receive descriptors */
    err = igb_setup_all_rx_resources(adapter);

    /* register the interrupt handler */
    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    /* enable NAPI */
    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

The __igb_open function above calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In igb_setup_all_rx_resources, the RingBuffer is allocated and the mapping between memory and the RX queues is established. (The number and size of the RX and TX queues can be configured via ethtool.) Next, let's look at the interrupt registration in igb_request_irq:

static int igb_request_irq(struct igb_adapter *adapter)
{
    if (adapter->msix_entries) {
        err = igb_request_msix(adapter);
        if (!err)
            goto request_done;
        ......
    }
}

static int igb_request_msix(struct igb_adapter *adapter)
{
    ......
    for (i = 0; i < adapter->num_q_vectors; i++) {
        ...
        err = request_irq(adapter->msix_entries[vector].vector,
                          igb_msix_ring, 0, q_vector->name,
                          ......
    }
}

Tracing the calls in the code above, __igb_open => igb_request_irq => igb_request_msix, we see in igb_request_msix that for a multi-queue network card, an interrupt is registered for each queue, and the corresponding interrupt handler is igb_msix_ring (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode, each RX queue has an independent MSI-X interrupt, so at the NIC hardware-interrupt level, received packets can be directed to different CPUs for processing. (You can use irqbalance, or modify /proc/irq/IRQ_NUMBER/smp_affinity, to change the binding to CPUs.)

When the above preparations are done, you can open the door to welcome guests (data packets)!

3. welcome the arrival of data

3.1 Hard interrupt handling

When a data frame arrives at the network card from the network cable, its first stop is the NIC's receive queue. The NIC looks for an available buffer location in the RingBuffer allocated to it, and the DMA engine then DMAs the data into the memory previously associated with the NIC; the CPU is unaware of any of this. When the DMA operation completes, the NIC raises a hard interrupt to the CPU to notify it that data has arrived.

Figure 8 Hard interrupt processing of network card data

Note: when the RingBuffer is full, newly arriving packets are discarded. When you inspect a NIC with ifconfig, it may show an overruns counter, which indicates packets dropped because the ring queue was full. If you see such packet loss, you may need to increase the length of the ring queue with the ethtool command.

In the section on starting the network card, we mentioned that the NIC's registered hard-interrupt handler is igb_msix_ring.

//file: drivers/net/ethernet/intel/igb/igb_main.c

static irqreturn_t igb_msix_ring(int irq, void *data)
{
    struct igb_q_vector *q_vector = data;

    /* Write the ITR value calculated from the previous interrupt. */
    igb_write_itr(q_vector);

    napi_schedule(&q_vector->napi);

    return IRQ_HANDLED;
}

igb_write_itr merely records the hardware interrupt frequency (reportedly in order to reduce the interrupt rate to the CPU). Follow the napi_schedule call all the way down: __napi_schedule => ____napi_schedule.

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Here we see that list_add_tail modifies the poll_list in the per-CPU variable softnet_data, appending the napi_struct passed in by the driver. poll_list in softnet_data is a doubly linked list of devices that have input frames waiting to be processed. Then __raise_softirq_irqoff triggers the NET_RX_SOFTIRQ soft interrupt. This so-called triggering is nothing more than an OR operation on a variable.

void __raise_softirq_irqoff(unsigned int nr)
{
    trace_softirq_raise(nr);
    or_softirq_pending(1UL << nr);
}

//file: include/linux/irq_cpustat.h
#define or_softirq_pending(x) (local_softirq_pending() |= (x))

As we said, the hard interrupt only does the simple, necessary work and hands most of the processing over to the soft interrupt. From the code above you can see that the hard interrupt path is really very short: it records a register, modifies the CPU's poll_list a little, and raises a soft interrupt. That's it; the hard interrupt's job is done.

3.2 The ksoftirqd kernel thread handles soft interrupts

Figure 9 ksoftirqd kernel thread

When the ksoftirqd kernel thread was created earlier, we mentioned its two thread functions, ksoftirqd_should_run and run_ksoftirqd. The code of ksoftirqd_should_run is as follows:

static int ksoftirqd_should_run(unsigned int cpu)
{
    return local_softirq_pending();
}

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

Here we see the same function, local_softirq_pending, that was used in the hard interrupt. The difference is that the hard interrupt writes the flag, whereas here it is only read. If the hard interrupt set NET_RX_SOFTIRQ, it can naturally be read here, and the thread then enters run_ksoftirqd for processing:

static void run_ksoftirqd(unsigned int cpu)
{
    local_irq_disable();
    if (local_softirq_pending()) {
        __do_softirq();
        rcu_note_context_switch(cpu);
        local_irq_enable();
        cond_resched();
        return;
    }
    local_irq_enable();
}

In __do_softirq, the registered action is looked up according to which soft interrupts are pending on the current CPU, and is then called:

asmlinkage void __do_softirq(void)
{
    do {
        if (pending & 1) {
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();
            ...
            trace_softirq_entry(vec_nr);
            h->action(h);
            trace_softirq_exit(vec_nr);
            ...
        }
        h++;
        pending >>= 1;
    } while (pending);
}

In the network subsystem initialization section we saw that the handler registered for NET_RX_SOFTIRQ is net_rx_action, so it is net_rx_action that gets executed here.

One detail worth noting: both the hard interrupt's setting of the soft interrupt flag and ksoftirqd's check for pending soft interrupts are based on smp_processor_id(). This means that whichever CPU the hard interrupt is handled on, the soft interrupt is handled on that same CPU. So if you find that soft interrupt CPU consumption on your Linux box is concentrated on a single core, the fix is to adjust the CPU affinity of the hard interrupts so that they are spread across different CPU cores.

Now let's focus on the core function, net_rx_action:

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        ......
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }

        budget -= work;
    }
}

The time_limit and budget at the beginning of the function control when net_rx_action exits on its own, to ensure that receiving network packets does not monopolize the CPU; the remaining packets are processed the next time the NIC raises a hard interrupt. The budget can be adjusted via a kernel parameter. The remaining core logic of this function is to obtain the current CPU's softnet_data, traverse its poll_list, and execute the poll function registered by the NIC driver. For the igb network card, this is the igb driver's igb_poll function.

static int igb_poll(struct napi_struct *napi, int budget)
{
    ...
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);

    if (q_vector->rx.ring)
        clean_complete &= igb_clean_rx_irq(q_vector, budget);
    ...
}

In the read operation, the focus of igb_poll is the call to igb_clean_rx_irq.

static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
{
    ...
    do {
        /* retrieve a buffer from the ring */
        skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);

        /* fetch next buffer in frame if non-eop */
        if (igb_is_non_eop(rx_ring, rx_desc))
            continue;

        /* verify the packet layout is correct */
        if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {
            skb = NULL;
            continue;
        }

        /* populate checksum, timestamp, VLAN, and protocol */
        igb_process_skb_fields(rx_ring, rx_desc, skb);

        napi_gro_receive(&q_vector->napi, skb);
    }
}
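
The skb being filled in above is a struct sk_buff, the kernel's universal packet descriptor. Abridged to a few of the fields touched on this path (the full definition is in include/linux/skbuff.h), it looks roughly like this:

//file: include/linux/skbuff.h (abridged)
struct sk_buff {
    struct sk_buff      *next;      /* skbs are chained into queues (sk_buff_head) */
    struct sk_buff      *prev;
    struct sock         *sk;        /* owning socket, set once the packet is matched */
    struct net_device   *dev;       /* device the frame arrived on */
    unsigned int        len;        /* total data length */
    __be16              protocol;   /* e.g. ETH_P_IP, filled in by the driver */
    ......
    unsigned char       *head;      /* start of the allocated buffer */
    unsigned char       *data;      /* start of data for the current protocol layer */
};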

The job of igb_fetch_rx_buffer and igb_is_non_eop is to take the data frame off the RingBuffer. Why two functions? Because a frame may occupy more than one RingBuffer entry, they are fetched in a loop until the end of the frame. A received data frame is represented by an sk_buff (sketched above). After the data is received, some checks are performed on it, and then fields of the skb such as the timestamp, VLAN id, and protocol are set. Next it enters napi_gro_receive:

//file: net/core/dev.c

gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    skb_gro_reset_offset(skb);

    return napi_skb_finish(dev_gro_receive(napi, skb), skb);
}

The function dev_gro_receive implements the NIC's GRO feature, which can be simply understood as merging related small packets into one large packet. The purpose is to reduce the number of packets handed to the network stack, which helps reduce CPU usage. Let's ignore it for now and look directly at napi_skb_finish, which mainly calls netif_receive_skb.

//file: net/core/dev.c

static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
{
    switch (ret) {
    case GRO_NORMAL:
        if (netif_receive_skb(skb))
            ret = GRO_DROP;
        break;
    ......
    }

In netif_receive_skb, the data packet is handed to the protocol stack. Note that sections 3.3, 3.4, and 3.5 below are also part of soft interrupt processing; they are split into separate subsections only for readability.

3.3 Network protocol stack processing

The netif_receive_skb function dispatches according to the packet's protocol; for a UDP packet, it is delivered to the ip_rcv() and then udp_rcv() protocol handlers for processing.

Figure 10 Network protocol stack processing

//file: net/core/dev.c

int netif_receive_skb(struct sk_buff *skb)
{
    // RPS processing logic, ignored for now ......
    return __netif_receive_skb(skb);
}

static int __netif_receive_skb(struct sk_buff *skb)
{
    ......
    ret = __netif_receive_skb_core(skb, false);
}

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
{
    ......
    // pcap logic: the data is delivered to the capture point here (this is where tcpdump gets its packets)
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......

    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
}

In __netif_receive_skb_core we can see the packet capture point used by tcpdump, a tool I use all the time, which is quite exciting; the time spent reading source code really isn't wasted. __netif_receive_skb_core then extracts the protocol information from the packet and traverses the list of callback functions registered for that protocol. ptype_base is a hash table, mentioned in the protocol registration section, and the address of the ip_rcv function is stored in it.

//file: net/core/dev.c

static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
    ......
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

The line pt_prev->func calls the processing function registered by the protocol layer. For the ip package, it will enter ip_rcv (if it is an arp package, it will enter arp_rcv).

3.4 IP protocol layer processing

Let's take a rough look at what Linux does at the ip protocol layer, and how the packet is further sent to the udp or tcp protocol processing function.

//file: net/ipv4/ip_input.c

int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);
}

Here NF_HOOK is a hook macro: after the hooks registered for this hook point have been executed (this is where netfilter gets its chance to run), it continues on to the function given as the last parameter, ip_rcv_finish.
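
The macro itself is short; in 3.10 it looks roughly like this (abridged): the netfilter hooks registered for the hook point run first, and only if they accept the packet is okfn, here ip_rcv_finish, called.

//file: include/linux/netfilter.h (abridged)
static inline int
NF_HOOK(uint8_t pf, unsigned int hook, struct sk_buff *skb,
        struct net_device *in, struct net_device *out,
        int (*okfn)(struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, skb, in, out, okfn, INT_MIN);
}

static inline int
NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sk_buff *skb,
               struct net_device *in, struct net_device *out,
               int (*okfn)(struct sk_buff *), int thresh)
{
    /* run the netfilter hooks registered for this hook point */
    int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
    if (ret == 1)            /* hooks accepted the packet */
        ret = okfn(skb);     /* continue with ip_rcv_finish */
    return ret;
}

After the hooks pass, execution continues in ip_rcv_finish: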

static int ip_rcv_finish(struct sk_buff *skb)
{
    ......
    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                                       iph->tos, skb->dev);
        ...
    }
    ......
    return dst_input(skb);
}

Tracing into ip_route_input_noref, we see that it in turn calls ip_route_input_mc. In ip_route_input_mc, the function ip_local_deliver is assigned to dst.input, as follows:

//file: net/ipv4/route.c

static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                             u8 tos, struct net_device *dev, int our)
{
    if (our) {
        rth->dst.input = ip_local_deliver;
        rth->rt_flags |= RTCF_LOCAL;
    }
}

Now go back to the return dst_input(skb); line in ip_rcv_finish:

/* Input packet from network to transport.  */
static inline int dst_input(struct sk_buff *skb)
{
    return skb_dst(skb)->input(skb);
}

The input method called by skb_dst(skb)->input is the ip_local_deliver assigned by the routing subsystem.

//file: net/ipv4/ip_input.c

int ip_local_deliver(struct sk_buff *skb)
{
    /*
     * Reassemble IP fragments.
     */
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
    ......
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        ret = ipprot->handler(skb);
    }
}

As seen in the protocol registration section, the function addresses of tcp_v4_rcv() and udp_rcv() are stored in inet_protos. Here the handler is chosen according to the protocol type in the packet, and the skb is handed further up to the higher-layer protocols, udp and tcp.

3.5 UDP protocol layer processing

In the protocol registration section, we said that the processing function of the udp protocol is udp_rcv.

//file: net/ipv4/udp.c

int udp_rcv(struct sk_buff *skb)
{
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}

int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, int proto)
{
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
    }

    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
}

__udp4_lib_lookup_skb searches for the corresponding socket based on skb, and when found, puts the data packet in the buffer queue of the socket. If it is not found, an icmp packet with the target unreachable is sent.

//file: net/ipv4/udp.c

int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    ......
    if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
        goto drop;

    rc = 0;

    ipv4_pktinfo_prepare(skb);
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk))
        rc = __udp_queue_rcv_skb(sk, skb);
    else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        bh_unlock_sock(sk);
        goto drop;
    }
    bh_unlock_sock(sk);

    return rc;
}

sock_owned_by_user determines whether a user process is currently making a system call on this socket (i.e. the socket is locked by the user). If not, the packet can be placed directly on the socket's receive queue. If it is, the packet is added to the backlog queue via sk_add_backlog. When the user process releases the socket, the kernel checks the backlog queue and, if there is data, moves it to the receive queue, as sketched below.
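
A simplified sketch of that backlog flush, based on __release_sock in net/core/sock.c and heavily abridged, looks roughly like this:

//file: net/core/sock.c (simplified sketch)
static void __release_sock(struct sock *sk)
{
    struct sk_buff *skb = sk->sk_backlog.head;

    do {
        sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
        bh_unlock_sock(sk);

        do {
            struct sk_buff *next = skb->next;

            skb->next = NULL;
            /* for a UDP socket this ends up in __udp_queue_rcv_skb,
             * i.e. the packet is moved into sk_receive_queue */
            sk_backlog_rcv(sk, skb);
            skb = next;
        } while (skb != NULL);

        bh_lock_sock(sk);
    } while ((skb = sk->sk_backlog.head) != NULL);
}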

If sk_rcvqueues_full reports that the receive queue is full, the packet is dropped directly. The receive queue size is affected by the kernel parameters net.core.rmem_max and net.core.rmem_default.

4. recvfrom system call

So much for one side of the story; now for the other. Above we finished the kernel's entire receive path, which ends with the data packet being placed in the socket's receive queue. Now let's look back at what happens after the user process calls recvfrom. The recvfrom we call in our code is a glibc library function; executing it traps the process into kernel mode and enters the system call sys_recvfrom implemented by Linux. Before looking at sys_recvfrom, let's take a brief look at the core socket data structure. This structure is large, so we only draw the parts related to today's topic, as follows:

Figure 11 Socket kernel data structures

The const struct proto_ops in the socket data structure corresponds to a protocol's method set. Each protocol implements a different set of methods; for the IPv4 Internet protocol suite, each protocol has a corresponding set of handlers, as shown below. For UDP it is defined by inet_dgram_ops, in which the inet_recvmsg method is registered.

//file: net/ipv4/af_inet.c

const struct proto_ops inet_stream_ops = {
    ......
    .recvmsg = inet_recvmsg,
    .mmap    = sock_no_mmap,
    ......
}

const struct proto_ops inet_dgram_ops = {
    ......
    .sendmsg = inet_sendmsg,
    .recvmsg = inet_recvmsg,
    ......
}

Another member of the socket data structure, struct sock *sk, is a very large and very important substructure. Its sk_prot field defines a second level of handler functions; for the UDP protocol it is set to udp_prot, the method set implemented by UDP.

//file: net/ipv4/udp.c

struct proto udp_prot = {
    .name     = "UDP",
    .owner    = THIS_MODULE,
    .close    = udp_lib_close,
    .connect  = ip4_datagram_connect,
    ......
    .sendmsg  = udp_sendmsg,
    .recvmsg  = udp_recvmsg,
    .sendpage = udp_sendpage,
    ......
}

After looking at these socket structures, let's look at the implementation of sys_recvfrom.
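
For reference, the entry point in net/socket.c looks roughly like this in 3.10 (abridged): the fd is resolved to a struct socket, sock_recvmsg is called, and the peer address is copied back to user space. sock_recvmsg eventually dispatches to sock->ops->recvmsg, which for AF_INET sockets is inet_recvmsg.

//file: net/socket.c (abridged)
SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
                unsigned int, flags, struct sockaddr __user *, addr,
                int __user *, addr_len)
{
    struct socket *sock;
    struct iovec iov;
    struct msghdr msg;
    struct sockaddr_storage address;
    int err, err2;
    int fput_needed;

    /* resolve the user's fd to a struct socket */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        goto out;
    ......
    msg.msg_iovlen = 1;
    msg.msg_iov = &iov;
    iov.iov_len = size;
    iov.iov_base = ubuf;
    msg.msg_name = (struct sockaddr *)&address;
    msg.msg_namelen = 0;

    /* dispatches to sock->ops->recvmsg, i.e. inet_recvmsg for AF_INET */
    err = sock_recvmsg(sock, &msg, size, flags);

    if (err >= 0 && addr != NULL) {
        err2 = move_addr_to_user(&address, msg.msg_namelen, addr, addr_len);
        if (err2 < 0)
            err = err2;
    }

    fput_light(sock->file, fput_needed);
out:
    return err;
}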

Figure 12 The internal implementation of the recvfrom function

In inet_recvmsg, sk->sk_prot->recvmsg is called.

//file: net/ipv4/af_inet.c

int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
                 size_t size, int flags)
{
    ......
    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                               flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

We mentioned above that for the socket of the udp protocol, the sk_prot is the struct proto udp_prot under net/ipv4/udp.c. Thus we found the udp_recvmsg method.
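
udp_recvmsg, in turn, pulls a packet off the socket by calling __skb_recv_datagram; abridged, the relevant part looks like this, and the core of __skb_recv_datagram follows right after.

//file: net/ipv4/udp.c (abridged)
int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len, int noblock, int flags, int *addr_len)
{
    ......
    skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
                              &peeked, &off, &err);
    ......
}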

//file: net/core/datagram.c (EXPORT_SYMBOL(__skb_recv_datagram))

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                    int *peeked, int *off, int *err)
{
    ......
    do {
        struct sk_buff_head *queue = &sk->sk_receive_queue;

        skb_queue_walk(queue, skb) {
            ......
        }

        /* User doesn't want to wait */
        error = -EAGAIN;
        if (!timeo)
            goto no_packet;
    } while (!wait_for_more_packets(sk, err, &timeo, last));
}

Finally we have found what we were looking for: the so-called read operation is simply accessing sk->sk_receive_queue. If there is no data there and the caller is allowed to wait, wait_for_more_packets() is called to wait, which puts the user process to sleep.

5. summary

The network module is the most complex module in the Linux kernel. It seems that a simple packet receiving process involves the interaction between many kernel components, such as network card drivers, protocol stacks, and kernel ksoftirqd threads. It seems very complicated. This article wants to explain the kernel packet receiving process clearly in an easy-to-understand manner by means of illustrations. Now let us string together the entire collection process.

After the user executes the recvfrom call, the user process enters kernel mode via the system call. If there is no data in the receive queue, the process goes to sleep and is suspended by the operating system. This part is relatively simple; most of the rest of the story is played out by other modules of the Linux kernel.

First of all, before starting to collect packages, Linux has to do a lot of preparations:

  1. Create ksoftirqd thread, set its own thread function for it, and count on it to handle soft interrupts later

  2. Protocol stack registration, Linux needs to implement many protocols, such as arp, icmp, ip, udp, tcp, each protocol will register its own processing function, so that the package can quickly find the corresponding processing function

  3. The network card driver is initialized. Each driver has an initialization function, which the kernel calls. During this initialization, the driver prepares its DMA resources and tells the kernel the address of its NAPI poll function

  4. Start the network card, allocate RX and TX queues, register the corresponding processing function of the interrupt

The above is the important work before the kernel prepares to receive the packet. When the above is ready, you can turn on the hard interrupt and wait for the arrival of the data packet.

When the data arrives, the first to greet it is the network card (okay, that's stating the obvious):

  1. The network card DMAs the data frame to the RingBuffer of the memory, and then initiates an interrupt notification to the CPU

  2. The CPU responds to the interrupt request and calls the interrupt handling function registered when the network card is started

  3. The interrupt handler has almost nothing to do, so it initiates a soft interrupt request

  4. The ksoftirqd kernel thread finds that a soft interrupt request has arrived and first disables hard interrupts

  5. The ksoftirqd thread starts to call the poll function of the driver to receive packets

  6. The poll function sends the received packet to the ip_rcv function registered by the protocol stack

  7. The ip_rcv function will send the package to the udp_rcv function (for the tcp package, it will be sent to tcp_rcv)

Now we can go back to the question at the beginning. Behind the single, simple recvfrom line we see in user space, the Linux kernel has to do all of this work so that we can receive the data smoothly. And this is just plain UDP; for TCP the kernel has to do even more. One can't help but marvel at how much care the kernel developers have put into this.

After understanding the entire packet receiving process, we can clearly know the CPU overhead of Linux receiving a packet. First of all, the first block is the overhead of the user process calling the system call into the kernel mode. The second block is the CPU overhead of the CPU responding to the hard interrupt of the packet. The third block is spent by the soft interrupt context of the ksoftirqd kernel thread. Later, we will post an article to actually observe these overheads.

In addition, there are many details of network sending and receiving that we have not expanded on, such as the non-NAPI path, GRO, RPS, and so on. Covering everything would get in the way of grasping the overall process, so we tried to keep only the main framework; less is more!