CN117714400A

CN117714400A - Network performance comprehensive optimization method for Loongson 3A domestic software and hardware platform

Info

Publication number: CN117714400A
Application number: CN202311621493.6A
Authority: CN
Inventors: 王云涛; 虞文武; 张振华; 孟浩飞
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2023-11-30
Filing date: 2023-11-30
Publication date: 2024-03-15

Abstract

The invention provides a network performance comprehensive optimization method for a Loongson 3A domestic software and hardware platform. At the domestic network card, an optimized network card driver reduces the interrupt frequency and the number of data copies, and avoids a large number of DMA memory allocation and release operations during data forwarding, which reduces repeated memory operations and the memory management burden on the system. After a data packet is forwarded from the domestic network card to the domestic operating system kernel, the generated interrupts are load-balanced in the kernel by an interrupt rotation load balancing method, so that every processor core of the Loongson 3A participates in network interrupt processing; this effectively improves network throughput and reduces the packet loss rate. Meanwhile, an optimized cache lock reduces the data access frequency in the kernel protocol stack. The resulting comprehensive data forwarding optimization method, designed for the Loongson 3A processor on a domestic software and hardware platform, can meet application requirements for high bandwidth and low latency.

Description

Network performance comprehensive optimization method for Loongson 3A domestic software and hardware platform

Technical Field

The invention belongs to the technical field of network performance optimization, relates to a domestic software and hardware platform adaptation optimization technology, and particularly relates to a network performance comprehensive optimization method for a Loongson 3A domestic software and hardware platform.

Background

Loongson 3 is a domestic multi-core high-performance processor developed by the Institute of Computing Technology, Chinese Academy of Sciences, and is the first four-core processor in China with fully independent intellectual property rights. The network performance of domestic software and hardware platforms built on the Loongson 3A series multi-core processors still lags behind that of international mainstream processor platforms. A network performance optimization method for the Loongson 3A domestic software and hardware platform is therefore urgently needed to achieve high-bandwidth, low-latency network data forwarding on the Loongson multi-core platform.

Disclosure of Invention

In order to solve the problems, the invention provides a network performance comprehensive optimization method for Loongson 3A domestic software and hardware platforms.

In order to achieve the above purpose, the present invention provides the following technical solutions:

The network performance comprehensive optimization method for the Loongson 3A domestic software and hardware platform comprises the following steps:

when the Loongson 3A domestic software and hardware platform receives a data packet sent by a transmitting end, forwarding the data through the network card with the optimized driver;

after the data packet is forwarded from the network card to the operating system kernel, load-balancing the generated interrupts in the kernel by an interrupt rotation load balancing method, and at the same time performing cache locking operations in the kernel protocol stack.

Further, the process of forwarding data through the network card with the optimized driver includes the following steps:

(1) Selecting an appropriate number of receive message buffers for the network card driver by establishing a receive message buffer model based on the NAPI polling mechanism;

(2) After obtaining the relevant parameters from the receive message buffer model, pre-allocating that number of receive message buffers when the network card is initialized, performing a DMA (direct memory access) streaming mapping on each receive message buffer, and storing the mapping as a DMA channel parameter;

(3) When the network card receives a data packet and triggers an interrupt, judging whether the network card is under high load and, if so, disabling the hardware interrupt in the interrupt handler and activating a polling thread;

(4) In the polling thread, transferring data packets by DMA one by one into the recorded receive message buffers, setting the specific fields of each message buffer, and passing it to the upper protocol stack;

(5) After the polling thread has processed all data packets in the receive queue or the maximum number allowed, re-enabling the hardware interrupt, after which the system continues with other tasks until the next interrupt is generated.
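As a non-limiting illustration, steps (3) to (5) can be modeled in user-space C as follows. This is a minimal sketch and not the driver code of the invention: the names `rx_model`, `rx_interrupt`, `rx_poll`, the load threshold, and the polling budget are all hypothetical and introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the receive path; not kernel code. */
struct rx_model {
    size_t pending;     /* packets waiting in the receive queue */
    bool   irq_enabled; /* state of the hardware receive interrupt */
    size_t delivered;   /* packets handed to the protocol stack */
};

/* Step (3): on an interrupt, decide between interrupt and polling mode.
 * Returns true when the polling thread should be activated. */
bool rx_interrupt(struct rx_model *m, size_t high_load_threshold)
{
    assert(m != NULL);
    if (m->pending >= high_load_threshold) {
        m->irq_enabled = false; /* high load: mask the hardware interrupt */
        return true;
    }
    /* Light load: deliver directly from the interrupt path. */
    m->delivered += m->pending;
    m->pending = 0;
    return false;
}

/* Steps (4) and (5): process at most `budget` packets, then re-enable
 * the hardware interrupt. Returns the number of packets processed. */
size_t rx_poll(struct rx_model *m, size_t budget)
{
    assert(m != NULL);
    size_t done = 0;
    while (done < budget && m->pending > 0) {
        m->pending--;
        m->delivered++; /* pass the packet to the upper protocol stack */
        done++;
    }
    m->irq_enabled = true;
    return done;
}
```

With 100 pending packets, a threshold of 32, and a budget of 64, the model masks the interrupt, delivers 64 packets in one polling pass, and re-enables the interrupt with 36 packets still queued for the next interrupt.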

Further, when the hardware interrupt is disabled, a timer is started, and when the timer reaches the delay time threshold, the data is sent to the protocol stack.

Further, the interrupt rotation load balancing method comprises the following steps:

after an interrupt signal is received, the interrupt controller processes it in round-robin mode and sends an inter-core interrupt to the target processor core through the designated interrupt signal;

the target processor core that receives the inter-core interrupt reads the IPI_Status register to obtain the dispatched interrupt number, and then performs a secondary dispatch according to that interrupt number.

Further, the cache lock locks the skb_buff data structure.

Further, the cache lock locks header information in the packet buffer.

Compared with the prior art, the invention has the following advantages and beneficial effects:

The optimized network card driver reduces the interrupt frequency and the number of data copies at the domestic network card, and avoids a large number of DMA memory allocation and release operations during data forwarding, which reduces repeated memory operations and the memory management burden on the system. Data packets are processed with a receive model that mixes interrupts and polling: when the network is under high load, the hardware interrupt of the network card is disabled and data packets are handled by polling; otherwise, the interrupt mechanism is used to receive data packets. The invention load-balances the generated interrupts in the kernel by an interrupt rotation load balancing method. Cache locking is applied to the skb_buff data structure and to the header information in the data packet buffer, so that the optimized cache lock reduces the data access frequency in the kernel protocol stack. The resulting comprehensive data forwarding optimization method, designed for the Loongson 3A processor on a domestic software and hardware platform, can meet application requirements for high bandwidth and low latency.

Drawings

Fig. 1 is a flow chart of a network performance comprehensive optimization method for a Loongson 3A domestic software and hardware platform.

Detailed Description

The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.

The invention provides a network performance comprehensive optimization method for a Loongson 3A domestic software and hardware platform, which is applied to the Loongson 3A domestic software and hardware platform. The Loongson 3A domestic software and hardware platform comprises a Loongson 3A series multi-core processor, a domestic network chip and a domestic operating system.

Step 1: when the Loongson 3A domestic software and hardware platform receives a data packet sent by a transmitting end, the optimized network card driver reduces the interrupt frequency and the number of data copies at the domestic network card.

The optimization is realized by modifying the network card driver and comprises the following steps:

(1) Selecting an appropriate number of receive message buffers for the domestic network card driver by establishing a receive message buffer model based on the NAPI (New API) polling mechanism;

(2) After obtaining the relevant parameter (the number of buffers) from the receive message buffer model, pre-allocating that number of receive message buffers when the domestic network card is initialized, performing a DMA (Direct Memory Access) streaming mapping on each receive message buffer, and storing the mapping as a DMA channel parameter;

(3) When the domestic network card receives a data packet and triggers an interrupt, judging whether the domestic network card is under high load and, if so, disabling the hardware interrupt in the interrupt handler and activating a polling thread.
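The pre-allocation in step (2) can be sketched as a fixed buffer pool that is built once at initialization and then reused, so that no buffer is allocated or released per packet. The following is a user-space model under stated assumptions: `rx_pool`, `rx_buf`, and the 2048-byte buffer size are hypothetical, and the `dma_addr` field is only a stand-in for the handle that a streaming DMA mapping (for example via the Linux `dma_map_single()` call) would return in a real driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RX_BUF_SIZE 2048u /* assumed per-buffer size in bytes */

struct rx_buf {
    void     *cpu_addr; /* address used by the driver */
    uintptr_t dma_addr; /* stand-in for the streaming DMA mapping handle */
};

struct rx_pool {
    struct rx_buf *bufs;
    size_t count; /* number of pre-allocated buffers */
    size_t next;  /* index of the next buffer to hand to the NIC */
};

/* Release every buffer; called once at teardown, never per packet. */
void rx_pool_destroy(struct rx_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++)
        free(p->bufs[i].cpu_addr);
    free(p->bufs);
    p->bufs = NULL;
    p->count = 0;
    p->next = 0;
}

/* Allocate and "map" all buffers at initialization. Returns 0 on success. */
int rx_pool_init(struct rx_pool *p, size_t count)
{
    assert(p != NULL && count > 0);
    p->bufs = calloc(count, sizeof(*p->bufs));
    p->count = 0;
    p->next = 0;
    if (p->bufs == NULL)
        return -1;
    for (size_t i = 0; i < count; i++) {
        void *mem = malloc(RX_BUF_SIZE);
        if (mem == NULL) {
            rx_pool_destroy(p);
            return -1;
        }
        p->bufs[i].cpu_addr = mem;
        p->bufs[i].dma_addr = (uintptr_t)mem; /* recorded once, reused later */
        p->count++;
    }
    return 0;
}

/* Hand out buffers in ring order without any new allocation. */
struct rx_buf *rx_pool_next(struct rx_pool *p)
{
    assert(p != NULL && p->count > 0);
    struct rx_buf *b = &p->bufs[p->next];
    p->next = (p->next + 1) % p->count;
    return b;
}
```

Because the mapping is recorded once per buffer, the per-packet path only advances an index, which is the source of the reduced memory management burden described above.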

In particular, the network card optimization of the invention is embodied as an interrupt moderation algorithm that balances two principles: (1) collect as many data packets as possible per interrupt signal; (2) respond to each interrupt signal as quickly as possible, rather than maximizing the number of packets handled per interrupt signal. Based on these principles, the algorithm uses an empirically chosen delay time threshold of 250 microseconds. When the interrupt signal for a data packet is received, the receive interrupt is immediately disabled and a countdown timer is started; incoming data is then stored in the receive DMA ring (a circular buffer). After at most 250 microseconds, the data is sent to the protocol stack and the previously disabled receive interrupt is re-enabled. The transmit process is similar to the receive process.
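The effect of the 250 microsecond threshold can be illustrated with a small model that groups packet arrival times into batches. This is a sketch under stated assumptions, not the algorithm of the invention: the function name `coalesce` and its parameters are hypothetical, arrival times are assumed to be sorted in ascending order, and each batch is assumed to be delivered exactly when its window closes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELAY_US 250u /* delay time threshold stated in the description */

/* Group sorted arrival timestamps (in microseconds) into batches.
 * The first packet of a batch opens a window of DELAY_US; every packet
 * arriving within that window joins the same batch. Writes each batch
 * size to `batch` and returns the number of batches, which equals the
 * number of interrupts actually taken. */
size_t coalesce(const uint64_t *arrival_us, size_t n,
                size_t *batch, size_t max_batches)
{
    assert(arrival_us != NULL && batch != NULL);
    size_t nb = 0;
    size_t i = 0;
    while (i < n && nb < max_batches) {
        uint64_t deadline = arrival_us[i] + DELAY_US;
        size_t count = 0;
        while (i < n && arrival_us[i] <= deadline) {
            count++;
            i++;
        }
        batch[nb++] = count;
    }
    return nb;
}
```

For arrivals at 0, 100, 250, 251, 600, and 700 microseconds, the model takes three interrupts instead of six, with batches of 3, 1, and 2 packets, while no packet waits longer than 250 microseconds.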

The specific process of judging the interrupt signal is as follows:

In the interrupt function, the udma_inr() function (return type irqreturn_t) is called to judge whether the received interrupt signal is a transmit interrupt or a receive interrupt; if the received signal is the interrupt for the first data packet, the udma_irq_disable() function is called to disable the interrupt.

The udma_ring_clean_rx_irq() function is then called to judge whether the current DMA ring descriptor has completed; if so, the udma_rx_ring_pop() function is called to clear that descriptor and the data is forwarded onward, thereby cleaning the DMA ring.
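The descriptor check and pop described above can be modeled as follows. This sketch does not reproduce udma_ring_clean_rx_irq() or udma_rx_ring_pop(), whose bodies are not given in the description; the `ring` and `desc` structures, the ring size of 8, and the function `ring_clean_rx` are hypothetical names used only to show the pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u /* assumed number of descriptors in the ring */

struct desc {
    bool     done; /* set by the hardware when the DMA transfer completed */
    unsigned len;  /* length of the received data */
};

struct ring {
    struct desc d[RING_SIZE];
    unsigned head; /* next descriptor to check */
};

/* Pop completed descriptors from the head of the ring, at most `budget`
 * of them, and stop at the first descriptor that is not yet complete.
 * Returns the number of descriptors cleaned. */
unsigned ring_clean_rx(struct ring *r, unsigned budget)
{
    assert(r != NULL);
    unsigned cleaned = 0;
    while (cleaned < budget && r->d[r->head].done) {
        /* The data would be forwarded to the protocol stack here. */
        r->d[r->head].done = false; /* return the descriptor to the hardware */
        r->d[r->head].len = 0;
        r->head = (r->head + 1) % RING_SIZE;
        cleaned++;
    }
    return cleaned;
}
```

Cleaning stops at the first incomplete descriptor so that packets are always forwarded in the order the hardware filled them.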

Step 2, after the data packet is forwarded from the domestic network card to the domestic operating system kernel, carrying out load balancing on generated interrupt in the kernel by an optimization method of interrupt rotation load balancing; and meanwhile, the data access frequency is reduced in the kernel protocol stack through an optimized cache lock.

The interrupt rotation load balancing method is implemented as follows. The Loongson 3A processor provides an IPI_Status inter-core interrupt status register. When any bit of this register is set to 1 and the corresponding bit of the IPI_Enable register is enabled, the INT4 interrupt line of the Loongson processor core is asserted; INT4 corresponds to IP6 of the STATUS register, that is, to the inter-core interrupt, so the inter-core interrupt is triggered. After the interrupt controller receives an interrupt, it processes the interrupt number in round-robin mode and sends an inter-core interrupt to the target processor core using the designated interrupt number. The target processor core that receives the inter-core interrupt reads its own IPI_Status register to obtain the dispatched interrupt number, and then performs a secondary dispatch according to that interrupt number by executing do_IRQ().

The specific implementation process using the interrupt rotation load balancing technique is as follows:

In the kernel interrupt function map_irq_dispatch(), a cpu mask flag variable is set to mark the target processor core for the next round. The dispatch_ip3() function calls the get_irq_ht() function to obtain an interrupt number from the HT interrupt register. If the interrupt number is 3 or 5, the loongson3_send_irq_by_ipi() function is called to send an inter-core interrupt and the cpu mask variable is modified, that is, the target processor core for the next round is updated. The loongson3_ipi_interrupt() function is then called to process the inter-core interrupt, and finally the do_IRQ() function is called to dispatch the interrupt.

After the interrupt rotation load balancing method is used for optimizing the processing, each processor core of the Loongson 3A participates in the network interrupt processing, so that the network throughput is effectively improved, and the packet loss rate of data is reduced.
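The rotation and secondary dispatch can be sketched with a user-space model in which each core has a pending-bit word that plays the role of the IPI_Status register. The structure `ipi_model` and the functions `dispatch_irq` and `ipi_interrupt` are hypothetical and do not reproduce the kernel functions named above; the choice of four cores follows the four-core Loongson 3A described in the background, and the rule that only interrupt numbers 3 and 5 are rotated follows the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4u /* four-core processor, as stated in the background */

struct ipi_model {
    uint32_t ipi_status[NR_CPUS]; /* per-core pending bits (IPI_Status stand-in) */
    unsigned next_cpu;            /* target core for the next rotated interrupt */
    unsigned handled[NR_CPUS];    /* interrupts finally handled on each core */
};

/* First-level dispatch on `local_cpu`: interrupt numbers 3 and 5 are sent
 * to the next core in rotation; others are handled locally.
 * Returns the core that will handle the interrupt. */
unsigned dispatch_irq(struct ipi_model *m, unsigned irq, unsigned local_cpu)
{
    assert(m != NULL && irq < 32 && local_cpu < NR_CPUS);
    if (irq == 3 || irq == 5) {
        unsigned target = m->next_cpu;
        m->ipi_status[target] |= UINT32_C(1) << irq; /* raise inter-core interrupt */
        m->next_cpu = (target + 1) % NR_CPUS;        /* rotate for the next round */
        return target;
    }
    m->handled[local_cpu]++;
    return local_cpu;
}

/* Second-level dispatch on the target core: read and clear the pending
 * bits, then handle each interrupt number that was set.
 * Returns the number of interrupts handled. */
unsigned ipi_interrupt(struct ipi_model *m, unsigned cpu)
{
    assert(m != NULL && cpu < NR_CPUS);
    uint32_t pending = m->ipi_status[cpu];
    m->ipi_status[cpu] = 0;
    unsigned n = 0;
    for (unsigned irq = 0; irq < 32; irq++) {
        if (pending & (UINT32_C(1) << irq)) {
            m->handled[cpu]++; /* the real kernel would call do_IRQ() here */
            n++;
        }
    }
    return n;
}
```

Eight consecutive network interrupts with number 3 are spread as two per core in this model, whereas without rotation all eight would land on the core that received the original interrupt.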

The optimized cache lock is implemented as follows. Throughout system operation, the addresses of the DMA descriptors and of the transmit and receive queues are fixed, and the dynamically allocated skb_buff structures and data buffers fall within a limited address range. The data touched along the whole network processing path therefore has good spatial locality, and cache locking can be used to improve network processing performance. Because the system uses zero-copy techniques, the skb_buff data structure is accessed frequently while the data packet buffer is accessed much less often. In the specific implementation, therefore, the skb_buff data structure is cache-locked, and within the data packet buffer only the header information (MAC address, IP address, TCP/UDP header information) is cache-locked.
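The benefit of locking only the header information can be estimated with simple arithmetic on cache lines. The sketch below is illustrative only: the helper `lines_spanned` is hypothetical, the header sizes are the minimum Ethernet, IPv4, TCP, and UDP header lengths without options, and the cache line size is a parameter that must be taken from the actual processor rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum header lengths in bytes (no VLAN tag, no IP or TCP options). */
enum {
    ETH_HDR_LEN      = 14,
    IPV4_HDR_MIN_LEN = 20,
    TCP_HDR_MIN_LEN  = 20,
    UDP_HDR_LEN      = 8
};

/* Number of cache lines touched by `len` bytes starting at byte `offset`
 * of a line-aligned buffer, for a cache line of `line_size` bytes. */
size_t lines_spanned(size_t offset, size_t len, size_t line_size)
{
    assert(line_size > 0);
    if (len == 0)
        return 0;
    size_t first = offset / line_size;
    size_t last = (offset + len - 1) / line_size;
    return last - first + 1;
}
```

With a 64-byte line, the 54 bytes of Ethernet, IPv4, and TCP headers fit in a single cache line, while a full 1514-byte frame would occupy 24 lines, so locking the headers alone consumes far less locked cache capacity than locking the whole packet buffer.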

The technical means disclosed by the scheme of the invention is not limited to the technical means disclosed by the embodiment, and also comprises the technical scheme formed by any combination of the technical features. It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. The network performance comprehensive optimization method for the Loongson 3A domestic software and hardware platform is characterized by comprising the following steps:

2. The method for comprehensively optimizing the network performance of the Loongson 3A domestic software and hardware platform according to claim 1, wherein the process of forwarding the data by the optimally driven network card comprises the following steps:

3. The method for comprehensively optimizing network performance of a Loongson 3A domestic software and hardware platform according to claim 2, wherein the method is characterized in that a timer is triggered to count when the hardware interrupt is closed, and data is sent to a protocol stack when the count reaches a delay time threshold.

4. The network performance comprehensive optimization method for Loongson 3A domestic software and hardware platform according to claim 1, wherein the optimization method for interrupt rotation load balancing comprises the following steps:

5. The network performance comprehensive optimization method for the Loongson 3A domestic software and hardware platform according to claim 1, wherein the cache lock locks a skb_buff data structure.

6. The method for comprehensively optimizing network performance of a Loongson 3A domestic software and hardware platform according to claim 5, wherein the cache lock locks header information in a data packet buffer.