CN113051057A - Multithreading data lock-free processing method and device and electronic equipment

Info

Publication number: CN113051057A
Application number: CN202110341793.3A
Authority: CN (China)
Prior art keywords: thread, data, queue, message queue, memory
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 何全安
Current Assignee: Lenovo Beijing Ltd
Original Assignee: Lenovo Beijing Ltd
Application filed by Lenovo Beijing Ltd; priority to CN202110341793.3A
Publication of CN113051057A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Communication Control (AREA)

Abstract

The application provides a multithreading data lock-free processing method, a multithreading data lock-free processing device and an electronic device. A message queue is created for each group consisting of one production thread and one consumption thread, so that a production thread or consumption thread can quickly cache data to, or scan data from, its own message queue without waiting for any other concurrent thread, and an operation succeeds at once as long as the queue is not full or not empty, respectively. Therefore, the queues are completely lock-free during multi-thread concurrent operation, the waste of CPU resources is reduced, and system performance is improved.

Description

Multithreading data lock-free processing method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a lock-free processing method and apparatus for multithreading data, and an electronic device.
Background
With the rapid growth of big data, demands have become more and more complex and diverse, and multi-threaded concurrent transmission, computation, processing and storage have become essential. Multithreading concurrency inevitably creates resource contention problems.
Conventionally, resource contention is resolved by protecting critical resources with thread locks and similar mechanisms, but such locks cause CPU resources to be preempted and to sit waiting, which reduces the processing capacity of the whole system and increases its hardware cost.
Disclosure of Invention
In view of the above, to solve the above problems, the present application provides a lock-free processing method and apparatus for multithread data, and an electronic device, where the technical scheme is as follows:
one aspect of the present application provides a lock-free processing method for multithreading data, including:
creating a plurality of message queues, wherein each message queue corresponds to a group of production threads and consumption threads;
when the target production thread runs, caching the data of the target production thread into a corresponding message queue;
and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
Preferably, the creating a plurality of message queues includes:
obtaining thread configuration information, wherein the number n of production threads in the thread configuration information is equal to or different from the number m of consumption threads, and both n and m are integers greater than or equal to 1;
creating a matrix queue, wherein the number of array elements of the matrix queue is n × m, one array element is one message queue, one production thread corresponds to i message queues, one consumption thread corresponds to j message queues, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
Preferably, the scanning of data in the message queues corresponding to the consumption thread comprises:
obtaining a weight factor of each corresponding message queue, wherein the weight factor can represent the scanned priority of the message queue;
and sequentially scanning the data in each corresponding message queue according to the priority represented by the weight factor.
Preferably, the obtaining of the weight factor of each corresponding message queue includes:
acquiring attribute information of each corresponding message queue;
and calculating the weight factor of the message queue according to the attribute information.
Preferably, the attribute information includes a data amount and the time elapsed since the message queue was last scanned, and both are proportional to the weight factor.
Preferably, if n = m, one production thread uniquely corresponds to one of the message queues in the matrix queue, and one consumption thread uniquely corresponds to one of the message queues in the matrix queue.
Preferably, the message queue includes a data memory queue and an idle memory queue, and the memory block of the idle memory queue is obtained from a general memory band;
correspondingly, the caching of the production thread's data into its corresponding message queue includes:
for each corresponding message queue, obtaining a first available memory block from an idle memory queue of the message queue, writing the data into the first available memory block, and mounting the first available memory block into a data memory queue of the message queue;
correspondingly, the scanning of data in the message queues corresponding to the consumption thread comprises:
for each corresponding message queue, obtaining a second available memory block from the data memory queue of the message queue, and after the second available memory block is scanned, releasing its memory and mounting it into an idle memory queue of the message queue.
Preferably, the method for releasing the memory includes:
obtaining the accumulated data volume of the historical available memory block and the second available memory block of the unreleased memory;
and if the accumulated data volume meets a preset release condition, releasing the memories of the historical available memory block and the second available memory block in batches.
Another aspect of the present application provides a multithreading data lock-less processing apparatus, including:
the queue creating module is used for creating a plurality of message queues, and each message queue corresponds to a group of production threads and consumption threads;
the thread running module is used for caching the data of the target production thread into a corresponding message queue when the target production thread runs; and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
Yet another aspect of the present application provides an electronic device, including:
at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, wherein the program is used for realizing any one of the multithreading data lock-free processing methods.
By the technical scheme, a message queue is created for each group consisting of one production thread and one consumption thread, so that the production thread/consumption thread can quickly cache data to/scan data from that message queue without waiting for other concurrent threads, and an operation succeeds at once as long as the queue is not full/empty. Therefore, the queues are completely lock-free during multi-thread concurrent operation, the waste of CPU resources is reduced, and system performance is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for lock-free processing of multithreaded data according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for processing multithreaded data without lock according to a second embodiment of the present application;
fig. 4 is a schematic diagram of a matrix queue according to an embodiment of the present application;
fig. 5 is a flowchart of a method of lock-free processing of multithreaded data according to a third embodiment of the present application;
FIG. 6 is a flowchart of the system operation provided by an embodiment of the present application;
FIG. 7 is a diagram of a software architecture provided by an embodiment of the present application;
FIG. 8 is a flowchart of a work flow provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of a multithread data lock-less processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
For the convenience of understanding the present application, the related concepts in the present application will be explained first:
in the thread world, a production thread is a thread that produces data (in the role of producer) and a consumption thread is a thread that consumes data (in the role of consumer). The production thread only takes care of posting data to the message queue and no care of who to fetch, and the message thread only needs to fetch data from the message queue regardless of who posted. In this way, neither the producing thread nor the consuming thread is aware of the existence of the other.
With the advancement of science and technology, high-frequency multi-core processors have become the development target for CPUs. With the rapid growth of big data, demands have become more and more complex and diverse, and multi-threaded concurrent transmission, computation, processing and storage have become essential. Multithreading concurrency inevitably creates resource contention problems. Conventionally, resource contention is resolved by protecting critical resources with thread locks and similar mechanisms, but such locks cause CPU resources to be preempted and to sit waiting, which reduces the processing capacity of the whole system and increases its hardware cost.
The most common existing solution for multi-thread resource sharing is the lock-free queue, whose basic principle is to use the CAS (Compare And Swap) technique. CAS is an atomic operation instruction used to guarantee data consistency. The instruction has three parameters: the current memory value V, the old expected value A and the update value B. If and only if the expected value A is the same as the memory value V, the memory value is modified to B and true is returned; otherwise nothing is done and false is returned, and a do…while loop tries again until it succeeds.
Therefore, with a single production thread and a single consumption thread, as long as the queue is not full or empty, an operation succeeds at once; but with multiple production threads and multiple consumption threads, an operation does not always succeed and must be retried after a failure. Although there is no lock, the while-loop retries still waste CPU resources and reduce the performance of the system.
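A minimal sketch of the CAS-based enqueue just described (an illustration only, not the patent's code; the type and function names are invented): when several producers race on the same head index, the do…while loop spins until the CAS succeeds, which is exactly the retry cost the application seeks to avoid.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative multi-producer enqueue built on CAS; simplified (a real
 * MPMC queue also needs per-slot publication), shown only to make the
 * do...while retry visible. */
typedef struct {
    void        *slots[1024];
    atomic_uint  head;          /* next slot to write */
    atomic_uint  tail;          /* next slot to read  */
} mp_queue;

static bool mp_enqueue(mp_queue *q, void *item)
{
    unsigned h, t;
    do {
        h = atomic_load(&q->head);
        t = atomic_load(&q->tail);
        if (h - t >= 1024)      /* queue full: caller must retry later */
            return false;
        /* CAS(V, A, B): only succeeds if head is still h; if another
         * producer won the race, loop and try again. */
    } while (!atomic_compare_exchange_weak(&q->head, &h, h + 1));
    q->slots[h % 1024] = item;  /* slot h now belongs exclusively to us */
    return true;
}
```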
To address these problems and shortcomings of existing solutions, the present application provides a high-performance multithreading data lock-free processing scheme that is completely lock-free when multiple producers and multiple consumers operate concurrently, completes production and consumption successfully in one attempt, and needs no do…while loop retries, so that the performance of the system is improved, and the more threads there are, the more pronounced the performance difference becomes. The scheme can be encapsulated as an independent component, and complex requirements can be met simply through parameter setting and interface calls.
The present application provides a lock-free processing method for multi-thread data, which may be applied to an electronic device, and referring to a hardware structure block diagram of the electronic device shown in fig. 1, the hardware structure of the electronic device may include: a processor 11, a communication interface 12, a memory 13 and a communication bus 14;
in the embodiment of the present application, the number of the processor 11, the communication interface 12, the memory 13 and the communication bus 14 is at least one, and the processor 11, the communication interface 12 and the memory 13 complete mutual communication through the communication bus 14.
The processor 11 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, etc.
The memory 13 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, for example, at least one disk memory.
The memory 13 stores applications and data generated by the applications, and the processor 11 executes the applications to implement the following functions:
creating a plurality of message queues, wherein each message queue corresponds to a group of production threads and consumption threads; when the target production thread runs, caching the data of the target production thread into a corresponding message queue; and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
It should be noted that the processor performs refinements and extensions of the implemented functions of the application program, as described below.
An embodiment of the present application provides a lock-free processing method for multithreaded data, which is shown in a method flowchart shown in fig. 2, and includes the following steps:
s101, a plurality of message queues are created, and each message queue corresponds to a group of production threads and consumption threads.
In the embodiment of the application, one message queue corresponds to exactly one group consisting of one production thread and one consumption thread, i.e. a single producer and a single consumer, and a message queue can only be shared by that unique producer-consumer group. This ensures that data in the queue is produced only by the production thread within the group and consumed only by the consumption thread within the group.
The lock-free design avoids multiple do … while attempts possibly caused by CAS atomic operation, thereby better utilizing the CPU and improving the processing capacity of the whole system.
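A minimal sketch of such a single-producer/single-consumer ring (an assumption-level illustration with a power-of-two capacity, not lifted from the patent): because only one thread ever advances the head and only one ever advances the tail, a plain bounds check followed by a store is enough, with no lock and no CAS retry loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024               /* assumed power of two */

/* Minimal SPSC ring: head is written only by the producer, tail only by
 * the consumer, so each operation either succeeds at once or reports
 * full/empty -- there is nothing to retry. */
typedef struct {
    void         *slots[RING_SIZE];
    atomic_uint   head;              /* written only by the producer */
    atomic_uint   tail;              /* written only by the consumer */
} spsc_ring;

static bool spsc_push(spsc_ring *r, void *item)       /* producer side */
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SIZE)
        return false;                                  /* full: fail once, no spin */
    r->slots[h % RING_SIZE] = item;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_ring *r, void **item)        /* consumer side */
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (h == t)
        return false;                                  /* empty: fail once, no spin */
    *item = r->slots[t % RING_SIZE];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}
```

In the application's terms, each [production thread, consumption thread] group would own one such queue, so every push or pop either succeeds at once or simply reports full/empty.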
It should be noted that for a set of production threads and consumption threads, this means that a combination of a single producer and a single consumer is unique. For example, [ production thread 1, consumption thread 1], [ production thread 1, consumption thread 2], [ production thread 2, consumption thread 1], [ production thread 2, consumption thread 2] are four sets of production threads and consumption threads.
It should be further noted that the multiple message queues created in the present application divide among themselves the system resources that the single lock-free queue occupies in the existing scheme, and the message-queue design isolates memory resources, which greatly simplifies the design and management of multi-thread memory resources. Of course, the amount of memory resources occupied by each message queue is not limited in this embodiment of the present application.
And S102, caching the data of the target production thread into a corresponding message queue when the target production thread runs.
In the embodiment of the application, for a running target production thread, at least one message queue corresponding to the running target production thread is determined firstly, and then data generated by the running target production thread is cached in the target message queue according to a certain scheduling algorithm.
Continuing with the above example, assume that four sets of [ production thread 1, consumption thread 1], [ production thread 1, consumption thread 2], [ production thread 2, consumption thread 1], [ production thread 2, consumption thread 2] correspond to message queue 1, message queue 2, message queue 3, and message queue 4, respectively.
Assuming that the running production thread is production thread 1, the corresponding message queue for production thread 1 includes message queues 1 and 2, so that it can buffer one piece of data to message queue 1 or 2 every time the data is produced according to the corresponding scheduling algorithm. Of course, the same is true for the operation of other production threads, which are not illustrated, and are not described herein again. Multiple production threads run simultaneously and do not affect each other.
It should be noted that the scheduling algorithm corresponding to the production thread may be an RR scheduling algorithm or a percentage scheduling algorithm. Specifically, the method comprises the following steps:
continuing with the example of the production thread 1, when it uses the RR scheduling algorithm, each piece of data produced by it is sequentially buffered in the message queues 1 and 2, that is, the 1 st piece of data is buffered in the message queue 1, the 2 nd piece of data is buffered in the message queue 2, and the 3 rd piece of data is buffered in the message queue 1 … ….
When the production thread 1 uses the percentage scheduling algorithm, since the percentage scheduling algorithm allocates the buffer ratio to the message queues 1 and 2 in advance, the production thread 1 buffers each piece of data to be produced into the message queues 1 and 2 according to the buffer ratio. Taking the buffer ratio of 1:2 as an example, the production thread 1 will buffer the 1 st data into the message queue 1, the 2 nd to 3 rd data into the message queue 2, the 4 th data into the message queue 1, and the 5 th to 6 th data into the message queue 2 … ….
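A hedged sketch of the dispatch just described (the structure and names are illustrative, reusing the spsc_ring type from the earlier sketch): RR cycles through the producer's queues one by one, while the percentage algorithm expands a pre-assigned ratio such as 1:2 into a repeating pattern.

```c
/* Illustrative dispatch of one production thread across its queues.
 * RR: the queue index cycles 0, 1, ..., nq-1.  Percentage: a 1:2 ratio
 * over two queues is pre-expanded into the pattern {0, 1, 1} and cycled. */
typedef struct {
    spsc_ring **queues;     /* the queues owned by this producer           */
    int         nq;
    const int  *pattern;    /* e.g. {0, 1, 1} for a 1:2 percentage split   */
    int         pat_len;
    unsigned    counter;    /* how many pieces of data produced so far     */
} producer_sched;

/* Round-robin: the k-th piece of data goes to queue k mod nq. */
static spsc_ring *rr_pick(producer_sched *s)
{
    return s->queues[s->counter++ % s->nq];
}

/* Percentage: the k-th piece follows the pre-expanded ratio pattern. */
static spsc_ring *pct_pick(producer_sched *s)
{
    return s->queues[s->pattern[s->counter++ % s->pat_len]];
}
```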
And S103, scanning the data in the corresponding message queue when the target consumption thread runs.
In the embodiment of the application, for a running target consumption thread, at least one message queue corresponding to the running target consumption thread is determined at first, and then data in the target message queue is scanned according to a certain scheduling algorithm for consumption.
Continuing with the above example, assuming that the consuming thread in operation is consuming thread 1, the corresponding message queue for consuming thread 1 includes message queues 1 and 3, so that it can scan the data in message queue 1 or 3 every time it consumes one piece of data according to the corresponding scheduling algorithm. Of course, the same is true for the operation of other consumption threads, which are not illustrated, and are not described herein again. Multiple consuming threads run simultaneously and do not affect each other.
It should be noted that the scheduling algorithm corresponding to the consuming thread may also be an RR scheduling algorithm or a percentage scheduling algorithm. Specifically, the method comprises the following steps:
Continuing with the example of consumption thread 1, when it uses the RR scheduling algorithm, it scans each piece of data to consume from message queues 1 and 3 in turn, i.e. it scans the 1st piece from message queue 1, the 2nd piece from message queue 3, the 3rd piece from message queue 1, the 4th piece from message queue 3, and so on.
When consumption thread 1 uses the percentage scheduling algorithm, the algorithm allocates a scanning ratio to message queues 1 and 3 in advance, and consumption thread 1 scans the data in message queues 1 and 3 according to that ratio. Taking a scanning ratio of 2:1 as an example, consumption thread 1 scans the 1st piece of data it consumes from message queue 1, the 2nd to 3rd pieces from message queue 3, the 4th piece from message queue 1, and the 5th to 6th pieces from message queue 3.
It should be noted that a "piece", as the unit of measurement for the data produced by a production thread and the data scanned by a consumption thread, may refer to a certain amount of data, or may refer to a complete unit of information that has at least a start mark, such as a message. Of course, other definitions may be adopted for specific application scenarios, and this is not limited in the embodiment of the present application.
Therefore, based on the application, when multiple production threads and multiple consumption threads run concurrently, they do not affect each other, and any thread can quickly finish its caching/scanning operation on its message queue without waiting for other concurrent threads. For a production thread, an operation succeeds at once as long as the message queue on which it performs the caching operation is not full; for a consumption thread, an operation succeeds at once as long as the message queue on which it performs the scanning operation is not empty. The queues are thus completely lock-free under multi-thread concurrent operation.
As an implementation manner of creating multiple message queues, a second embodiment of the present application provides another lock-free processing method for multithreaded data, and referring to a flowchart of the method shown in fig. 3, the method includes the following steps:
s201, obtaining thread configuration information, wherein the number n of production threads in the thread configuration information is equal to or different from the number m of consumption threads, and both n and m are integers greater than or equal to 1.
In the embodiment of the application, when the system is initialized, the thread configuration information is loaded, the information is generated by performing resource management on the thread, and the resource management comprises allocation, management and recovery of the thread.
In addition, the embodiment of the application provides a thread creating scheme under a general scene, the number of production threads and consumption threads is not limited, and no matter whether the production threads and the consumption threads are equal or not, lock-free queues can be realized.
S202, a matrix queue is created, the number of array elements of the matrix queue is n × m, one array element is one message queue, one production thread corresponds to i message queues, one consumption thread corresponds to j message queues, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
In the embodiment of the present application, it is assumed that there are n production threads and m consumption threads, and then an n × m matrix queue is adopted, where one production thread corresponds to m message queues at most, and one consumption thread corresponds to n message queues at most. Of course, the contents of the queues in the matrix queue that are not corresponding to a set of producing and consuming threads are empty.
Take the case of one production thread corresponding to m message queues and one consumption thread corresponding to n message queues as an example. Referring to the matrix queue diagram shown in FIG. 4, the production threads include P1, P2, P3, P4, …, Pn, and the consumption threads include C1, C2, C3, C4, …, Cm. Taking production thread P1 as an example, its corresponding message queues are 11, 12, 13, 14, …, 1m; taking consumption thread C1 as an example, its corresponding message queues are 11, 21, 31, 41, …, n1. Thus, each message queue in the matrix queue has a unique group of production and consumption threads; for example, the group of threads corresponding to message queue 23 is production thread P2 and consumption thread C3.
Based on this, when a production thread runs, the data it generates can be cached into the m queues corresponding to it, while a consumption thread at run time scans data from its corresponding n message queues. Taking production thread P1 as an example and assuming its scheduling algorithm is RR, it buffers the 1st piece of data into message queue 11, the 2nd piece into message queue 12, the 3rd piece into message queue 13, and so on, until after the m-th piece is buffered into message queue 1m, the (m+1)-th piece is buffered into message queue 11 again. Taking consumption thread C1 as an example and assuming its scheduling algorithm is also RR, it scans the 1st piece of data from message queue 11, the 2nd piece from message queue 21, the 3rd piece from message queue 31, and so on, until after the n-th piece is scanned from message queue n1, scanning continues from message queue 11 again.
Therefore, each array element (namely, one annular lock-free FIFO queue Ring) of the matrix queue belongs to a unique pair of production threads and consumption threads, so that the performance is not influenced by the number of the production threads and the consumption threads, and higher processing capacity is achieved.
And S203, when the target production thread runs, caching the data of the target production thread into a corresponding message queue.
And S204, scanning the data in the corresponding message queue when the target consumption thread runs.
It can be seen that the matrix queue is the key of lock-free, and each array element in the matrix queue is a lock-free queue of a single producer and a single consumer. The matrix queue ensures that each pair of production threads and consumption threads uniquely corresponds to one array element in the matrix queue in a multithreading scene, and conversely, each array element is only shared by the unique production threads and the unique consumption threads. The matrix queue ensures that the thread can return data once for production and consumption, and avoids multiple do … while attempts possibly caused by CAS atomic operation, so that the CPU is better utilized, and the processing capacity of the whole system is improved. The isolation of the lock-free matrix queue to the memory resources greatly simplifies the design and management problems of the multi-thread memory resources.
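A sketch of one possible layout of such a matrix queue (an assumption-level illustration reusing the spsc_ring type from the earlier sketch; the field and function names are invented here): producer Pi and consumer Cj always resolve to the same unique element (i, j), so no element is ever touched by more than one producer and one consumer.

```c
#include <stdlib.h>

/* Sketch of the n x m matrix queue: element (i, j) is the single
 * SPSC ring shared only by production thread Pi and consumption
 * thread Cj.  Error handling is omitted for brevity. */
typedef struct {
    int         n;          /* number of production threads  */
    int         m;          /* number of consumption threads */
    spsc_ring  *elems;      /* n * m rings, stored row-major */
} matrix_queue;

static matrix_queue *matrix_create(int n, int m)
{
    matrix_queue *mq = calloc(1, sizeof(*mq));
    mq->n = n;
    mq->m = m;
    mq->elems = calloc((size_t)n * m, sizeof(spsc_ring));
    return mq;
}

/* Producer i and consumer j always agree on the same unique element,
 * so no other thread ever operates on it. */
static spsc_ring *matrix_elem(matrix_queue *mq, int prod_i, int cons_j)
{
    return &mq->elems[prod_i * mq->m + cons_j];
}
```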
As an implementation manner of the step of "scanning data in the corresponding message queue" in the first embodiment and the second embodiment, a third embodiment of the present application provides another lock-free processing method for multithreaded data, and referring to the flowchart of the method shown in fig. 5, the method includes the following steps:
s301, obtaining thread configuration information, wherein the number n of production threads in the thread configuration information is equal to or different from the number m of consumption threads, and both n and m are integers greater than or equal to 1.
S302, a matrix queue is created, the number of array elements of the matrix queue is n × m, one array element is one message queue, one production thread corresponds to i message queues, one consumption thread corresponds to j message queues, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
And S303, caching the data of the target production thread into a corresponding message queue when the target production thread runs.
S304, when the target consumption thread runs, the weight factors of the corresponding message queues are obtained, the weight factors can represent the scanned priorities of the message queues, and the data in the corresponding message queues are scanned in sequence according to the priorities represented by the weight factors.
Generally, the scheduling algorithm used by the consuming thread in scanning data is one of RR scheduling algorithm and percentage scheduling algorithm. However, both scheduling algorithms have limited the scanning order of the message queue, which limits the consumption scenarios of the consumption thread.
In this regard, the embodiment of the present application provides a weighted scheduling algorithm, in which a weight factor describes the scanning priority of each message queue corresponding to a consumption thread: the higher the priority of a message queue, the earlier it is scanned by the consumption thread. The weight factor of a message queue may be set according to the scenario, or determined according to the thread type of the production/consumption thread or the data type of the data cached/scanned by it, which is not limited in this embodiment of the present application.
Further, in order to realize an elastic weight factor, that is, a weight factor with learning ability, so that the consumption thread scans the data in the target message queue according to an elastic weighted scheduling algorithm, in the embodiment of the present application the attribute information of each message queue may first be obtained, and the weight factor of each message queue may then be calculated from that attribute information.
In the embodiment of the present application, the attribute information of a message queue may, on one hand, consider attributes of the message queue itself and, on the other hand, attributes of the production/consumption thread corresponding to it. Of course, for the several message queues corresponding to one consumption thread, the scanning consumption thread is the same, so only the attributes of the production thread corresponding to each message queue need be considered; attributes of the consumption threads may further be considered when comparing different consumption threads to determine the weight factor.
Specifically, to ensure that a message queue with a large data volume is scanned and consumed preferentially while avoiding starvation of message queues with small data volumes, the elastic weight factor in the embodiment of the present application has a self-learning capability and is influenced by two attributes of the message queue: the data volume (i.e., the total amount buffered in the queue) and the time elapsed since the queue was last scanned (i.e., how long ago it was last scheduled); both are directly proportional to the weight factor.
The initial weighting factors of the message queues are the same when the system is initialized for all message queues of a consuming thread. For each message queue, the data volume and the scanning time length of the message queue are updated in real time and stored in the message queue, and the larger the data volume is, the larger the weight factor is, the larger the scanning time length is, and the larger the weight factor is. Specifically, the step length for adjusting the weight factor may be set for the data amount and the scanning duration, that is, the weight factor increases the step length of one data amount every time the data amount increases by 1 unit, otherwise, the weight factor decreases the step length of one data amount every time the data amount decreases by one unit, and similarly, the weight factor increases the step length of one scanning duration every time the scanning duration increases by 1 unit. Of course, the data amount and the scanning time length are not in the same unit, and the step sizes may be the same or different and may be set separately.
Based on this, before scanning data, one consuming thread can select the current target message queue to be scanned by reading the weight factor of each message queue for each corresponding message queue. The value range of the weight factor can be set to be 0.01-1.0, the step length of the data volume and the scanning duration can be set to be 0.01, the weight factor of each message queue is 1.0 when the system is initialized, the consumption thread can complete data scanning and consumption according to an RR scheduling algorithm or a percentage scheduling algorithm, the data volume and the scanning duration of each message queue can change along with the advance of time, correspondingly, the weight factor of each message queue changes along with the step length of 0.01, and the subsequent consumption thread can scan the message queue with the largest scanning weight factor preferentially.
In addition, the message queue can also adjust the weight factor according to the priority of the corresponding production thread, and the higher the priority is, the larger the step length of the weight factor increase is, and the change of the weight factor is faster.
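A sketch of the elastic weight factor update described above (an illustration under assumptions: the 0.01 step and the 0.01 to 1.0 range come from the description, while the exact clamping behavior, the update timing and all names are guesses):

```c
/* Elastic weight factor: a larger backlog and a longer time since the last
 * scan both push the factor up in 0.01 steps; the factor is kept inside
 * the 0.01 .. 1.0 range mentioned in the description. */
typedef struct {
    double weight;          /* 1.0 at system initialization            */
    long   data_volume;     /* pieces currently buffered in the queue  */
    long   last_scan_age;   /* time units since the queue was scanned  */
} queue_stats;

#define WEIGHT_STEP 0.01
#define WEIGHT_MIN  0.01
#define WEIGHT_MAX  1.00

static void update_weight(queue_stats *q, long d_volume, long d_age)
{
    q->data_volume   += d_volume;
    q->last_scan_age += d_age;
    q->weight += WEIGHT_STEP * (double)d_volume    /* +/- one step per unit of data  */
               + WEIGHT_STEP * (double)d_age;      /* + one step per unit of waiting */
    if (q->weight > WEIGHT_MAX) q->weight = WEIGHT_MAX;
    if (q->weight < WEIGHT_MIN) q->weight = WEIGHT_MIN;
}
```

The management thread would call such an update periodically, and a consumption thread would then scan its queues in descending order of weight factor.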
For convenience of understanding, assume that there are n production threads and m consumption threads, and continue with the example of one production thread corresponding to m message queues and one consumption thread corresponding to n message queues. The number of message queues in the matrix queue is n × m; if the maximum size of message queue ij is $a_{ij}$, the maximum total amount of data buffered by the matrix queue is $\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}$.
In one application scenario, the whole system architecture comprises production threads, consumption threads, a management thread, a matrix queue, scheduling algorithms and resource management. Production thread: responsible for producing data; each thread caches its data to the matrix queue according to an RR scheduling algorithm or a percentage scheduling algorithm. Consumption thread: responsible for consuming data; it scans the data in the matrix queue for consumption according to an RR scheduling algorithm, a percentage scheduling algorithm or a weighted scheduling algorithm. Management thread: responsible for periodically updating the matrix queue weight factors. Matrix queue: responsible for caching data, where one array element is processed only by a uniquely determined production thread and a uniquely determined consumption thread. Scheduling algorithm: responsible for the scheduling mode of the production threads and consumption threads. Resource management: responsible for allocating, managing and recycling each thread and the matrix queue, and for binding CPU resources to each thread. The working process is as follows:
see the system workflow diagram shown in fig. 6. Step a (system initialization): loading thread configuration information, calculating the array element number of the matrix queue, and initializing the system; b, creating a matrix queue according to the number of the array elements; step C (create thread): creating a production thread, a consumption thread and a management thread, and binding a CPU for each thread; step D (resource allocation): allocating a unique array element (namely a message queue) to each production thread and each consumption thread, and assigning each thread scheduling algorithm; step E (start thread): starting a management thread, a consumption thread and a production thread in sequence; step F (production data): the production thread produces data to the matrix queue; step G (consumption data): consuming the data in the matrix queue by the consuming thread; step H (management thread): and the management thread periodically updates the weight factors of the matrix queue.
It should be noted that, in the embodiment of the present application, each message queue in the matrix queue has the same data processing manner as that of the lock-free queue in the existing scheme, that is, each message queue in the matrix queue includes a memory queue and a memory band.
In the present application, in the process of caching data of a running target production thread in a target message queue of the running target production thread, an available memory block matched with a data memory to be cached is first obtained in a memory band in the target message queue, and then the data is written into the available memory block, and then the available memory block with the written data is mounted in the memory queue in the target message queue.
Correspondingly, in the process of scanning the data in the target message queue by the running target consumption thread, the target consumption thread firstly obtains the available memory block written with the data from the memory queue in the target message queue, and after the available memory block is scanned, releases the memory of the available memory block, so that the available memory block with empty content is mounted to the memory band in the target message queue.
Moreover, in the present application, each message queue in the matrix queue belongs to a Ring-shaped lock-free FIFO queue Ring, that is, the available memory blocks of the memory queue in the message queue follow the principle of "first-in first-out", that is, the earlier the available memory blocks are mounted to the memory queue, the earlier the available memory blocks are scanned, the memory is released, and the earlier the available memory blocks are mounted to the memory band.
Therefore, the embodiment of the application provides a multithreading data lock-free processing scheme under a general scene, when multiple production threads and multiple consumption threads run concurrently, the multiple production threads and the multiple consumption threads cannot affect each other, and any thread can quickly finish caching/scanning operation on a message queue without waiting for other concurrent threads. In addition, the consumption thread can also perform targeted scanning based on the priority of the message queue, so that the message queue with large data volume can be guaranteed to be preferentially scanned and consumed, and meanwhile, the message queue with small data volume is prevented from being starved.
With the rapid development of information network technologies such as mobile networks and the Internet of Things, network data from broadband devices and mobile terminals is growing at an extremely fast pace, cloud storage, cloud computing and network technologies are becoming ever more tightly integrated, and cloud-network convergence is gradually becoming the future development trend. As networks interconnect, in order to provide users with multi-dimensional, multi-level real-time services and technical support, a large number of server nodes must be scaled out horizontally, data interaction among nodes becomes denser and denser, and improving network transmission capability becomes both more important and more difficult.
The most common existing solution for data transmission uses the DPDK (Data Plane Development Kit) suite. The core of the kit lies in its excellent transmission performance: by using the DPDK PMD (Poll Mode Driver) polling technique, the CPU soft interrupts and context switches generated by network-card packet receiving and sending are avoided; and through the user-space driver, data copying and system calls between the kernel and the user layer are avoided. The QDMA driver is one of the high-efficiency drivers under DPDK and is mainly used for data transmission with FPGA high-speed boards; it adopts a Queue technique, writes data from PCIE directly into pre-allocated memory blocks through DMA, and then hands the data up to a user-layer queue. The QDMA driver applies for and releases memory blocks from a general memory pool every time; although the lock-free technique of DPDK atomic operations is adopted, with multiple queues there are still concurrent conflicts over multi-thread resources and more frequent spin-like retry attempts occur, so CPU resources are wasted, the transmission capability of the network is reduced, and the effect is most noticeable during 200G ultra-high-speed transmission.
In the QDMA driving scenario, data produced by one production thread can only be consumed by one consumption thread, and one consumption thread only consumes the data produced by one production thread. Therefore, as an implementation manner of creating the matrix queue, an embodiment of the present application provides another lock-free processing method for multithreaded data, in which:
and if n is m, one production thread uniquely corresponds to one message queue in the matrix queues, and one consumption thread uniquely corresponds to one message queue in the matrix queues.
In the embodiment of the application, the number of the production threads and the number of the consumption threads are equal, the production threads and the consumption threads are grouped according to an application scene, one production thread corresponds to one consumption thread, and one consumption thread corresponds to one production thread. Continuing with the matrix queue diagram shown in fig. 4. Two of the production threads and two of the consumption threads are illustrated:
assuming that a producing thread P1 and a consuming thread C1 are grouped and a producing thread P2 and a consuming thread C2 are grouped, [ producing thread P1, consuming thread C1] corresponds to message queue 11 and [ producing thread P2, consuming thread C2] corresponds to message queue 2. That is, the production thread P1 can only buffer its data into the message queue 11 (the message queues 12-1 m are not written with data and the contents are empty), and the consumption thread C1 can only scan data from the message queue 11 (no longer scan the message queues 21-n 1). Similarly, the production thread P2 can only buffer its data into the message queue 22 during operation (the message queues 21, 23-2 m are not written with data, and the contents are all empty), and the consumption thread C2 can only scan data from the message queue 22 (the message queues 12, 32-n 2 are not scanned again).
Further, aiming at the defects of the QDMA drive, the scheme for further improving the QDMA drive performance under the condition of high throughput is provided, so that resource competition conflict caused when the drive applies for and releases the memory can be effectively avoided, and the transmission throughput of the drive is improved.
Method one: paired dedicated memory queues. In the present application, two dedicated ring memory queues are built for each queue channel arranged at the application layer, replacing the common memory pool shared by all queues in the prior art. This model ensures that at any moment only one producer and one consumer use a given queue resource, so that in a multi-thread scenario the performance degradation caused by resource competition during memory application and release is avoided, the effective utilization of the CPU is higher, and the throughput of DPDK transmission is improved. The paired memory queues allow the upper-layer application to consume data FIFO (first in, first out) while retaining part of the memory, thereby avoiding a strict FIFO limitation on release: the waiting that sequential release of upper-layer application memory would otherwise cause in special cases is avoided, real-time recovery of resources is ensured, and the operating efficiency of the driver is improved. The details are as follows:
the message queue comprises a data memory queue and an idle memory queue, and memory blocks of the idle memory queue are obtained from the general memory band. That is, the free memory queues of all message queues are derived from the same common memory pool, i.e., the common memory band.
In the process of caching the data of the running target production thread in the corresponding message queue, firstly, at least one available memory block matched with the data memory to be cached is obtained from the idle memory queue of the message queue, and then the data is written into the at least one available memory block, and then the at least one available memory block with the written data is mounted in the data memory queue of the message queue.
It should be noted that, in the present application, the free memory queues in each message queue are all derived from the general memory band, so that the memory amount of the available memory block in the free memory queue is determined when the free memory queue is generated. In an idle memory queue in a message queue, memories of available memory blocks may be the same or different, which is not limited in this embodiment of the present application.
In the process of scanning data in a message queue corresponding to a running target consumption thread, at least one available memory block with data written therein is obtained from a data memory queue of the message queue, and after the scanning of the at least one available memory block is finished, the memory of the at least one available memory block is released, so that the at least one available memory block with empty content is mounted to an idle memory queue in the message queue.
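A sketch of that produce/consume exchange over the paired queues (an illustration under assumptions, reusing the spsc_ring type from the earlier sketch; the block layout and function names are invented): the producer takes a free block, fills it, and mounts it on the data queue; the consumer takes a data block, processes it, and eventually returns it to the free queue.

```c
#include <stdbool.h>
#include <string.h>

#define BLOCK_PAYLOAD 2048          /* illustrative block size */

typedef struct {
    size_t len;
    char   payload[BLOCK_PAYLOAD];
} mem_block;

typedef struct {
    spsc_ring free_q;   /* empty blocks, refilled from the general memory band */
    spsc_ring data_q;   /* blocks that currently hold produced data            */
} channel;

static bool produce(channel *ch, const void *data, size_t len)
{
    void *blk;
    if (len > BLOCK_PAYLOAD)
        return false;                          /* would not fit in one block    */
    if (!spsc_pop(&ch->free_q, &blk))          /* 1. get a free memory block    */
        return false;
    mem_block *b = blk;
    b->len = len;
    memcpy(b->payload, data, len);             /* 2. write the data             */
    return spsc_push(&ch->data_q, b);          /* 3. mount onto the data queue  */
}

static bool consume(channel *ch, void (*handle)(mem_block *))
{
    void *blk;
    if (!spsc_pop(&ch->data_q, &blk))          /* 1. get a filled block          */
        return false;
    handle(blk);                               /* 2. scan / consume the data     */
    return spsc_push(&ch->free_q, blk);        /* 3. return it to the free queue */
}
```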
In the embodiment of the application, dual ring memory queues (Ring) are adopted: when the driver is initialized, two independent ring memory queues are allocated for each DPDK upper-layer application receiving queue (Rx) or sending queue (Tx), one Ring for idle memory management and the other for data memory management. The Ring size is 2^N (N >= 8), and each Ring is of the single-producer/single-consumer type, which guarantees that even with many queues a Ring is operated by at most one production thread and one consumption thread at the same time, avoiding the cross-queue resource competition caused by multithreading and thereby improving the performance of the QDMA driver. Because dual-queue management is adopted, this mode supports out-of-order use of memory blocks well, avoids the situation where releasing the next memory block must wait for the previous block to be released, and can further improve driver performance.
Method two: batch memory backfill. A backfill is needed after each consumption, and the backfill result descriptor must notify hardware such as the board card through a register update; updating too frequently noticeably affects transmission efficiency. By adopting a batch backfill technique, the number of updates can be reduced several-fold, so transmission throughput increases and driver performance improves. Memory application and release also adopt a packed batch-operation strategy; batch operation reduces the driver's access frequency to the ring resource queues and further improves driver performance.
Practice proves that the update frequency of the board-card hardware resource register descriptor is inversely proportional to transmission performance, and the performance difference approaches 40% at a traffic of 100G. The method and device can adaptively backfill the number of idle memory blocks, reduce the update frequency as much as possible, and keep transmission throughput maximized.
In the process that a running target consumption thread releases a memory once, historical available memory blocks of the memory which is not released before and the accumulated data volume of the available memory blocks to be released at this time are obtained firstly; and if the accumulated data volume meets the preset release condition, releasing the memory of all the available memory blocks in batches. Otherwise, skipping the releasing of the memory.
In the process of one consumption of data by a running target consumption thread, if the accumulated volume of consumed data is large enough, the idle memory blocks (the available memory blocks of the idle memory queue) are automatically replenished in one batch; the backfill amount is 2^N (N >= 4), satisfying 2^N < accumulated data volume < 2^(N+1). If the accumulated data volume is small, the release is skipped this time and the idle memory blocks are replenished the next time the condition is met; if the release has been skipped N times, all available memory blocks whose memory is waiting to be released may be forcibly backfilled. This greatly reduces the frequency of descriptor updates and resource requests, and greatly improves QDMA driver performance.
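A sketch of that accumulate-then-backfill decision (the power-of-two thresholds follow the text; the bookkeeping structure, the skip limit of 16 and the function name are assumptions):

```c
/* Releases are skipped and accumulated until the pending volume reaches a
 * power-of-two batch size (at least 2^4 blocks), then a whole batch is
 * returned at once; after too many skips everything pending is forced out. */
#define MIN_BATCH_SHIFT 4            /* smallest batch is 2^4 = 16 blocks */
#define MAX_SKIPS       16           /* assumed limit before a forced backfill */

typedef struct {
    unsigned pending;                /* blocks consumed but not yet released   */
    unsigned skips;                  /* consecutive times release was skipped  */
} backfill_state;

static unsigned try_backfill(backfill_state *s, unsigned newly_consumed)
{
    s->pending += newly_consumed;

    /* largest power-of-two batch not exceeding the accumulated volume */
    unsigned batch = 1u << MIN_BATCH_SHIFT;
    while ((batch << 1) <= s->pending)
        batch <<= 1;

    if (s->pending >= (1u << MIN_BATCH_SHIFT) || ++s->skips >= MAX_SKIPS) {
        unsigned released = (s->pending >= (1u << MIN_BATCH_SHIFT))
                          ? batch       /* normal batch backfill            */
                          : s->pending; /* forced backfill after many skips */
        s->pending -= released;
        s->skips = 0;
        return released;             /* caller replenishes this many free blocks */
    }
    return 0;                        /* too little accumulated: skip this time */
}
```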
Based on the above memory batch release strategy, the present application may further adopt a memory batch application strategy, that is, the memory of the available memory block applied by the running target production thread before writing data may be higher than the memory of the current cache data. For each queue channel arranged in an application layer, the application and release of resources are all operations of completing a plurality of blocks in a packet mode at one time.
See the software architecture diagram shown in fig. 7. The overall software architecture can be divided into large 3 layers: a Linux kernel layer, a DPDK layer and an application layer. A Linux kernel layer: the Linux operating system kernel runs in the kernel state of the system and is the basis of the whole software system; an application layer: the DPDK application program runs in a Linux user mode and is mainly responsible for processing and transceiving actions of service data; DPDK layer: the system is in charge of bypassing the Linux kernel protocol and efficiently transmitting data between hardware and an application layer.
The DPDK layer comprises a number of components; the ones most directly involved in the present application are EAL, MBUF, MEMPOOL, RING and TIMER. EAL: the environment abstraction layer, used to initialize the DPDK environment for the application program. MBUF: the network message buffer management component, which provides the QDMA driver with interfaces for creating and releasing data buffer blocks. MEMPOOL: the memory pool management component, which provides memory block objects for the QDMA driver queues and data transmission. RING: ring buffer management, which provides lock-free ring queues for the QDMA driver. TIMER: responsible for providing accurate timing services for the QDMA driver. The QDMA driver is responsible for receiving and transmitting network data; when the driver starts, it initializes a general memory pool and then initializes two independent ring memory queues for each receiving queue and each sending queue.
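A hedged sketch of that per-queue initialization using the public DPDK ring API (the ring names, the 1024-entry size and the helper function are assumptions, not the driver's actual code; the single-producer/single-consumer flags are what make each ring a per-pair lock-free queue):

```c
#include <stdio.h>
#include <rte_ring.h>
#include <rte_lcore.h>

/* Give one Rx/Tx queue channel its two private SPSC rings: one for free
 * memory blocks and one for data blocks, as described above. */
struct queue_channel {
    struct rte_ring *free_ring;   /* idle memory blocks   */
    struct rte_ring *data_ring;   /* filled memory blocks */
};

static int channel_init(struct queue_channel *ch, unsigned queue_id)
{
    char name[RTE_RING_NAMESIZE];
    const unsigned size  = 1u << 10;                         /* 2^N entries, assumed */
    const unsigned flags = RING_F_SP_ENQ | RING_F_SC_DEQ;    /* single producer/consumer */

    snprintf(name, sizeof(name), "free_ring_%u", queue_id);
    ch->free_ring = rte_ring_create(name, size, rte_socket_id(), flags);

    snprintf(name, sizeof(name), "data_ring_%u", queue_id);
    ch->data_ring = rte_ring_create(name, size, rte_socket_id(), flags);

    return (ch->free_ring && ch->data_ring) ? 0 : -1;
}
```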
See the workflow diagram shown in fig. 8. Step A (device initialization): DPDK initializes the operating environment. Step B (memory allocation): as in the first step of the figure, a general memory band is allocated for the DPDK driver, then an idle memory queue is allocated, and memory blocks are obtained from the general memory band in sequence and mounted onto the idle memory queue. Step C (queue configuration): the port queue is configured and assigned a unique queue number; a data memory queue is further allocated, memory blocks are obtained from the general memory band in sequence and mounted onto the data memory queue, a dedicated idle memory queue and data memory queue are allocated to each queue channel arranged at the application layer, and the hardware-related descriptors are initialized. Step D (device start): the DPDK device is started and all application-layer threads are launched. Step E (resource application): as in the third and fourth steps of the figure, when DPDK transmits or receives data, an available memory block is applied for from the idle memory queue. Step F (data assembly): the data to be transmitted or received is assembled into the newly applied available memory block according to the mbuf structure. Step G (resource mounting): as in the fifth step of the figure, the assembled memory block is mounted onto the data memory queue. Step H (resource consumption): the consumption thread obtains a consumable memory block from the data memory queue for processing. Step I (resource release): as in the sixth step of the figure, the memory blocks whose memory has been released by the consumption thread are mounted onto the idle memory queue.
Corresponding to the above multithreading data lock-free processing method, the present application also discloses a multithreading data lock-free processing apparatus. As shown in fig. 9, the multithreading data lock-free processing apparatus includes:
a queue creating module 10, configured to create a plurality of message queues, where each message queue corresponds to a group of production threads and consumption threads;
the thread running module 20 is used for caching the data of the target production thread into a corresponding message queue when the target production thread runs; and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
In another embodiment of the multithreaded data lock-less processing apparatus disclosed in the present application, the queue creating module 10 is specifically configured to:
obtaining thread configuration information, wherein the number n of production threads in the thread configuration information is equal to or different from the number m of consumption threads, and both n and m are integers greater than or equal to 1; creating a matrix queue, wherein the number of array elements of the matrix queue is n × m, one array element is one message queue, one production thread corresponds to i message queues, one consumption thread corresponds to j message queues, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, the process by which the thread running module 20 scans data in the message queues corresponding to it includes:
obtaining the weight factor of each corresponding message queue, wherein the weight factor can represent the priority with which the message queue is scanned; and sequentially scanning the data in each corresponding message queue according to the priority represented by the weight factor.
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, the process by which the thread running module 20 obtains the weight factor of each message queue corresponding to it includes:
acquiring attribute information of each message queue corresponding to it; and calculating the weight factor of the message queue according to the attribute information.
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, the attribute information includes a data volume and the duration since the message queue was last scanned, and both the data volume and the duration are proportional to the weight factor.
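Under that reading, one possible consumption-thread scan loop looks like the following sketch. It reuses spsc_ring, spsc_pop and spsc_count from the previous listing, and the linear weight formula with a 0.5 coefficient is purely an illustrative choice: the embodiment only requires that a larger buffered data volume and a longer time since the last scan both raise the priority.

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Per-queue attributes tracked by one consumption thread. */
struct queue_attr {
    struct spsc_ring *ring;
    uint64_t          last_scan_ns;  /* when this queue was last scanned */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Weight factor: buffered data volume plus time since the last scan.
 * The linear combination and the 0.5 coefficient are illustrative. */
static double queue_weight(const struct queue_attr *a)
{
    double idle_ms = (double)(now_ns() - a->last_scan_ns) / 1e6;
    return (double)spsc_count(a->ring) + 0.5 * idle_ms;
}

struct scan_entry {
    struct queue_attr *attr;
    double             weight;
};

static int by_weight_desc(const void *pa, const void *pb)
{
    double wa = ((const struct scan_entry *)pa)->weight;
    double wb = ((const struct scan_entry *)pb)->weight;
    return (wb > wa) - (wb < wa);
}

/* One scan round over the n queues owned by a consumption thread:
 * order the queues by weight, then drain them in that order. */
static void scan_round(struct queue_attr *attrs, unsigned n)
{
    struct scan_entry order[n];
    for (unsigned k = 0; k < n; k++) {
        order[k].attr   = &attrs[k];
        order[k].weight = queue_weight(&attrs[k]);
    }
    qsort(order, n, sizeof(order[0]), by_weight_desc);

    for (unsigned k = 0; k < n; k++) {
        void *item;
        while ((item = spsc_pop(order[k].attr->ring)) != NULL) {
            /* ... process the dequeued data here ... */
        }
        order[k].attr->last_scan_ns = now_ns();
    }
}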
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, if n = m, one production thread uniquely corresponds to one message queue in the matrix queue, and one consumption thread uniquely corresponds to one message queue in the matrix queue.
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, the message queue includes a data memory queue and an idle memory queue, and the memory blocks of the idle memory queue are obtained from a general memory pool;
accordingly, the process by which the thread running module 20 caches the data of the target production thread into the corresponding message queue includes:
for each corresponding message queue, obtaining a first available memory block from the idle memory queue of that message queue, writing the data to be cached into the first available memory block, and mounting the first available memory block onto the data memory queue of that message queue;
accordingly, the process by which the thread running module 20 scans data in the corresponding message queue includes:
for each corresponding message queue, obtaining a second available memory block from the data memory queue of that message queue and, after the second available memory block has been scanned, releasing its memory and mounting it back onto the idle memory queue of that message queue.
In another embodiment of the multithreading data lock-free processing apparatus disclosed in the present application, the manner in which the thread running module 20 releases the memory includes:
obtaining the accumulated data volume of the second available memory block and of the historical available memory blocks whose memory has not yet been released; and, if the accumulated data volume meets a preset release condition, releasing the memory of the historical available memory blocks and the second available memory block in a batch.
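A hedged sketch of such a batched release on the DPDK side follows; the threshold constants and the names pending_release and release_batched are illustrative assumptions of ours. Consumed blocks are parked until their accumulated data volume (or count) meets the release condition, and only then is the whole batch remounted onto the idle ring.

#include <rte_mbuf.h>
#include <rte_ring.h>

#define RELEASE_BATCH_MAX   64            /* blocks held back at most        */
#define RELEASE_BYTES_LIMIT (256 * 1024)  /* illustrative release condition  */

/* Blocks that have been consumed but whose memory is not yet released. */
struct pending_release {
    struct rte_mbuf *held[RELEASE_BATCH_MAX];
    unsigned         count;
    size_t           bytes;               /* accumulated data volume */
};

/* Batched variant of step I: park the consumed block; flush the whole
 * batch back to the idle ring only when the accumulated data volume or
 * the block count meets the release condition. */
static void release_batched(struct pending_release *pr, struct rte_ring *idle,
                            struct rte_mbuf *m)
{
    pr->held[pr->count++] = m;
    pr->bytes += rte_pktmbuf_pkt_len(m);

    if (pr->bytes < RELEASE_BYTES_LIMIT && pr->count < RELEASE_BATCH_MAX)
        return;                            /* keep accumulating */

    for (unsigned k = 0; k < pr->count; k++) {
        rte_pktmbuf_reset(pr->held[k]);
        rte_ring_enqueue(idle, pr->held[k]);
    }
    pr->count = 0;
    pr->bytes = 0;
}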
Corresponding to the above multithreading data lock-free processing method, the present application also discloses a storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are used for executing the multithreading data lock-free processing method according to any one of the above embodiments.
The multithreading data lock-free processing method, apparatus and electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the application, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and reference may be made to the method section for the relevant details.
It is further noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A lock-free processing method for multi-thread data comprises the following steps:
creating a plurality of message queues, wherein each message queue corresponds to a group of production threads and consumption threads;
when the target production thread runs, caching the data of the target production thread into a corresponding message queue;
and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
2. The method of claim 1, the creating a plurality of message queues comprising:
obtaining thread configuration information, wherein the number n of production threads in the thread configuration information is equal to or different from the number m of consumption threads, and both n and m are integers greater than or equal to 1;
creating a matrix queue, wherein the number of array elements of the matrix queue is n × m, one array element is a message queue, one production thread corresponds to i message queues, one consumption thread corresponds to j message queues, and 1 ≤ i ≤ m and 1 ≤ j ≤ n.
3. The method of claim 2, the scanning data in a message queue corresponding thereto, comprising:
obtaining a weight factor of each corresponding message queue, wherein the weight factor can represent the priority with which the message queue is scanned;
and sequentially scanning the data in each corresponding message queue according to the priority represented by the weight factor.
4. The method of claim 3, wherein the obtaining the weight factor of each message queue corresponding thereto comprises:
acquiring attribute information of each message queue corresponding thereto;
and calculating the weight factor of the message queue according to the attribute information.
5. The method of claim 4, the attribute information comprising a data volume and the duration since the message queue was last scanned, both the data volume and the duration being proportional to the weight factor.
6. The method of claim 2, wherein if n = m, one production thread uniquely corresponds to one message queue in the matrix queue and one consumption thread uniquely corresponds to one message queue in the matrix queue.
7. The method of claim 6, wherein the message queue comprises a data memory queue and an idle memory queue, and memory blocks of the idle memory queue are obtained from a general memory pool;
correspondingly, the buffering of the data thereof into the message queue corresponding thereto includes:
for each corresponding message queue, obtaining a first available memory block from the idle memory queue of that message queue, writing the data to be buffered into the first available memory block, and mounting the first available memory block onto the data memory queue of that message queue;
correspondingly, the scanning data in the message queue corresponding thereto comprises:
for each corresponding message queue, obtaining a second available memory block from the data memory queue of that message queue and, after the second available memory block has been scanned, releasing its memory and mounting it back onto the idle memory queue of that message queue.
8. The method of claim 7, the manner in which memory is released comprising:
obtaining the accumulated data volume of the second available memory block and of the historical available memory blocks whose memory has not yet been released;
and if the accumulated data volume meets a preset release condition, releasing the memory of the historical available memory blocks and the second available memory block in a batch.
9. A multithreading data lock-free processing apparatus, comprising:
the queue creating module is used for creating a plurality of message queues, and each message queue corresponds to a group of production threads and consumption threads;
the thread running module is used for caching the data of the target production thread into a corresponding message queue when the target production thread runs; and scanning data in the corresponding message queue by the target consumption thread when the target consumption thread runs.
10. An electronic device, comprising:
at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, the program being used for implementing the multithreading data lock-free processing method according to any one of claims 1 to 8.
CN202110341793.3A 2021-03-30 2021-03-30 Multithreading data lock-free processing method and device and electronic equipment Pending CN113051057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110341793.3A CN113051057A (en) 2021-03-30 2021-03-30 Multithreading data lock-free processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113051057A true CN113051057A (en) 2021-06-29

Family

ID=76516426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110341793.3A Pending CN113051057A (en) 2021-03-30 2021-03-30 Multithreading data lock-free processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113051057A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853149A (en) * 2009-03-31 2010-10-06 张力 Method and device for processing single-producer/single-consumer queue in multi-core system
US20160077739A1 (en) * 2014-09-16 2016-03-17 Oracle International Corporation System and method for supporting a low contention queue in a distributed data grid
CN110888727A (en) * 2019-11-26 2020-03-17 北京达佳互联信息技术有限公司 Method, device and storage medium for realizing concurrent lock-free queue
CN112114973A (en) * 2020-09-29 2020-12-22 中国银行股份有限公司 Data processing method and device

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113608716B (en) * 2021-08-13 2022-03-22 北京东方通网信科技有限公司 Lock-free data queue method
CN113608716A (en) * 2021-08-13 2021-11-05 北京东方通网信科技有限公司 Lock-free data queue method
CN113672406B (en) * 2021-08-24 2024-02-06 北京天融信网络安全技术有限公司 Data transmission processing method and device, electronic equipment and storage medium
CN113672406A (en) * 2021-08-24 2021-11-19 北京天融信网络安全技术有限公司 Data transmission processing method and device, electronic equipment and storage medium
CN113672400A (en) * 2021-08-26 2021-11-19 深信服科技股份有限公司 Data processing method, device and equipment and readable storage medium
CN113703939A (en) * 2021-08-30 2021-11-26 竞技世界(北京)网络技术有限公司 Task scheduling method and system and electronic equipment
CN114037312A (en) * 2021-11-17 2022-02-11 闻通(江苏)智能科技有限公司 Subway construction scheduling system resource conflict checking method
CN114358578A (en) * 2021-12-31 2022-04-15 广州佳帆计算机有限公司 Order processing method and device based on message queue
CN115150464A (en) * 2022-06-22 2022-10-04 北京天融信网络安全技术有限公司 Application proxy method, device, equipment and medium
CN115150464B (en) * 2022-06-22 2024-03-15 北京天融信网络安全技术有限公司 Application proxy method, device, equipment and medium
CN115412502A (en) * 2022-11-02 2022-11-29 之江实验室 Network port expansion and message rapid equalization processing method
CN116069526A (en) * 2023-02-08 2023-05-05 北京基调网络股份有限公司 Data access method and computer equipment based on lock-free message pool
CN116069526B (en) * 2023-02-08 2023-12-05 北京基调网络股份有限公司 Data access method and computer equipment based on lock-free message pool
CN117667357A (en) * 2023-12-14 2024-03-08 江苏新质信息科技有限公司 Multi-core concurrent polling scheduling method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113051057A (en) Multithreading data lock-free processing method and device and electronic equipment
JP6381734B2 (en) Graphics calculation process scheduling
US10579388B2 (en) Policies for shader resource allocation in a shader core
JP6228459B2 (en) Optimizing communication of system call requests
US10268609B2 (en) Resource management in a multicore architecture
JP6086868B2 (en) Graphics processing dispatch from user mode
US8963933B2 (en) Method for urgency-based preemption of a process
US10242420B2 (en) Preemptive context switching of processes on an accelerated processing device (APD) based on time quanta
US20120229481A1 (en) Accessibility of graphics processing compute resources
US20120180056A1 (en) Heterogeneous Enqueuing and Dequeuing Mechanism for Task Scheduling
JP5805783B2 (en) Computer system interrupt processing
US20120194526A1 (en) Task Scheduling
WO2012082777A1 (en) Managed task scheduling on an accelerated processing device (apd)
US10255104B2 (en) System call queue between visible and invisible computing devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination