CN117909087B - Data processing method and device, central processing unit and electronic equipment

Info

Publication number
CN117909087B
Authority
CN
China
Prior art keywords
calculation
request data
request
processing
data
Legal status
Active
Application number
CN202410321393.XA
Other languages
Chinese (zh)
Other versions
CN117909087A (en)
Inventor
孙梁
闻广亮
Current Assignee
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Application filed by New H3C Technologies Co Ltd
Priority to CN202410321393.XA
Publication of CN117909087A
Application granted
Publication of CN117909087B
Status: Active


Landscapes

  • Advance Control (AREA)

Abstract

The embodiment of the application provides a data processing method and device, a central processing unit and an electronic device, which relate to the technical field of communications and are applied to a CPU (Central Processing Unit), wherein the CPU enables at least one process and a plurality of request threads are enabled in each process. The method comprises the following steps: receiving calculation request data in parallel through the plurality of request threads; writing the received calculation request data evenly into a plurality of software cache queues through the plurality of request threads; and sending the calculation request data cached in each software cache queue to a hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card performs calculation on the received calculation request data to obtain a calculation result, the software cache queues being in one-to-one correspondence with the processing coroutines. By applying the technical scheme provided by the embodiment of the application, the CPU calculation load can be reduced and the device performance can be improved.

Description

Data processing method and device, central processing unit and electronic equipment
Technical Field
The present application relates to the field of communications technologies, and in particular, to a data processing method, a data processing device, a central processing unit, and an electronic device.
Background
At present, new technologies such as the fifth-generation mobile communication technology (5th Generation Mobile Communication Technology, 5G), cloud computing, artificial intelligence (Artificial Intelligence, AI), high-performance computing and big data are giving rise to a large number of emerging applications, causing the data volume to expand explosively.
Distributed storage is an important data infrastructure and the best foundation for massive applications. However, key calculations in distributed storage, such as erasure coding, compression and the like, are processed by the central processing unit (Central Processing Unit, CPU), resulting in an excessive CPU calculation load that becomes a bottleneck for performance improvement.
Disclosure of Invention
The embodiment of the application aims to provide a data processing method, a data processing device, a central processing unit and electronic equipment, so as to reduce the CPU calculation load and improve the equipment performance. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present application provides a data processing method, which is applied to a CPU, where the CPU enables at least one process, and each process enables a plurality of request threads; the method comprises the following steps:
receiving computing request data in parallel through the plurality of request threads;
the received calculation request data are uniformly written into a plurality of software cache queues through the plurality of request threads;
and sending the calculation request data cached in each software cache queue to a hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card performs calculation on the received calculation request data to obtain a calculation result, the software cache queues being in one-to-one correspondence with the processing coroutines.
In some embodiments, a plurality of request coroutines are enabled in each request thread, each request coroutine corresponds to a preset type of calculation request data, and each software cache queue corresponds to a preset type of calculation request data;
the parallel receiving, by the plurality of request threads, computing request data includes:
Receiving calculation request data of a preset type in parallel through a plurality of request coroutines;
the step of uniformly writing the received calculation request data into a plurality of software cache queues through the plurality of request threads comprises the following steps:
and writing, through each request coroutine, the received calculation request data into the software cache queue corresponding to the preset type.
In some embodiments, the method further comprises:
setting, for each request thread, a first request coroutine to a waiting state, where first calculation request data received by the first request coroutine has been written into a corresponding software cache queue;
Waking up a second request coroutine;
and if the calculation result of the first calculation request data is obtained, waking up the first request coroutine.
In some embodiments, the sending, by the processing coroutine corresponding to each software cache queue, the calculation request data cached in the software cache queue to the hardware accelerator card includes:
polling and waking each processing coroutine, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue;
if no calculation request data is cached in the corresponding software cache queue, setting the currently woken processing coroutine to a waiting state and waking the next processing coroutine;
if calculation request data is cached in the corresponding software cache queue, keeping the currently woken processing coroutine in the awake state, so that it sends the calculation request data cached in the corresponding software cache queue to the hardware accelerator card.
In some embodiments, the method further comprises:
after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, polling and waking each processing coroutine, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data;
if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, setting the currently woken processing coroutine to a waiting state and waking the next processing coroutine;
if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, keeping the currently woken processing coroutine in the awake state, so that it returns the calculation result of the corresponding calculation request data to the corresponding request thread.
In some embodiments, the sending, by the processing coroutine corresponding to each software cache queue, the calculation request data cached in the software cache queue to the hardware accelerator card includes:
acquiring a preset amount of calculation request data from each software cache queue through the processing coroutine corresponding to that software cache queue;
and sending the acquired calculation request data to the hardware accelerator card.
In some embodiments, each thread in the at least one process shares the same CPU core.
In some embodiments, the at least one process is an object storage device (Object Storage Device, OSD) process or a data acceleration engine (Data Speed Engine, DSE) process; the plurality of processing coroutines are located in a Data Plane Development Kit (DPDK) thread; the plurality of request threads are threads that perform erasure coding, deduplication or compression calculation; and the hardware accelerator card is a field-programmable gate array (Field-Programmable Gate Array, FPGA) hardware card or a data processing unit (Data Processing Unit, DPU).
In some embodiments, the CPU is connected to the hardware accelerator card through a plurality of direct memory access (Direct Memory Access, DMA) interfaces; each process is configured with a fixed number of DMA interfaces, such that the difference between the numbers of DMA interfaces corresponding to any two processes is less than or equal to a preset threshold.
In a second aspect, an embodiment of the present application provides a data processing apparatus, which is applied to a CPU, where the CPU enables at least one process, and each process enables a plurality of request threads; the device comprises:
the receiving module is used for receiving calculation request data in parallel through the plurality of request threads;
The writing module is used for uniformly writing the received calculation request data into a plurality of software cache queues through the plurality of request threads;
the sending module is used for sending the calculation request data cached in each software cache queue to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card performs calculation on the received calculation request data to obtain a calculation result, the software cache queues being in one-to-one correspondence with the processing coroutines.
In some embodiments, a plurality of request coroutines are enabled in each request thread, each request coroutine corresponds to a preset type of calculation request data, and each software cache queue corresponds to a preset type of calculation request data;
The receiving module is specifically configured to receive calculation request data of a preset type in parallel through a plurality of request coroutines;
the writing module is specifically configured to write, through each request coroutine, the received calculation request data into the software cache queue corresponding to the preset type.
In some embodiments, the writing module is further configured to set, for each request thread, a first request coroutine to a waiting state, where first calculation request data received by the first request coroutine has been written into a corresponding software cache queue; waking up a second request coroutine; and if the calculation result of the first calculation request data is obtained, waking up the first request coroutine.
In some embodiments, the sending module is specifically configured to:
poll and wake each processing coroutine, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue;
if no calculation request data is cached in the corresponding software cache queue, set the currently woken processing coroutine to a waiting state and wake the next processing coroutine;
if calculation request data is cached in the corresponding software cache queue, keep the currently woken processing coroutine in the awake state, so that it sends the calculation request data cached in the corresponding software cache queue to the hardware accelerator card.
In some embodiments, the sending module is further configured to:
after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, poll and wake each processing coroutine, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data;
if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, set the currently woken processing coroutine to a waiting state and wake the next processing coroutine;
if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, keep the currently woken processing coroutine in the awake state, so that it returns the calculation result of the corresponding calculation request data to the corresponding request thread.
In some embodiments, the sending module is specifically configured to:
acquire a preset amount of calculation request data from each software cache queue through the processing coroutine corresponding to that software cache queue;
and sending the acquired calculation request data to the hardware accelerator card.
In some embodiments, each thread in the at least one process shares the same CPU core.
In some embodiments, the at least one process is an OSD process or a DSE process; the plurality of processing coroutines are located in a DPDK thread; the plurality of request threads are threads that perform erasure coding, deduplication or compression calculation; and the hardware accelerator card is an FPGA hardware card or a DPU.
In some embodiments, the CPU is connected to the hardware accelerator card through a plurality of DMA interfaces; each process is configured with a fixed number of DMA interfaces, such that the difference between the numbers of DMA interfaces corresponding to any two processes is less than or equal to a preset threshold.
In a third aspect, an embodiment of the present application provides a central processing unit, performing any of the method steps of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a CPU and a hardware accelerator card; the CPU enables at least one process, and a plurality of request threads are enabled in each process;
The CPU receives calculation request data in parallel through the plurality of request threads; writes the received calculation request data evenly into a plurality of software cache queues through the plurality of request threads; and sends the calculation request data cached in each software cache queue to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, wherein the software cache queues are in one-to-one correspondence with the processing coroutines;
And the hardware acceleration card calculates the received calculation request data to obtain a calculation result.
In a fifth aspect, in a further embodiment of the present application, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method steps of the first aspect described above.
The embodiment of the application has the beneficial effects that:
In the technical scheme provided by the embodiment of the application, the CPU enables a plurality of request threads and a plurality of processing coroutines to offload calculation tasks. After the CPU receives calculation request data in parallel in batches through the plurality of request threads, the calculation request data is written evenly into the software cache queues, and the calculation request data cached in the software cache queues is sent to the hardware accelerator card through the plurality of processing coroutines. The calculation tasks are thus offloaded onto the hardware accelerator card, and the CPU only needs to perform task scheduling and management, which saves CPU computing resources, reduces the CPU calculation load, makes the CPU run more efficiently, and improves the overall storage performance of the device.
In addition, the CPU receives calculation request data in parallel in batches through the plurality of request threads. In this case, even if one or more request threads are blocked, calculation request data is still cached into the software cache queues by the unblocked request threads, and correspondingly the calculation request data cached in the software cache queues is continuously sent to the hardware accelerator card by the plurality of processing coroutines, so that the hardware accelerator card stays busy, possible CPU time waste is reduced, and the overall storage performance of the device is further improved.
In addition, a software cache queue is a logical queue: what it actually caches is the logical address or physical address of the data, and no data copy is involved. Using the software cache queues therefore reduces data copying in memory; the CPU only needs to store the calculation request data once in memory to complete the offloading of a calculation task, which optimizes the memory access process and further improves the overall storage performance of the device.
Of course, it is not necessary for any one product or method of practicing the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application, and those skilled in the art may obtain other drawings from these drawings.
FIG. 1 is a schematic diagram of a first flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a second flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a third flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a fourth flowchart of a data processing method according to an embodiment of the present application;
FIG. 5 is a fifth flowchart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a cluster management and device management architecture for storage devices according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a memory device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing flow of a request thread according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing flow of a processing coroutine according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a hardware interaction process flow provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art fall within the scope of protection of the present application.
At present, new technologies such as 5G, cloud computing, artificial intelligence, high-performance computing and big data are giving rise to a large number of emerging applications, causing the data volume to expand explosively. Distributed storage is an important data infrastructure and the best foundation for massive applications. However, key calculations in distributed storage, such as erasure coding, deduplication and compression, are processed by the CPU, for example computing the redundancy blocks of erasure codes, computing hash values for deduplication, and performing compression and decompression. This occupies considerable CPU resources, and the excessive calculation load makes the CPU a bottleneck for performance improvement.
In addition, for erasure coding (Erasure Coding, EC) calculation, current distributed storage supports an erasure ratio of at most 16+4 and cannot reach an erasure ratio of 22+2 or higher, so disk utilization cannot be further improved.
In the prior art, the intelligent storage acceleration library (Intelligent Storage Acceleration Library, ISA-L) and the like are used to optimize the EC algorithm at the instruction-set level, which reduces CPU consumption, but the CPU load is still high.
Other schemes, such as Quick Assist Technology (QuickAssist Technology, QAT), have also proposed using a hardware accelerator card for hardware offloading to reduce the CPU load. In these schemes, the offload acceleration method has both a synchronous and an asynchronous mode of operation.
In the synchronous mode, after the CPU, running the calling thread, sends calculation request data to the hardware accelerator card, the calling thread blocks on a semaphore or another synchronization mechanism; after the hardware accelerator card finishes processing, the calculation result is returned to the calling thread, and only then can the calling thread process the next piece of calculation request data. In the synchronous mode there is no context switch during the blocking sleep, so the CPU wastes time while waiting for the calculation result.
In the asynchronous mode, once the CPU sends the calculation request data to the hardware accelerator card, the calculation result of the hardware accelerator card is returned to the calling thread in the form of a control message; some implementations trigger and execute a callback function, while others rely on polling. Before the calculation result is obtained, the CPU cannot continue the subsequent work on that calculation request data, so the CPU may still waste time.
In addition, the above schemes also have problems such as in-memory data copying. Therefore, in the prior art, the software adaptation for processing calculation request data does not allow the CPU to be scheduled to the maximum extent, and the overall performance gain is small.
In order to solve the above problems, an embodiment of the present application provides a data processing method, which is applied to a CPU, where the CPU enables at least one process, and each process enables a plurality of request threads. In the embodiment of the application, a plurality of processing coroutines are also enabled in each process. Referring to FIG. 1, a first flowchart of a data processing method according to an embodiment of the present application is shown, where the data processing method includes the following steps.
In step S11, calculation request data is received in parallel by a plurality of request threads.
Step S12, the received calculation request data is uniformly written into a plurality of software cache queues through a plurality of request threads.
Step S13, sending the calculation request data cached in each software cache queue to a hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card performs calculation on the received calculation request data to obtain a calculation result, the software cache queues being in one-to-one correspondence with the processing coroutines.
In the technical scheme provided by the embodiment of the application, the CPU enables a plurality of request threads and a plurality of processing coroutines to offload calculation tasks. After the CPU receives calculation request data in parallel in batches through the plurality of request threads, the calculation request data is written evenly into the software cache queues, and the calculation request data cached in the software cache queues is sent to the hardware accelerator card through the plurality of processing coroutines. The calculation tasks are thus offloaded onto the hardware accelerator card, and the CPU only needs to perform task scheduling and management, which saves CPU computing resources, reduces the CPU calculation load, makes the CPU run more efficiently, and improves the overall storage performance of the device.
In addition, the CPU receives calculation request data in parallel in batches through the plurality of request threads. In this case, even if one or more request threads are blocked, calculation request data is still cached into the software cache queues by the unblocked request threads, and correspondingly the calculation request data cached in the software cache queues is continuously sent to the hardware accelerator card by the plurality of processing coroutines, so that the hardware accelerator card stays busy, possible CPU time waste is reduced, and the overall storage performance of the device is further improved.
In addition, a software cache queue is a logical queue: what it actually caches is the logical address or physical address of the data, and no data copy is involved. Using the software cache queues therefore reduces data copying in memory; the CPU only needs to store the calculation request data once in memory to complete the offloading of a calculation task, which optimizes the memory access process and further improves the overall storage performance of the device.
In the embodiment of the present application, the process enabled by the CPU may be an OSD process, a DSE process or another process, which is not limited herein.
The calculation request data is data for which calculation is requested, that is, the calculation request data is a calculation task that needs to be offloaded to the hardware accelerator card for processing. The calculation request data may include, but is not limited to, data for which erasure coding calculation is requested (i.e., erasure coding calculation request data), data for which compression calculation is requested (i.e., compression calculation request data), and the like. The calculation request data may be data input by an upper-layer application, or may be data sent to the CPU by another device to request calculation.
A request thread is a thread that performs calculation on calculation request data. The request thread may include, but is not limited to: a thread that performs erasure coding calculation on erasure coding calculation request data, a thread that performs compression calculation on compression calculation request data, and the like. In the embodiment of the application, one request thread can perform one or more kinds of calculation; for example, one request thread can perform only erasure coding calculation, or can perform two or three kinds of calculation among erasure coding, deduplication and compression. The aim of the embodiment of the application is to offload the calculation tasks performed by the request threads to the hardware accelerator card.
A processing coroutine is a coroutine in a processing thread that accesses the software cache queues, thereby realizing the interaction between the CPU and the hardware accelerator card. In the embodiment of the application, the processing coroutines are in one-to-one correspondence with the software cache queues. The plurality of processing coroutines enabled in a process may be located in one or more processing threads. A processing thread is a thread that realizes the interaction between the CPU and the hardware accelerator card. The processing thread enabled in each process may be a DPDK thread or another thread, which is not limited herein.
In the embodiment of the application, in order to offload the CPU's calculation tasks to the hardware accelerator card quickly, a plurality of processing threads can be started in each process, so that the plurality of processing threads work simultaneously and the efficiency of offloading calculation tasks is improved. Alternatively, in order to reduce the threads' occupation of CPU resources, a single processing thread can be started in each process, which can still prevent the calculation tasks offloaded to the hardware accelerator card from waiting to be processed while reducing the threads' occupation of CPU resources.
In the embodiment of the application, in order to further reduce the occupation of CPU resources, each thread in all processes started by the CPU can share the same CPU core, namely, a plurality of request threads and processing threads share the same CPU core, so that the occupation of CPU core resources is reduced.
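As an illustration of this core-sharing arrangement, the following is a minimal sketch of pinning threads to one shared CPU core on Linux with pthreads; the helper name and the specific core number are assumptions, not part of the patent.
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one CPU core.  Calling this for every request thread and
 * every processing thread of a process with the same `core` value makes them
 * all share that single CPU core, as described above. */
static int pin_to_core(pthread_t tid, int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(tid, sizeof(cpu_set_t), &set);
}
```
For example, pin_to_core(pthread_self(), 3) would bind the calling thread to core 3, and repeating the call with the same core for the other threads keeps all of them on that core.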
The hardware accelerator card may be an FPGA hardware card, a DPU, or the like, and is not limited here, as long as it can process the calculation request data. The CPU and the hardware accelerator card can be connected through a plurality of DMA interfaces for data transmission and hardware interaction. Each process enabled by the CPU may be configured with a fixed number of DMA interfaces. The CPU can allocate DMA interfaces in advance to each process it enables according to the number of DMA interfaces supported by the hardware accelerator card and the number of processes enabled by the CPU, so that the numbers of DMA interfaces configured for the processes are balanced, that is, the difference between the numbers of DMA interfaces corresponding to any two processes is less than or equal to a preset threshold.
For example, if the number of DMA interfaces supported by the hardware accelerator card is 128, the number of processes enabled by the CPU is 4, and the preset threshold is 1, the number of DMA interfaces corresponding to each process is 128/4=32. The number of DMA interfaces supported by the hardware accelerator card, the number of processes enabled by the CPU, the value of the preset threshold, and the allocation manner of the DMA interfaces are not limited herein.
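For concreteness, a small sketch of the balanced allocation arithmetic above; the function name and the way any remainder is spread are assumptions used only for illustration.
```c
/* Split `total_dma` DMA interfaces over `num_procs` processes so that any
 * two processes differ by at most one interface (a preset threshold of 1).
 * For 128 interfaces and 4 processes this yields 32 per process. */
static void assign_dma_interfaces(int total_dma, int num_procs, int *per_proc)
{
    int base = total_dma / num_procs;   /* e.g. 128 / 4 = 32           */
    int rem  = total_dma % num_procs;   /* leftover interfaces, if any */

    for (int i = 0; i < num_procs; i++)
        per_proc[i] = base + (i < rem ? 1 : 0);
}
```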
In the embodiment of the application, interaction between the CPU and the hardware accelerator card is completed through the multi-channel direct memory access (Multichannel Direct Memory Access, MCDMA) interface of the DPDK, so calculation request data can be staged on the hardware accelerator card without data copying in the kernel, which reduces the overhead of in-kernel data copies and improves CPU performance. In addition, because the numbers of DMA interfaces configured for the processes are balanced, load balancing among the plurality of processes can be achieved, improving the processing efficiency of the CPU.
The software cache queue is used for caching the data that a calculation task is to send to the hardware accelerator card, that is, for caching calculation request data. In the embodiment of the application, the software cache queue is a logical queue, and caching calculation request data in the software cache queue can be understood as caching the logical address or physical address of the calculation request data. Therefore, in the process of offloading a calculation task, the CPU does not need to copy data in memory, which saves CPU computing resources. In one example, the software cache queue may be a first-in first-out (First In First Out, FIFO) queue, which ensures that data written into the queue first is read first, improving the accuracy of data processing.
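A minimal sketch of such a software cache queue is given below, assuming a fixed-depth FIFO whose slots hold only the address and length of the calculation request data (no copy of the data itself); all type and function names are illustrative, not the patent's.
```c
#include <stddef.h>
#include <stdint.h>

#define SW_QUEUE_DEPTH 256

/* One cached entry: only the address of the request data is stored. */
struct calc_request {
    void    *buf;    /* logical or physical address of the request data */
    size_t   len;    /* length of the request data                      */
    uint32_t type;   /* preset type, e.g. erasure coding / compression  */
};

/* Logical FIFO queue: written by a request coroutine, read by the
 * processing coroutine that corresponds to it one-to-one. */
struct sw_queue {
    struct calc_request slots[SW_QUEUE_DEPTH];
    uint32_t head;   /* next slot to write */
    uint32_t tail;   /* next slot to read  */
};

int sw_queue_push(struct sw_queue *q, const struct calc_request *r)
{
    uint32_t next = (q->head + 1) % SW_QUEUE_DEPTH;

    if (next == q->tail)
        return -1;               /* queue full                            */
    q->slots[q->head] = *r;      /* copies the descriptor, not the data   */
    q->head = next;
    return 0;
}

int sw_queue_pop(struct sw_queue *q, struct calc_request *out)
{
    if (q->tail == q->head)
        return -1;               /* queue empty                           */
    *out = q->slots[q->tail];
    q->tail = (q->tail + 1) % SW_QUEUE_DEPTH;
    return 0;
}
```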
In the above step S11, in each process enabled by the CPU, the CPU simultaneously runs a plurality of request threads, and these request threads receive calculation request data in parallel. For example, when one process enables two request threads, the two request threads may each receive one piece of calculation request data at the same time, so that the CPU receives two pieces of calculation request data in parallel. The calculation request data itself is stored in memory.
In the above step S12, each process is configured with a set of software cache queues. One request thread in the process may correspond to one or more software cache queues in the set, and the software cache queues corresponding to different request threads may be the same or different, as long as, at any given time, a software cache queue caches calculation request data from only one request thread and the amounts of data cached in the plurality of software cache queues can reach equilibrium.
For each request thread, the CPU may write the calculation request data received by that request thread into the software cache queue(s) corresponding to that request thread; that is, a request thread writes the logical address or physical address of the calculation request data it has received into its corresponding software cache queue(s). In this way the calculation request data is written into the plurality of software cache queues in a balanced manner, i.e., the data volumes written into the individual software cache queues are balanced.
For example, the request thread a corresponds to the software cache queue a1 and the software cache queue a2, the request thread b corresponds to the software cache queue b1 and the software cache queue b2, and after receiving calculation request data, the request thread a writes the received calculation request data into the software cache queue a1 and the software cache queue a2; after receiving the calculation request data, the request thread b writes the received calculation request data into the software cache queue b1 and the software cache queue b2.
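Continuing the sw_queue sketch above, one illustrative way for a request thread to keep its writes balanced across the software cache queues it owns is a simple round-robin cursor; the balancing policy itself is an assumption, since the patent only requires the cached data volumes to stay balanced.
```c
/* The set of software cache queues owned by one request thread. */
struct queue_set {
    struct sw_queue **queues;   /* queues corresponding to this thread */
    int               count;
    int               next;     /* round-robin cursor                  */
};

/* Enqueue the address of one piece of calculation request data into the
 * next queue in round-robin order, keeping the queues roughly balanced. */
static int enqueue_balanced(struct queue_set *s, const struct calc_request *r)
{
    struct sw_queue *q = s->queues[s->next];

    s->next = (s->next + 1) % s->count;
    return sw_queue_push(q, r);
}
```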
In the embodiment of the application, one or more request coroutines can be started in one request thread, and a request coroutine is a coroutine that performs calculation on calculation request data. Steps S11 and S12 above, described as implemented based on request threads, can equally be understood as implemented based on request coroutines.
In the above step S13, for each software cache queue, the processing coroutine corresponding to that software cache queue reads the calculation request data cached in the queue and sends the read calculation request data to the hardware accelerator card. The hardware accelerator card receives the calculation request data sent by the CPU, performs calculation on it to obtain a calculation result, and feeds the calculation result back to the corresponding request thread in the CPU. For example, when the calculation request data is erasure coding calculation request data, the hardware accelerator card calculates the check code of the erasure coding calculation request data as the calculation result; when the calculation request data is deduplication calculation request data, the hardware accelerator card calculates a fingerprint (such as a hash value) of the deduplication calculation request data as the calculation result. The calculation result is not limited herein.
In some embodiments, a plurality of request coroutines are enabled in each request thread, each request coroutine corresponding to a preset type of calculation request data, and each software cache queue corresponding to a preset type of calculation request data. The numbers of request coroutines enabled in different request threads can be the same or different; the preset types may include, but are not limited to, erasure coding calculation request data, deduplication calculation request data, compression calculation request data, and the like.
For example, two request threads are started in one process. Three request coroutines are enabled in one of the request threads and correspond to erasure coding calculation request data, deduplication calculation request data and compression calculation request data respectively; that is, the three request coroutines in that request thread are used to perform calculation on erasure coding calculation request data, on deduplication calculation request data, and on compression calculation request data, respectively. Two request coroutines are enabled in the other request thread and correspond to erasure coding calculation request data and compression calculation request data respectively; that is, the two request coroutines in that request thread are used to perform calculation on erasure coding calculation request data and on compression calculation request data, respectively. In addition, the process is configured with five software cache queues, each request coroutine corresponds to one software cache queue, and each software cache queue corresponds to one preset type of calculation request data.
In the embodiment of the application, the CPU can process various preset types of calculation request data, one preset type of calculation request data is cached in one software cache queue, confusion of different types of calculation request data is avoided, and accuracy of a calculation result is ensured. To further ensure accuracy of the calculation result, the request coroutines are in one-to-one correspondence with the software cache queues.
Based on the plurality of request coroutines enabled in each request thread, referring to fig. 2, a second flowchart of a data processing method according to an embodiment of the present application may include the following steps.
Step S21, through a plurality of request coroutines, the calculation request data of a preset type is received in parallel.
Step S22, through each request coroutine, the received calculation request data is written into the software cache queue corresponding to the preset type.
Step S23, through the processing coroutine corresponding to each software cache queue, the calculation request data cached in the software cache queue is sent to the hardware accelerator card, so that the hardware accelerator card performs calculation on the received calculation request data to obtain a calculation result, the software cache queues being in one-to-one correspondence with the processing coroutines. This is the same as step S13 described above.
In the technical scheme provided by the embodiment of the application, the CPU enables a plurality of request coroutines in a request thread to receive calculation request data of different preset types, and can enable another request coroutine to process calculation request data when one request coroutine is blocked, so that task scheduling is optimized, the CPU runs more efficiently, and maximum offloading efficiency is achieved. In addition, the plurality of request coroutines receive calculation request data of different types in parallel, so that the erasure coding, deduplication and compression calculation tasks that are common in a storage system can all be offloaded to the hardware accelerator card.
In the above step S21, at most one of the plurality of request coroutines in each request thread is in the awake state at a time, and the other request coroutines are in the waiting (wait) state. The request coroutines that are in the awake state in different request threads may be the same or different. A request coroutine is a piece of code executed by the CPU, and enabling a plurality of request coroutines in one request thread means that a plurality of pieces of code are enabled in that request thread. A request coroutine being in the awake state means that the CPU has entered the request coroutine and uses it to perform calculation; the request coroutine in the awake state receives calculation request data of its corresponding preset type. For example, if a request coroutine in the awake state is used to perform calculation on compression calculation request data, then that request coroutine receives compression calculation request data. A request coroutine being in the waiting state means that the CPU has jumped out of the request coroutine and the request coroutine is dormant; a request coroutine in the waiting state does not consume the CPU, which reduces CPU resource occupation. Here, the process of putting one coroutine into the waiting state and entering another coroutine, i.e., switching the code run by the CPU from the code of one coroutine to that of another, may be referred to as context switching.
Through the request coroutines in the awake state in the plurality of request threads, the CPU receives calculation request data of the preset types in parallel.
In the above step S22, for each request coroutine, after receiving calculation request data, the request coroutine writes the received calculation request data into the software cache queue corresponding to that request coroutine, that is, into the software cache queue corresponding to the preset type.
For example, a request thread a and a request thread b are enabled in one process, a request coroutine a1 and a request coroutine a2 are enabled in request thread a, and a request coroutine b1 and a request coroutine b2 are enabled in request thread b. Request coroutine a1 and request coroutine b1 correspond to calculation request data of type 1, and request coroutine a2 and request coroutine b2 correspond to calculation request data of type 2. Request thread a corresponds to software cache queue c1 and software cache queue c2, and request thread b corresponds to software cache queue c3 and software cache queue c4. Software cache queues c1 and c3 correspond to calculation request data of type 1, and software cache queues c2 and c4 correspond to calculation request data of type 2; that is, software cache queue c1 corresponds to request coroutine a1, software cache queue c2 corresponds to request coroutine a2, software cache queue c3 corresponds to request coroutine b1, and software cache queue c4 corresponds to request coroutine b2. Request coroutine a1 writes the received calculation request data of type 1 into software cache queue c1, request coroutine a2 writes the received calculation request data of type 2 into software cache queue c2, request coroutine b1 writes the received calculation request data of type 1 into software cache queue c3, and request coroutine b2 writes the received calculation request data of type 2 into software cache queue c4.
In some embodiments, for each request thread, the CPU sets a first request coroutine to the waiting state after the first calculation request data received by the first request coroutine has been written into the corresponding software cache queue, and wakes up a second request coroutine; if the hardware accelerator card obtains the calculation result of the first calculation request data, the CPU wakes up the first request coroutine.
In the embodiment of the present application, the first request coroutine and the second request coroutine may be any request coroutines, provided that the first request coroutine and the second request coroutine are located in the same request thread. The first calculation request data is calculation request data received by the first request coroutine, and its type is the preset type corresponding to the first request coroutine. After the first request coroutine in the awake state writes the first calculation request data into the corresponding software cache queue, the CPU may switch context and enter the second request coroutine, i.e., set the first request coroutine to the waiting state and wake up the second request coroutine. The woken second request coroutine may receive calculation request data (e.g., second calculation request data) and write the second calculation request data into the corresponding software cache queue. In this case, the request thread of the CPU does not need to wait for the calculation result of the first calculation request data, and can process other calculation request data by switching request coroutines within the request thread, which further reduces possible CPU time waste.
Before the hardware accelerator card obtains the calculation result of the first calculation request data, that is, before the CPU obtains the calculation result of the first calculation request data, the first request coroutine is kept in the waiting state until the calculation result of the first calculation request data is obtained; that is, when the calculation result of the first calculation request data is obtained, the CPU wakes up the first request coroutine. The woken first request coroutine can then further process the calculation result of the first calculation request data and receive new first calculation request data.
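The flow just described can be pictured with a minimal sketch in which the first request coroutine's waiting state is modelled by blocking on a notification semaphore that is posted only when the accelerator's result arrives; the helper functions are hypothetical and stand in for the receive, enqueue and result-handling steps.
```c
#include <semaphore.h>

struct req_coro {
    sem_t result_ready;   /* posted by the task-triggered notification */
};

/* Hypothetical helpers, declared only to keep the sketch self-contained. */
extern void receive_request(struct req_coro *c);    /* receive first calculation request data */
extern void write_to_sw_queue(struct req_coro *c);  /* enqueue only its address               */
extern void handle_result(struct req_coro *c);      /* post-process the calculation result    */

static void first_request_coroutine(struct req_coro *c)
{
    for (;;) {
        receive_request(c);
        write_to_sw_queue(c);
        /* Entering the waiting state here is the context switch that lets a
         * second request coroutine in the same thread be woken. */
        sem_wait(&c->result_ready);   /* kept waiting until the result is obtained */
        handle_result(c);             /* woken: process the result, then loop      */
    }
}
```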
In the embodiment of the application, the CPU can run a monitoring function, such as a sleep function. The CPU performs context switching on the request coroutine using the monitoring function. Specifically, the CPU may configure a linked list, where the linked list includes description information of each request coroutine, and each request coroutine corresponds to a semaphore of the notification mechanism. The monitoring function performs polling wakeup on a plurality of request coroutines in each request thread based on the linked list and the semaphore.
For example, for a request thread, the monitoring function polls and detects each request coroutine according to the order of the coroutines' description information in the linked list; when a request coroutine is detected, the monitoring function generates a semaphore of the notification mechanism and sends it to the currently detected request coroutine to wake it up; upon receiving the semaphore, the currently detected request coroutine leaves the waiting state, i.e., it is woken, and it can then receive calculation request data or perform other calculation processing such as handling a calculation result fed back by the hardware accelerator card. After the processing is finished, for example after the calculation request data has been written into the corresponding software cache queue or the calculation result has been processed, the currently detected request coroutine is set back to the waiting state. The monitoring function then continues to poll and detect the request coroutines in the order of their description information in the linked list.
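A simplified sketch of this polling wake-up is shown below, with the linked list of coroutine description information and the per-coroutine notification semaphore modelled directly; the field names, the yielded back-channel, and the pending flag used to skip coroutines that are waiting for an accelerator result are all assumptions.
```c
#include <semaphore.h>
#include <stddef.h>

struct coro_desc {
    sem_t             wake;     /* notification semaphore posted to wake it up      */
    sem_t             yielded;  /* posted when it returns to the waiting state      */
    int               pending;  /* nonzero while it waits for an accelerator result */
    struct coro_desc *next;     /* linked list of coroutine description info        */
};

/* The monitoring ("sleep") function: walk the linked list in order, wake
 * each coroutine that is not waiting for a result, and move on once that
 * coroutine has finished its step and gone back to waiting. */
static void monitor_poll(struct coro_desc *head)
{
    for (;;) {
        for (struct coro_desc *c = head; c != NULL; c = c->next) {
            if (c->pending)
                continue;            /* no idle polling on a blocked coroutine */
            sem_post(&c->wake);      /* wake the currently detected coroutine  */
            sem_wait(&c->yielded);   /* resume polling after it yields again   */
        }
    }
}
```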
In the embodiment of the application, for the first request coroutine that has written the first calculation request data into the corresponding software cache queue, the CPU can wake up the first request coroutine through the task-triggered notification mechanism: when the calculation result of the first calculation request data is obtained, the CPU generates a semaphore to wake up the first request coroutine; otherwise, the CPU keeps the first request coroutine in the waiting state and the monitoring function does not poll that request coroutine. This scheme effectively prevents idle polling and further reduces CPU resource occupation. In addition, in this scheme, offloading adopts multiple request threads, multiple request coroutines and asynchronous submission of calculation results, which minimizes processing latency and improves CPU utilization efficiency.
In some embodiments, to improve the processing efficiency of calculation request data of a certain preset type (e.g., a first preset type), the CPU may keep the request coroutine corresponding to the first preset type in the awake state. In one example, take the request coroutine corresponding to the first preset type as a third request coroutine. When the data volume of the calculation request data of the first preset type is greater than a preset threshold, there is a large amount of calculation request data of the first preset type, and the CPU may keep the third request coroutine in the awake state. When the data volume of the calculation request data of the first preset type is less than or equal to the preset threshold, the CPU may wake up the request coroutines in the polling wake-up manner described above.
In the embodiment of the application, the third request coroutine is always kept in the awake state; once the hardware accelerator card has obtained the calculation result of the calculation request data of the first preset type, the third request coroutine can immediately obtain the calculation result from the hardware accelerator card, which reduces the time spent on state switching and improves the processing efficiency of the calculation request data of the first preset type.
In some embodiments, referring to fig. 3, a third flowchart of a data processing method according to an embodiment of the present application may include the following steps.
In step S31, calculation request data is received in parallel by a plurality of request threads. The same as in step S11 described above.
Step S32, the received calculation request data is uniformly written into a plurality of software cache queues through a plurality of request threads. The same as in step S12 described above.
Step S33, each processing coroutine is polled and woken, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue. If not, i.e., no calculation request data is cached in the software cache queue corresponding to the currently woken processing coroutine, step S34 is executed; if yes, i.e., calculation request data is cached in the software cache queue corresponding to the currently woken processing coroutine, step S35 is executed.
Step S34, the currently woken processing coroutine is set to the waiting state and the next processing coroutine is woken, and step S33 continues to be executed.
Step S35, the currently woken processing coroutine is kept in the awake state, so that it sends the calculation request data cached in the corresponding software cache queue to the hardware accelerator card.
In step S33 described above, the CPU may run a monitoring function, such as a sleep function. The CPU performs context switching of the processing coroutines using the monitoring function. Specifically, the CPU may configure a linked list that includes the description information of each processing coroutine, and each processing coroutine corresponds to a semaphore of the notification mechanism. The monitoring function performs polling wake-up of the plurality of processing coroutines based on the linked list and the semaphores. The unit of polling wake-up is one processing thread.
For example, for a processing thread, the monitoring function polls and detects each processing coroutine according to the order of the coroutines' description information in the linked list; when a processing coroutine is detected, the monitoring function generates a semaphore of the notification mechanism and sends it to the currently detected processing coroutine to wake it up; upon receiving the semaphore, the currently detected processing coroutine leaves the waiting state, i.e., it is woken, and it detects whether calculation request data is cached in its corresponding software cache queue.
If no calculation request data is cached in the corresponding software cache queue, which indicates that the currently woken processing coroutine has no calculation task, step S34 is executed: the currently detected processing coroutine is set to the waiting state, and the monitoring function continues to poll and detect the processing coroutines according to the order of their description information in the linked list and wakes the next processing coroutine. If calculation request data is cached in the corresponding software cache queue, which indicates that the currently woken processing coroutine has a calculation task, step S35 is executed: the CPU keeps the currently woken processing coroutine in the awake state, and the currently woken processing coroutine then sends the calculation request data cached in the corresponding software cache queue to the hardware accelerator card.
In the technical scheme provided by the embodiment of the application, the CPU wakes up the processing coroutines using a polling mechanism. In this scheme, if no calculation request data is cached in a software cache queue, the CPU sets the corresponding processing coroutine to the waiting state; that processing coroutine then does not consume the CPU, which reduces CPU resource occupation. If calculation request data is cached in a software cache queue, the CPU keeps the corresponding processing coroutine in the awake state, so that the processing coroutine can read the calculation request data cached in the software cache queue and send it to the hardware accelerator card, and the hardware accelerator card then performs calculation on the calculation request data.
In the embodiment of the application, since the plurality of request threads receive calculation request data in parallel, calculation request data can be cached in multiple software cache queues; correspondingly, the plurality of processing coroutines are put into the awake state in turn, and the calculation request data cached in the corresponding software cache queues is sent to the hardware accelerator card. Although switching between processing coroutines introduces some extra time consumption, it enables the hardware accelerator card to process the next piece of calculation request data as soon as it finishes one, so the hardware accelerator card is always busy, the offloading efficiency of calculation tasks is improved, possible CPU time waste is reduced, and the overall storage performance of the device is further improved.
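Continuing the sw_queue sketch above, the send-side decision of steps S33-S35 can be summarised as follows; submit_to_accel() is a hypothetical stand-in for the DMA submission to the hardware accelerator card, not the patent's interface.
```c
/* Hypothetical DMA submission of one cached request to the accelerator card. */
extern void submit_to_accel(const struct calc_request *r);

/* One polling step of a processing coroutine on its own software cache
 * queue.  Returns 1 if it had work and should stay awake (S35), or 0 if the
 * queue was empty, so it should be set to the waiting state and the next
 * processing coroutine woken (S34). */
static int processing_coroutine_step(struct sw_queue *q)
{
    struct calc_request r;

    if (sw_queue_pop(q, &r) != 0)
        return 0;                        /* S34: nothing cached, go wait      */

    do {
        submit_to_accel(&r);             /* S35: send cached requests to card */
    } while (sw_queue_pop(q, &r) == 0);

    return 1;
}
```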
In some embodiments, referring to fig. 4, a fourth flowchart of a data processing method according to an embodiment of the present application may include the following steps.
In step S41, calculation request data is received in parallel by a plurality of request threads. The same as in step S11 described above.
Step S42, the received calculation request data is uniformly written into a plurality of software cache queues through a plurality of request threads. The same as in step S12 described above.
In step S43, the processing coroutines are polled and woken in turn, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue. If not, that is, no calculation request data is cached in the software cache queue corresponding to the currently woken processing coroutine, step S44 is executed; if yes, that is, calculation request data is cached in the software cache queue corresponding to the currently woken processing coroutine, step S45 is executed. This is the same as step S33 described above.
Step S44: the currently woken processing coroutine is set to the waiting state, the next processing coroutine is woken, and execution returns to step S43. This is the same as step S34 described above.
Step S45: the currently woken processing coroutine is kept in the awake state, so that the calculation request data cached in its corresponding software cache queue is sent to the hardware accelerator card. This is the same as step S35 described above.
Step S46: each processing coroutine is polled and woken, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data. If not, that is, the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, step S47 is executed; if yes, that is, the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, step S48 is executed.
Step S47: the currently woken processing coroutine is set to the waiting state, the next processing coroutine is woken, and execution returns to step S46.
Step S48: the currently woken processing coroutine is kept in the awake state, so that the calculation result of the corresponding calculation request data is returned to the corresponding request thread.
In the above step S46, after step S45 is executed, that is, after the CPU has sent the calculation request data cached in the corresponding software cache queue to the hardware accelerator card, the CPU polls and wakes up each processing coroutine. The process by which the CPU polls and wakes each processing coroutine may refer to the polling detection process of step S43, the difference being that when step S43 is performed the calculation request data has not yet been sent to the hardware accelerator card, whereas when step S46 is performed the calculation request data has already been sent to the hardware accelerator card. That is, before the calculation request data is sent to the hardware accelerator card, the CPU polls and wakes up each processing coroutine, and the woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue; after the calculation request data has been sent to the hardware accelerator card, the woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data.
If the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, the currently woken processing coroutine can generate a semaphore based on the task-triggered notification mechanism and send the semaphore to the corresponding request thread to wake the request coroutine, and the woken request coroutine can then obtain the calculation result from the hardware accelerator card.
In the embodiment of the application, the CPU wakes up the processing coroutines by polling, which makes it convenient to check for calculation results and return them quickly to the request threads, thereby ensuring that calculation tasks are offloaded successfully and efficiently.
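For illustration only, the following C sketch shows the result-polling pass of steps S46 to S48 under the same assumptions; accel_result_ready, notify_request_coroutine and the struct fields are hypothetical names, not the actual interface of any accelerator card.

```c
/* Sketch of polling for results after requests have been submitted.
 * All names are hypothetical, introduced only to illustrate the control flow. */
#include <stdbool.h>
#include <stddef.h>

struct request_co;                                /* opaque handle of a request coroutine */

typedef struct proc_co {
    struct proc_co    *next;
    void              *inflight_req;              /* request already sent to the accelerator card */
    struct request_co *requester;                 /* coroutine waiting for this result */
    bool               waiting;
} proc_co_t;

bool accel_result_ready(void *req, void **result);                   /* assumed: query the card */
void notify_request_coroutine(struct request_co *rc, void *result);  /* assumed: post the semaphore */

void monitor_poll_results(proc_co_t *head)
{
    for (proc_co_t *co = head; co != NULL; co = co->next) {
        if (co->inflight_req == NULL)
            continue;                             /* this coroutine has nothing in flight */
        co->waiting = false;                      /* wake it to check the card */
        void *result = NULL;
        if (!accel_result_ready(co->inflight_req, &result)) {
            co->waiting = true;                   /* result not ready: wait, move to the next one */
            continue;
        }
        notify_request_coroutine(co->requester, result);   /* return the result to the request thread */
        co->inflight_req = NULL;
    }
}
```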
In some embodiments, referring to fig. 5, a fifth flowchart of a data processing method according to an embodiment of the present application may include the following steps.
In step S51, calculation request data is received in parallel by a plurality of request threads. The same as in step S11 described above.
Step S52, the received calculation request data is uniformly written into a plurality of software cache queues by a plurality of request threads. The same as in step S12 described above.
Step S53: calculation request data of a preset data amount is obtained from each software cache queue through the processing coroutine corresponding to that software cache queue.
In step S53, the preset data amount may be set according to the actual situation, for example, to the total amount of data transferred in a single transfer by one DMA interface corresponding to the process, but this is not limiting. For each software cache queue, the CPU, through the processing coroutine corresponding to the queue, cuts calculation request data whose data amount is large and merges calculation request data whose data amount is small, so as to obtain calculation request data of the preset data amount. The CPU then executes step S54, sending the obtained calculation request data of the preset data amount to the hardware accelerator card through the processing coroutine corresponding to the software cache queue.
Step S54, the acquired calculation request data is sent to the hardware accelerator card.
In the technical scheme provided by the embodiment of the application, the CPU can reasonably merge the calculation request data in a software cache queue to reduce the switching of processing coroutines and the polling of calculation results, and can reasonably cut the calculation request data in the software cache queue, so that the switching and polling time of each processing coroutine is relatively balanced, thereby improving the processing efficiency of calculation tasks.
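For illustration only, the following C sketch shows one possible way to shape queued requests to a preset data amount before a single submission; PRESET_BYTES and all helper functions are assumptions introduced for the example, not part of the embodiment.

```c
/* Sketch of shaping queued requests to the preset data amount: small requests
 * are merged into one batch, an oversized request is cut and its remainder is
 * pushed back. All names and constants are hypothetical. */
#include <stddef.h>

#define PRESET_BYTES (64 * 1024)   /* e.g. the amount one DMA transfer carries; an assumption */
#define MAX_BATCH    32

struct fifo;                       /* opaque software cache queue type (assumed) */

typedef struct req {
    struct req *next;
    size_t      len;               /* payload size in bytes */
} req_t;

req_t *fifo_pop(struct fifo *q);                        /* assumed: NULL when the queue is empty */
void   fifo_push_front(struct fifo *q, req_t *r);       /* assumed: return remainder to the queue */
req_t *req_split(req_t *r, size_t head_bytes);          /* assumed: cut off the first head_bytes */
void   dma_submit(req_t **batch, int n, size_t bytes);  /* assumed: one transfer to the card */

void shape_and_submit(struct fifo *q)
{
    req_t *batch[MAX_BATCH];
    int    n = 0;
    size_t total = 0;

    while (n < MAX_BATCH && total < PRESET_BYTES) {
        req_t *r = fifo_pop(q);
        if (r == NULL)
            break;                                      /* queue drained */
        if (r->len > PRESET_BYTES - total) {
            req_t *head = req_split(r, PRESET_BYTES - total);  /* cut a large request */
            fifo_push_front(q, r);                      /* remainder goes back for the next round */
            r = head;
        }
        batch[n++] = r;                                 /* small requests are merged into the batch */
        total += r->len;
    }
    if (n > 0)
        dma_submit(batch, n, total);                    /* send roughly the preset data amount */
}
```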
The following describes in detail the data processing method provided by the embodiment of the present application through fig. 6 to 10.
Fig. 6 is a schematic diagram of a cluster management and device management architecture of a storage device according to an embodiment of the present application. As shown in fig. 6, the cluster management and device management architecture of the storage device includes a plurality of service modules, such as: iSCSI Extensions for RDMA (iSER), the Quick Emulator (QEMU), Remote Direct Memory Access (RDMA), a container orchestration platform (e.g., Kubernetes, K8s), the Hadoop Distributed File System (HDFS), the Hypertext Transfer Protocol (HTTP), Internet Small Computer System Interface target (iSCSI TGT), Forward Target (FTGT), the Common Internet File System (CIFS), the Network File System (NFS), the File Transfer Protocol (FTP), the Simple Storage Service (S3), an object storage protocol (e.g., Swift), block value-added services, the Reliable Autonomic Distributed Object Store Block Device (RBD), file value-added services, a distributed storage file system (e.g., the Ceph File System, CephFS) (user state), the metadata service (Ceph Metadata Server, MDS), the Reliable Autonomic Distributed Object Store Gateway (RGW), message passing (messenger), distributed caching (cache), redirect-on-write (ROW), garbage collection (Garbage Collector, GC), snapshot, deduplication, compression, database management systems (e.g., PostgreSQL, PG), cache pool, tier, multiple copies, EC, reconfiguration, a storage engine (e.g., BlueStore), a key-value engine (e.g., RocksDB), solid state disk cache (SSD cache), the OSD unified process, unified memory management, cross-site active-active, and monitoring (monitor), etc. The storage devices herein are distributed storage devices.
A DPU/FPGA hardware card is added to the architecture of the storage device to implement hardware acceleration and offload tasks from the CPU; for example, erasure coding and compression calculation services are offloaded onto the DPU/FPGA hardware card.
Fig. 7 is a schematic structural diagram of a storage device according to an embodiment of the present application. The storage device shown in fig. 7 includes a CPU and an FPGA hardware card; fig. 7 merely takes an FPGA hardware card as an example of the hardware accelerator card, which is not limiting. As shown in fig. 7, the CPU enables a plurality of OSD processes, each of which starts one or more EC threads (i.e., request threads) for performing EC calculation. The CPU also enables a DSE process, in which a deduplication thread (i.e., a request thread) and a compression thread (i.e., a request thread) are started; the deduplication thread is used for performing deduplication calculation, and the compression thread is used for performing compression calculation. In fig. 7, each request thread performs only one type of calculation, which is not limiting. In addition, the types and number of processes enabled in the CPU and the number of request threads in a process are not limited.
The CPU and the FPGA hardware card are connected through the high-speed serial computer expansion bus standard Peripheral Component Interconnect Express (PCIe), and hardware interaction is realized through a plurality of DMA channels supported by the FPGA hardware card, such as DMA channels 0-127 in fig. 7. The number of DMA channels supported by the FPGA hardware card in fig. 7 is 128 by way of example and is not limiting. A DMA channel may also be referred to as a DMA interface.
All DMA interfaces are distributed to the processes according to a preset specification, and an independent DPDK thread (i.e., a processing thread) is started in each process, so that the numbers of DMA interfaces corresponding to the processes are balanced. This solves the problem that a one-to-one correspondence between each calculation task and a DMA interface cannot be established, since the numbers of processes and threads in the storage device may depend on the hardware environment. In addition, each process is provided with software cache queues, which are used for caching calculation request data to be sent to the FPGA hardware card.
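For illustration only, the following self-contained C sketch shows a round-robin assignment that keeps the per-process DMA interface counts balanced; the channel and process counts are illustrative values, not taken from the embodiment.

```c
/* Distribute DMA interfaces evenly over processes so that any two processes
 * differ by at most one interface. Counts here are illustrative assumptions. */
#include <stdio.h>

#define NUM_DMA_CHANNELS 128
#define NUM_PROCESSES    6        /* e.g. several OSD processes plus one DSE process */

int main(void)
{
    int owner[NUM_DMA_CHANNELS];
    int count[NUM_PROCESSES] = {0};

    for (int ch = 0; ch < NUM_DMA_CHANNELS; ch++) {
        owner[ch] = ch % NUM_PROCESSES;    /* round-robin keeps the counts balanced */
        count[owner[ch]]++;
    }

    for (int p = 0; p < NUM_PROCESSES; p++)
        printf("process %d owns %d DMA interfaces\n", p, count[p]);
    return 0;
}
```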
The CPU sends (tx) data to the FPGA hardware card, or receives (rx) data returned by the FPGA hardware card, through a DPDK thread and a DMA interface. After receiving the calculation request data through the DMA interface, the FPGA hardware card performs the calculation according to its internal hardware calculation logic, such as EC, deduplication, and compression calculation logic, to obtain a calculation result.
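For illustration only, the following C sketch shows one tx/rx exchange over a DMA channel; dma_tx and dma_rx are generic placeholders rather than the DPDK API or any vendor driver interface.

```c
/* Sketch of one submit/poll exchange with the accelerator card over a DMA
 * channel. The DMA helpers are hypothetical placeholders. */
#include <stdbool.h>
#include <stddef.h>

int dma_tx(int channel, const void *buf, size_t len);              /* assumed: enqueue to the card */
int dma_rx(int channel, void *buf, size_t cap, size_t *received);  /* assumed: non-blocking poll */

bool offload_once(int channel, const void *req, size_t req_len,
                  void *result, size_t result_cap)
{
    if (dma_tx(channel, req, req_len) != 0)
        return false;                        /* submission failed */

    size_t got = 0;
    /* In the real flow the processing coroutine would yield here and be woken
     * by polling instead of spinning in place. */
    while (dma_rx(channel, result, result_cap, &got) != 0 || got == 0)
        ;
    return true;
}
```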
In the technical scheme provided by the embodiment of the application, the hardware accelerator card performs hardware offloading of the erasure coding, deduplication, and compression calculation tasks of the CPU. Unified processing interfaces and flows for erasure coding, deduplication, and compression calculation can be defined in software to support rapid adaptation of new hardware accelerator cards, and a dedicated software method for interface scheduling and flow management can be designed, so that the CPU works more efficiently and device performance is improved. By combining a mature accelerator-card hardware offloading method with the data processing method provided by the embodiment of the application, an efficient task scheduling and memory access optimization algorithm is obtained, the problems of data interaction and memory access optimization between software and hardware are solved, the calculation load of the CPU is reduced, and the overall storage performance is effectively improved.
In addition, by offloading EC calculation to the hardware accelerator card, the CPU can support a larger erasure coding ratio, which solves the problem that CPU resource occupation becomes too high and performance is affected when the erasure coding ratio is large.
Based on fig. 6 to fig. 7, in the technical solution provided in the embodiment of the present application, the data processing flow of the request thread may be shown in fig. 8, the data processing flow of the processing coroutine may be shown in fig. 9, and the hardware interaction processing flow may be shown in fig. 10.
As shown in fig. 8, taking one process enabled by the CPU as an example, n request threads, namely thread 1 to thread n, are enabled in the process. Each request thread may perform EC, deduplication, or compression calculation; that is, a plurality of request coroutines may be enabled in each request thread (not shown in fig. 8). The process is also configured with software cache queues, such as FIFO queues, e.g., FIFO 1 to FIFO n in fig. 8. In fig. 8, threads 1 to n may correspond one-to-one with the FIFO queues, so that calculation request data is written into the FIFO queues in a balanced manner.
In the embodiment of the application, a caller of calculation tasks can provide a plurality of pieces of outstanding calculation request data, and threads 1 to n then receive the calculation request data in batches and write the received calculation request data into the corresponding FIFO queues in a balanced manner. Batch processing reduces processing latency as much as possible, and multiple request threads asynchronously fetch the calculation results of the calculation requests; that is, the request threads are responsible for writing calculation request data into the FIFO queues, while other threads (such as the processing thread) are responsible for interacting with the hardware accelerator card, which further reduces processing latency to a certain extent.
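For illustration only, the following C sketch shows the producer side of one request thread under the assumptions named in the comments; batch_recv, fifo_push and wait_for_result are hypothetical helpers, not part of the embodiment.

```c
/* Sketch of a request thread: receive a batch of requests, write them into
 * the FIFO queue bound to this thread, then wait for the results. */
#include <stddef.h>

#define BATCH 16

struct fifo;                                               /* software cache queue (assumed) */
int   batch_recv(void *reqs[], int max);                   /* assumed: returns count received */
void  fifo_push(struct fifo *q, void *req);                /* assumed: enqueue one request */
void *wait_for_result(void *req);                          /* assumed: blocks until woken */

void request_thread_loop(struct fifo *own_queue)
{
    void *reqs[BATCH];

    for (;;) {
        int n = batch_recv(reqs, BATCH);                   /* receive calculation requests in batch */
        for (int i = 0; i < n; i++)
            fifo_push(own_queue, reqs[i]);                 /* each thread writes to its own FIFO */
        for (int i = 0; i < n; i++)
            (void)wait_for_result(reqs[i]);                /* in the real flow, request coroutines wait asynchronously */
    }
}
```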
A plurality of processing coroutines are enabled in the processing thread, and the processing coroutines correspond one-to-one with the FIFO queues. After a request thread writes calculation request data into a FIFO queue, the CPU may wake the processing coroutine corresponding to that FIFO queue from its waiting state; that is, the processing coroutine corresponding to the FIFO queue is in the waiting state and the CPU wakes it up. The woken processing coroutine may then read the calculation request data from the FIFO queue.
In addition, after a request thread writes calculation request data into the FIFO queue, the CPU sets the request thread to the waiting state; the request thread waits until it is woken, then processes the calculation result and executes subsequent operations. After the processing coroutine obtains the calculation result of the calculation request data, it wakes the request thread from its waiting state, that is, it wakes up the request thread.
In the embodiment of the application, a plurality of request coroutines can be started in one request thread. After one request coroutine writes calculation request data into the FIFO queue, the CPU switches context before the calculation result of that data is obtained; that is, the same request thread polls and switches among its request coroutines, so that the request coroutines in one request thread are polled and woken in turn, process other calculation request data in parallel, and check for calculation results.
To avoid idle polling as much as possible, the CPU may introduce a task-triggered event notification mechanism. For example, the processing coroutine performs a wake-up operation after obtaining a calculation result: it generates the semaphore of the notification mechanism and sends it to the request coroutine, and the request coroutine wakes up after receiving the semaphore and then performs subsequent operations.
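For illustration only, the following C sketch shows such a notification implemented with a POSIX semaphore; the pending_t container is a hypothetical structure introduced for the example.

```c
/* Task-triggered notification: the requesting side blocks without consuming
 * CPU, and the processing coroutine posts the semaphore once the accelerator
 * card has produced the result. */
#include <semaphore.h>
#include <stddef.h>

typedef struct pending {
    sem_t  done;      /* notification semaphore for one outstanding request */
    void  *result;    /* filled in before the post */
} pending_t;

void pending_init(pending_t *p)
{
    sem_init(&p->done, 0, 0);     /* shared between threads of the same process, initially 0 */
    p->result = NULL;
}

/* Requesting side: submit elsewhere, then sleep here without busy-polling. */
void *pending_wait(pending_t *p)
{
    sem_wait(&p->done);           /* blocks until the processing coroutine posts */
    return p->result;
}

/* Processing-coroutine side: called once the card has returned the result. */
void pending_publish(pending_t *p, void *result)
{
    p->result = result;
    sem_post(&p->done);           /* wakes the waiting requester */
}
```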
As shown in fig. 9, a plurality of processing coroutines, namely coroutine 1 to coroutine n, are configured in one processing thread. The number of processing coroutines may be adapted to the hardware environment when the thread is initialized, and the coroutines are used to handle calculation requests and hardware interaction. Each processing coroutine corresponds to a FIFO queue and a semaphore of the notification mechanism, and the processing coroutines are polled and woken in the processing thread to judge whether calculation request data is cached in their corresponding FIFO queues.
If no calculation request data is cached in a FIFO queue, the processing coroutine corresponding to that FIFO queue enters the waiting state and does not consume CPU cycles; the CPU switches context and moves to the next processing coroutine. In other words, the processing coroutines are polled and switched within the same processing thread, that is, a plurality of processing coroutines are polled and woken in the same processing thread, until calculation request data is detected to have been written into a FIFO queue. If calculation request data is cached in a FIFO queue, the processing coroutine corresponding to that FIFO queue is kept in the awake state; it reads the calculation request data from the FIFO queue, performs hardware interaction processing, and sends the calculation request data to the hardware accelerator card. Afterwards, the CPU sets that processing coroutine to the waiting state, performs a context switch, and moves to the other processing coroutines in the processing thread.
During the interval after a processing coroutine has submitted calculation request data to the hardware accelerator card and before the hardware accelerator card obtains the calculation result, even if the request coroutine has new calculation request data to write into the FIFO queue, the request coroutine remains in the waiting state until the hardware accelerator card finishes processing the received calculation request data. After the hardware accelerator card obtains the calculation result, when the CPU polls and wakes the corresponding processing coroutine, the processing coroutine can detect the calculation result and wake the request coroutine, so that the calculation result can be processed subsequently and new calculation request data can be written into the FIFO queue.
As shown in fig. 10, in the hardware interaction flow (i.e., the interaction between the CPU and the hardware accelerator card), the CPU submits the calculation request data corresponding to each processing coroutine to the hardware accelerator card by polling and switching the processing coroutines within the same processing thread. By polling and switching the processing coroutines, the CPU queries the processing state and calculation result of the calculation request data that each processing coroutine has submitted to the hardware accelerator card. After the processing state indicates completion and the hardware accelerator card has obtained the calculation result, the CPU wakes the request coroutine through the processing coroutine; the request coroutine then writes new calculation request data into the FIFO queue and wakes the corresponding processing coroutine to process the data in the FIFO queue.
In addition, the hardware accelerator card performs quite differently when processing sequential blocks, random blocks, mixed read-write workloads, and the like. When the data volume is small, the CPU can, through the processing coroutine, perform optimization operations such as merging multiple pieces of calculation request data, so as to obtain and send data of a reasonable size (i.e., the preset data amount), which reduces processing-coroutine switching and calculation-result polling. When the data volume is large, the CPU can, through the processing coroutine, cut the written calculation request data into pieces of a reasonable size (i.e., the preset data amount) before sending, so that the switching and polling time of each processing coroutine is balanced. When the data volume is already reasonable (i.e., the preset data amount), the CPU sends the calculation request data to the hardware accelerator card as-is.
In the technical scheme provided by the embodiment of the application, combining multiple threads with multiple coroutines keeps the hardware accelerator card in a busy state; within a single thread, execution also switches among coroutines, and the CPU always performs batch task scheduling, which improves task processing efficiency. When user requests (i.e., calculation request data) are few, blocking without CPU consumption is achieved through the semaphore or task notification mechanism, which avoids idle running of the CPU to the greatest extent.
Corresponding to the above data processing method embodiment, the embodiment of the present application further provides a data processing device, which is shown in fig. 11. The data processing apparatus is applied to a CPU, the CPU enables at least one process, and enables a plurality of request threads in each process, and the data processing apparatus includes:
a receiving module 111, configured to receive, in parallel, calculation request data through a plurality of request threads;
A writing module 112, configured to uniformly write the received calculation request data into a plurality of software cache queues through a plurality of request threads;
and a sending module 113, configured to send the calculation request data cached in each software cache queue to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card calculates the received calculation request data to obtain a calculation result, where the software cache queues correspond one-to-one with the processing coroutines.
In the technical scheme provided by the embodiment of the application, the CPU enables a plurality of request threads and a plurality of processing coroutines to offload calculation tasks. After the CPU receives calculation request data in parallel and in batches through the plurality of request threads, the calculation request data is written into the software cache queues in a balanced manner, and the calculation request data cached in the software cache queues is sent to the hardware accelerator card through the plurality of processing coroutines. In this way the tasks are offloaded onto the hardware accelerator card, and the CPU only needs to schedule and manage the tasks, which saves CPU computing resources, reduces the CPU calculation load, makes the CPU more efficient, and improves the overall storage performance of the device.
In addition, the CPU receives calculation request data in batches through a plurality of request threads in parallel. In this case, even if one or more request threads are blocked, calculation request data is still cached in the software cache queues by the unblocked request threads, and correspondingly, the calculation request data cached in the software cache queues continues to be sent to the hardware accelerator card by the plurality of processing coroutines, so that the hardware accelerator card is always in a busy state, possible CPU idle time is reduced, and the overall storage performance of the device is further improved.
In addition, the software cache queue is a logical queue: it actually caches the logical or physical address of the data, and no data copy is made. Using the software cache queue therefore reduces data copying in memory, and the CPU can complete the offloading of a calculation task while storing the calculation request data in memory only once, which optimizes the memory access process and further improves the overall storage performance of the device.
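For illustration only, the following C sketch shows a logical queue that stores only data addresses, so no payload copy is made when a request is queued; the ring layout and names are assumptions introduced for the example.

```c
/* A logical (address-only) queue: each slot holds a pointer to the request
 * data rather than a copy of the payload. */
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 1024

typedef struct addr_queue {
    void    *slots[QUEUE_DEPTH];   /* pointers only, zero-copy */
    size_t   head, tail;
} addr_queue_t;

bool addr_queue_push(addr_queue_t *q, void *data_addr)
{
    size_t next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return false;              /* queue full */
    q->slots[q->tail] = data_addr; /* enqueue the address only */
    q->tail = next;
    return true;
}

bool addr_queue_pop(addr_queue_t *q, void **data_addr)
{
    if (q->head == q->tail)
        return false;              /* queue empty */
    *data_addr = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    return true;
}
```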
In some embodiments, a plurality of request coroutines are enabled in each request thread, each request coroutine corresponds to a preset type of calculation request data, and each software cache queue corresponds to a preset type of calculation request data;
The receiving module 111 may specifically be configured to receive, in parallel, calculation request data of a preset type through a plurality of request coroutines;
The writing module 112 may be specifically configured to write, through each request coroutine, the received calculation request data into the software cache queue corresponding to the preset type.
In some embodiments, the writing module 112 may be further configured to, for each request thread, set a first request coroutine to the waiting state, where first calculation request data received by the first request coroutine has been written into the corresponding software cache queue; wake up a second request coroutine; and, if the calculation result of the first calculation request data is obtained, wake up the first request coroutine.
In some embodiments, the sending module 113 may be specifically configured to:
poll and wake up each processing coroutine, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue;
if no calculation request data is cached in the corresponding software cache queue, set the currently woken processing coroutine to the waiting state and wake up the next processing coroutine;
if calculation request data is cached in the corresponding software cache queue, keep the currently woken processing coroutine in the awake state, so that the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card.
In some embodiments, the sending module 113 may also be configured to:
after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, poll and wake up each processing coroutine, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data;
if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, set the currently woken processing coroutine to the waiting state and wake up the next processing coroutine;
if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, keep the currently woken processing coroutine in the awake state, so that the calculation result of the corresponding calculation request data is returned to the corresponding request thread.
In some embodiments, the sending module 113 may be specifically configured to:
obtain calculation request data of a preset data amount from each software cache queue through the processing coroutine corresponding to that software cache queue;
and sending the acquired calculation request data to the hardware accelerator card.
In some embodiments, each thread in at least one process shares the same CPU core.
In some embodiments, the at least one process is an OSD process or a DSE process; the plurality of processing coroutines are located in a DPDK thread; the plurality of request threads are threads for performing erasure coding, deduplication, or compression calculation; and the hardware accelerator card is an FPGA hardware card or a DPU.
In some embodiments, the CPU is connected with the hardware accelerator card through a plurality of DMA interfaces, and each process is configured with a fixed number of DMA interfaces, so that the difference between the numbers of DMA interfaces corresponding to any two processes is smaller than or equal to a preset threshold.
The embodiment of the application also provides a CPU for executing the steps of any data processing method.
The embodiment of the application also provides an electronic device, namely the storage device in fig. 6, as shown in fig. 12, including a CPU 121 and a hardware accelerator card 122; CPU 121 enables at least one process, each process having multiple requesting threads enabled therein;
The CPU 121 receives calculation request data in parallel through the plurality of request threads, writes the received calculation request data into a plurality of software cache queues in a balanced manner through the plurality of request threads, and sends the calculation request data cached in each software cache queue to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, where the software cache queues correspond one-to-one with the processing coroutines;
The hardware accelerator card 122 calculates the received calculation request data to obtain a calculation result.
In yet another embodiment of the present application, there is also provided a computer readable storage medium having stored therein a computer program which when executed by a processor implements the steps of any of the data processing methods described above.
In a further embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the data processing methods of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, CPU, electronic device, computer storage medium, and computer program product embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the portions of the method embodiments that are described herein.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (11)

1. A data processing method, characterized in that it is applied to a central processing unit CPU, said CPU enabling at least one process, each process enabling a plurality of request threads; the method comprises the following steps:
receiving computing request data in parallel through the plurality of request threads;
the received calculation request data are uniformly written into a plurality of software cache queues through the plurality of request threads;
sending the calculation request data cached in each software cache queue to a hardware accelerator card through a processing coroutine corresponding to that software cache queue, so that the hardware accelerator card calculates the received calculation request data to obtain a calculation result, wherein the software cache queues are in one-to-one correspondence with the processing coroutines;
The method further comprises the steps of:
after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, polling and waking up each processing coroutine, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data;
if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, setting the currently woken processing coroutine to a waiting state and waking up the next processing coroutine;
if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, keeping the currently woken processing coroutine in an awake state, so that the calculation result of the corresponding calculation request data is returned to the corresponding request thread.
2. The method of claim 1, wherein a plurality of request coroutines are enabled in each request thread, each request coroutine corresponding to a predetermined type of computation request data, each software cache queue corresponding to a predetermined type of computation request data;
the parallel receiving, by the plurality of request threads, computing request data includes:
Receiving calculation request data of a preset type in parallel through a plurality of request coroutines;
the step of uniformly writing the received calculation request data into a plurality of software cache queues through the plurality of request threads comprises the following steps:
writing, through each request coroutine, the received calculation request data into the software cache queue corresponding to the preset type.
3. The method according to claim 2, wherein the method further comprises:
for each request thread, setting a first request coroutine to a waiting state, wherein first calculation request data received by the first request coroutine has been written into a corresponding software cache queue;
Waking up a second request coroutine;
and if the calculation result of the first calculation request data is obtained, waking up the first request coroutine.
4. The method according to claim 1, wherein the sending, through the processing coroutine corresponding to each software cache queue, the calculation request data cached in the software cache queue to the hardware accelerator card comprises:
polling and waking up each processing coroutine, so that the currently woken processing coroutine detects whether calculation request data is cached in its corresponding software cache queue;
if no calculation request data is cached in the corresponding software cache queue, setting the currently woken processing coroutine to a waiting state and waking up the next processing coroutine;
if calculation request data is cached in the corresponding software cache queue, keeping the currently woken processing coroutine in an awake state, so that the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card.
5. The method according to claim 1, wherein the sending, through the processing coroutine corresponding to each software cache queue, the calculation request data cached in the software cache queue to the hardware accelerator card comprises:
obtaining calculation request data of a preset data amount from each software cache queue through the processing coroutine corresponding to that software cache queue;
and sending the acquired calculation request data to the hardware accelerator card.
6. The method of any of claims 1-5, wherein each thread in the at least one process shares a same CPU core.
7. The method of any of claims 1-5, wherein the at least one process is an object storage device (OSD) process or a data acceleration engine (DSE) process; the plurality of processing coroutines are located in a Data Plane Development Kit (DPDK) thread; the plurality of request threads are threads for performing erasure coding, deduplication, or compression calculation; and the hardware accelerator card is a field programmable gate array (FPGA) hardware card or a data processing unit (DPU).
8. The method of any of claims 1-5, wherein the CPU is connected with the hardware accelerator card through a plurality of direct memory access (DMA) interfaces; and each process is configured with a fixed number of DMA interfaces, so that the difference between the numbers of DMA interfaces corresponding to any two processes is smaller than or equal to a preset threshold.
9. A data processing apparatus, applied to a central processing unit CPU, said CPU enabling at least one process, each process having a plurality of request threads enabled therein; the device comprises:
the receiving module is used for receiving calculation request data in parallel through the plurality of request threads;
The writing module is used for uniformly writing the received calculation request data into a plurality of software cache queues through the plurality of request threads;
the sending module is used for sending the calculation request data cached in each software cache queue to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, so that the hardware accelerator card calculates the received calculation request data to obtain a calculation result, wherein the software cache queues are in one-to-one correspondence with the processing coroutines;
the sending module is further configured to:
after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, polling and waking up each processing coroutine, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data;
if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, setting the currently woken processing coroutine to a waiting state and waking up the next processing coroutine;
if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, keeping the currently woken processing coroutine in an awake state, so that the calculation result of the corresponding calculation request data is returned to the corresponding request thread.
10. A central processing unit, characterized in that it is configured to perform the method steps of any one of claims 1-8.
11. An electronic device is characterized by comprising a Central Processing Unit (CPU) and a hardware accelerator card; the CPU enables at least one process, and a plurality of request threads are enabled in each process;
the CPU receives calculation request data in parallel through the plurality of request threads; the received calculation request data is uniformly written into a plurality of software cache queues through the plurality of request threads; the calculation request data cached in each software cache queue is sent to the hardware accelerator card through the processing coroutine corresponding to that software cache queue, wherein the software cache queues are in one-to-one correspondence with the processing coroutines; after the calculation request data cached in the corresponding software cache queue is sent to the hardware accelerator card, each processing coroutine is polled and woken, so that the currently woken processing coroutine detects whether the hardware accelerator card has obtained the calculation result of the corresponding calculation request data; if the hardware accelerator card has not obtained the calculation result of the corresponding calculation request data, the currently woken processing coroutine is set to a waiting state and the next processing coroutine is woken; if the hardware accelerator card has obtained the calculation result of the corresponding calculation request data, the currently woken processing coroutine is kept in an awake state, so as to return the calculation result of the corresponding calculation request data to the corresponding request thread;
And the hardware acceleration card calculates the received calculation request data to obtain a calculation result.
CN202410321393.XA 2024-03-20 2024-03-20 Data processing method and device, central processing unit and electronic equipment Active CN117909087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410321393.XA CN117909087B (en) 2024-03-20 2024-03-20 Data processing method and device, central processing unit and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410321393.XA CN117909087B (en) 2024-03-20 2024-03-20 Data processing method and device, central processing unit and electronic equipment

Publications (2)

Publication Number Publication Date
CN117909087A CN117909087A (en) 2024-04-19
CN117909087B true CN117909087B (en) 2024-06-21

Family

ID=90682371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410321393.XA Active CN117909087B (en) 2024-03-20 2024-03-20 Data processing method and device, central processing unit and electronic equipment

Country Status (1)

Country Link
CN (1) CN117909087B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115525226A (en) * 2022-09-28 2022-12-27 新华三技术有限公司 Hardware batch fingerprint calculation method, device and equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3631629A1 (en) * 2017-05-29 2020-04-08 Barcelona Supercomputing Center-Centro Nacional de Supercomputación Managing task dependency
CN111209094A (en) * 2018-11-21 2020-05-29 北京小桔科技有限公司 Request processing method and device, electronic equipment and computer readable storage medium
CN110955535B (en) * 2019-11-07 2022-03-22 浪潮(北京)电子信息产业有限公司 Method and related device for calling FPGA (field programmable Gate array) equipment by multi-service request process
US11782720B2 (en) * 2020-11-16 2023-10-10 Ronald Chi-Chun Hui Processor architecture with micro-threading control by hardware-accelerated kernel thread
CN113778694B (en) * 2021-11-12 2022-02-18 苏州浪潮智能科技有限公司 Task processing method, device, equipment and medium
CN115048145B (en) * 2022-06-14 2023-04-25 海光信息技术股份有限公司 Information acquisition method and device and related equipment
WO2024007207A1 (en) * 2022-07-06 2024-01-11 Huawei Technologies Co., Ltd. Synchronization mechanism for inter process communication

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115525226A (en) * 2022-09-28 2022-12-27 新华三技术有限公司 Hardware batch fingerprint calculation method, device and equipment

Also Published As

Publication number Publication date
CN117909087A (en) 2024-04-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant