WO2021008257A1 - Coprocessor and data processing acceleration method therefor


Info

Publication number
WO2021008257A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
executor
message
component
executive
Prior art date
Application number
PCT/CN2020/093840
Other languages
French (fr)
Chinese (zh)
Inventor
袁进辉
成诚
Original Assignee
北京一流科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京一流科技有限公司
Publication of WO2021008257A1 publication Critical patent/WO2021008257A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to a coprocessor and, more specifically, to a method for accelerating data processing in the coprocessor.
  • GPU (Graphics Processing Unit)
  • APU (Arithmetic Processing Unit)
  • GPUs are currently used for big-data computation and deep learning because a GPU is in effect a collection of graphics functions implemented in hardware.
  • a GPU has a highly parallel structure, so it processes graphics data and complex algorithms more efficiently than a CPU.
  • when a CPU executes a computing task, it processes one piece of data at a time, with no true parallelism, whereas a GPU has multiple processor cores and can process multiple pieces of data in parallel.
  • compared with a CPU, a GPU devotes more of its hardware to ALUs for data processing rather than to data caches and flow control.
  • ALU (Arithmetic Logic Unit)
  • such a structure is well suited to large-scale data that is uniform in type and mutually independent, and to a clean computing environment that does not need to be interrupted.
  • a coprocessor, which includes an executor component, a first executor having at least two output data buffers, and a second executor. The executor component reads the data to be processed from the first executor's at least two output data buffers in turn, executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the corresponding one of the at least two output data buffers in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output, feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, while the executor component reads the data to be processed from the other of the at least two output data buffers and executes predetermined operation processing to obtain further operation result data, the first executor carries the data to be processed, via the communication channel between the coprocessor and the peripheral device, to whichever of the at least two output data buffers is vacant.
  • the number of the at least two output data buffers is three, four or five.
  • the data carried by the first executor is result data output by other coprocessors.
  • the executor component includes at least one executor.
  • each executor performs its prescribed operation when its own finite state machine reaches the execution trigger condition.
  • a data processing acceleration method for a coprocessor that includes an executor component, a first executor having at least two output data buffers, and a second executor.
  • the method includes: reading, by the executor component, the first data to be processed from the first of the first executor's at least two output data buffers and executing predetermined operation processing to obtain first operation result data, then feeding back a message to the first executor and sending the second executor a message that the output operation may proceed; placing, by the first executor, the first output data buffer storing the first data in a vacant state on receiving the feedback message from the executor component; outputting, by the second executor, the first operation result data of the executor component through the communication channel on receiving the message from the executor component, then feeding back a message to the executor component and sending the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, carrying, by the first executor, the data to be processed as new first data to the vacant first output data buffer while the executor component reads the second data from the second output data buffer and obtains second operation result data.
  • the second operation result data is likewise output to the peripheral device through the communication channel between the coprocessor and the peripheral device, after which a message is fed back to the executor component and a carry-data message is sent to the first executor; and, while the first executor carries the data to be processed as new second data to the vacant second output data buffer via that channel, the above steps begin to repeat.
  • the number of the at least two output data buffers is three, four or five.
  • the data carried by the first executor is result data output by other coprocessors.
  • the executor component includes at least one executor.
  • each executor executes its prescribed operation when its own finite state machine reaches the execution trigger condition.
  • a coprocessor including an executor component, a first executor, and a second executor, wherein the executor component reads the data to be processed from the output data buffer of the first executor, performs predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the output data buffer in a vacant state.
  • after obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output, feeds back a message to the executor component and sends the first executor a message that data may be carried; after the first executor obtains the carry-data message from the second executor, it carries the data to be processed to the vacant output data buffer via the communication channel between the coprocessor and the peripheral device.
  • two or more output data buffers are configured for the first executor, and the message the second executor sends the first executor once each interaction completes triggers the carrying of pending data into each output data buffer while the executor component simultaneously performs data operations. This greatly improves the utilization of the executor component: it only has to wait for a single data-output or interaction interval and never for pending data to be carried in, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
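  • the following is a minimal runnable sketch of that pipeline, assuming a three-thread model (executor component, first executor with two buffers, second executor), Python semaphores standing in for the messages, and illustrative stage times; none of the names or timings come from the patent itself:

```python
import queue
import threading
import time

T1, T2, T3 = 0.002, 0.005, 0.005   # copy, operate, output/exchange (seconds, assumed)
ROUNDS = 6

buf_full = [threading.Semaphore(0), threading.Semaphore(0)]  # buffer holds data
buf_free = [threading.Semaphore(1), threading.Semaphore(1)]  # buffer is vacant
carry_msg = threading.Semaphore(2)  # "data may be carried" (2: initial state fills both)
proc_go = threading.Semaphore(1)    # second executor's feedback to the component
result_q = queue.Queue()

def first_executor():
    # Carries external data into whichever buffer is vacant, once per carry message.
    for n in range(ROUNDS):
        carry_msg.acquire()
        b = n % 2
        buf_free[b].acquire()
        time.sleep(T1)               # transfer over the PCIe channel
        buf_full[b].release()

def executor_component():
    # Alternately reads the two buffers and performs the operation.
    for n in range(ROUNDS):
        proc_go.acquire()            # wait for the second executor's feedback
        b = n % 2
        buf_full[b].acquire()        # wait for data in this round's buffer
        time.sleep(T2)               # the operation itself
        buf_free[b].release()        # feedback: the buffer may be set vacant
        result_q.put(f"R{n + 1}")    # hand the result to the second executor

def second_executor():
    # Outputs each result, then signals both other actors.
    for _ in range(ROUNDS):
        result_q.get()
        time.sleep(T3)               # output / parameter interaction
        proc_go.release()            # component may start the next round
        carry_msg.release()          # first executor may carry the next data

threads = [threading.Thread(target=f)
           for f in (first_executor, executor_component, second_executor)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - t0
print(f"{ROUNDS} rounds in {elapsed * 1e3:.1f} ms "
      f"(strictly serial would be about {ROUNDS * (T1 + T2 + T3) * 1e3:.1f} ms)")
```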
  • Figure 1 shows a schematic structural diagram of a coprocessor according to the present disclosure
  • Figure 2 shows a schematic diagram of the executive body in the coprocessor according to the present disclosure
  • Figure 3 shows a timing diagram of data processing performed by the coprocessor according to the present disclosure.
  • Figure 4 shows a flowchart of a method for data processing in a coprocessor according to the present disclosure.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms; the terms merely distinguish information of the same type from one another.
  • for example, one of two possible positions may be called the first operation result or the second operation result, and likewise the other may be called the second operation result or the first operation result.
  • depending on the context, the word "if" as used herein can be interpreted as "when", "while", or "in response to determining".
  • FIG. 1 shows a schematic structural diagram of a coprocessor according to the present disclosure.
  • the coprocessor 100-1 includes an executor component 110, a first executor 120, and a second executor 130.
  • the first executor 120 includes at least two output data buffers, 121 and 122.
  • the first executor 120 may include three or more output data buffers.
  • for convenience of description, the output data buffer 121 is referred to as the first output data buffer and the output data buffer 122 as the second output data buffer.
  • the naming could equally be reversed, with buffer 121 called the second output data buffer and buffer 122 the first.
  • the first output data buffer 121 and the second output data buffer 122 store the data to be processed, which the first executor 120 carries in from outside via the communication channel PCIe between the coprocessor 100-1 and the peripheral device.
  • the coprocessor 100-1 is connected, through inter-device or inter-chip interconnect bandwidth such as PCIe or NvLink, to other coprocessors of the same kind, 100-2, 100-3, ..., 100-N, and together they can process, in parallel, computing tasks distributed by the host CPU. Consequently, during data processing the coprocessors 100-1, 100-2, 100-3, ..., 100-N exchange a large amount of operation result data and/or parameters with one another and must repeatedly carry data in from outside through an inter-chip interconnect communication channel such as PCIe.
  • a parameter is also a kind of operation result data.
  • during data processing, the coprocessor 100 generally passes through three stages: copying the data to be processed from outside (COPY), performing the data operation (PROCESS), and outputting result data to, or exchanging parameters with, the outside (EXCHANGE).
  • the times usually spent in these three stages are T1, T2, and T3 respectively.
  • the operation time traditionally dominated data processing, with data copying and transmission taking comparatively little time, so efforts to increase processing speed focused on raising the operation unit's computing speed. As the computing performance of executor components has risen sharply, the share of total time spent on the operation itself has shrunk until it is roughly comparable to the time spent on data carrying or data communication, and it has become very difficult to shorten data processing further by raising computation speed alone.
  • the total time T of one round of data processing in the coprocessor 100 is generally T = T1 + T2 + T3.
  • in a conventional deep-learning system, T1 + T3 is effectively a single whole:
  • data copying and parameter interaction proceed simultaneously and occupy the same inter-chip interconnect bandwidth,
  • so at a fixed bandwidth the total time spent is essentially no shorter than the sum of the times that bandwidth would need to handle the copying and the interaction separately; for example, copying alone might take 2 ms and parameter interaction 5 ms, yet performing both simultaneously over the same bandwidth usually takes 7 ms or more.
  • T1 and T3 are described separately here for ease of understanding.
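  • the gain at stake can be seen with a back-of-the-envelope comparison; the timings below are assumptions for illustration (the 2 ms / 5 ms figures echo the example above), not values from the patent:

```python
# Assumed per-round times (ms): T1 = copy, T2 = operate, T3 = output/exchange.
T1, T2, T3, rounds = 2, 5, 5, 100

serial = rounds * (T1 + T2 + T3)      # copy, operate, exchange strictly in sequence
overlapped = T1 + rounds * (T2 + T3)  # each round's copy hidden under an earlier T2
print(serial, overlapped)             # 1200 vs 1002
```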
  • the inventors of the present disclosure therefore attempt to shorten the combined time each round spends on data carrying and on data output or interaction.
  • the present disclosure saves part of the total round time T by giving the first executor 120 both a first output data buffer 121 and a second output data buffer 122.
  • the first output data buffer 121 and the second output data buffer 122 store data D to be processed.
  • the executor component 110 reads the data D to be processed from the first output data buffer 121 and the second output data buffer 122 in turn, and executes predetermined operation processing.
  • the executor component 110 first reads the first data D1 to be processed, stored in the first output data buffer 121, and executes a predetermined first round of operation processing to obtain the first-round operation result data R1.
  • each round of operation divides into three stages: the first stage is the copy time T1, the second the operation time T2, and the third the result-data processing time T3.
  • after the executor component 110 has performed the operation on the first data D1, it feeds back a message to the first executor 120, informing the first executor 120 that the executor component 110 has finished using the first data D1.
  • after receiving the feedback message, the finite state machine of the first executor 120 modifies its state and places the first output data buffer 121 in a vacant state, thereby preparing storage space for the data required by a later round of operations (except in the initial state).
  • the second executor 130 outputs the first-round operation result data R1 via the PCIe communication channel so as to interact with the other coprocessors 100 or with the host CPU.
  • the time taken by the second executor 130 for this output or interaction of the result data R1 is T3, and it must follow the operation time T2; it is written here, for example, as T3^1, where the digit after T denotes the stage within a round and the superscript denotes the round of data processing.
  • as soon as the second executor 130 completes the output of the operation result data, it immediately feeds back a message to the executor component 110, which can immediately proceed to the next round of operations on the basis of that feedback. At the same time, the second executor 130 immediately sends a message to the first executor 120 so that the finite state machine of the first executor 120 reaches the trigger condition for data carrying; thus, while the executor component 110 performs its next round of operations, the first executor 120 carries the data to be processed into the vacant first output data buffer 121 via the communication channel PCIe, preparing data for the round after the second round (for example, the third round).
  • specifically, once the first-round result data R1 has been output or exchanged, the second executor 130 immediately feeds back to the executor component 110 a message that the second round of operations may proceed; on receiving this feedback,
  • the executor component 110 reads the data D2 to be processed from the second output data buffer 122 and performs the second round of operations.
  • the time taken by the executor component 110 to perform the second round is T2^2.
  • at the same time, the second executor 130 also sends the first executor 120 a message that data may be carried, informing the first executor 120 that a data-carrying operation can begin immediately.
  • after the first executor 120 obtains the carry-data message from the second executor 130, its finite state machine modifies its state, triggering the first executor 120 to carry the data D3 to be processed, via the communication channel PCIe, into the vacant first output data buffer 121 for storage, thereby preparing data for the operation of the executor component 110 following the second round (for example, the third round).
  • the time taken by the first executor 120 to carry data D3 is T1^3.
  • by the time the third round starts, its required data has already been copied in, so the third round need not wait for data to be copied into free storage space.
  • the third round thus saves the copy time T1^3, shortening its actual data processing time from the conventional T1^3 + T2^3 + T3^3 to T2^3 + T3^3.
  • after the second round of operations, the executor component 110 feeds back a message to the first executor 120, notifying the first executor 120 that it has completed its use of the second data D2.
  • the first executor 120 then modifies the state of its finite state machine and places the second output data buffer 122 in a vacant state, preparing storage space for the data required by the fourth round of operations.
  • at the same time, the second executor 130 starts the result-data interaction for the second round of operations.
  • when that interaction completes, the second executor 130 feeds back a message to the executor component 110 and sends a carry-data message to the first executor 120, so that the executor component 110 immediately executes the third round of operations within the time period T2^3
  • while the first executor 120 simultaneously carries the data D4 required by the fourth round into the second output data buffer 122 within that same period T2^3.
  • the time taken for this carrying is T1^4.
  • thus T1^4 is parallel to T2^3,
  • T1^5 is parallel to T2^4,
  • and in general T1^(n+1) is parallel to T2^n.
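  • under the same assumed timings as before, a small recurrence makes the steady state visible; this is an illustrative sketch of the overlap rule, not patent text:

```python
# Finish times per round (ms) under the overlap rule T1^(n+1) || T2^n,
# with assumed stage times. "exch_done" is when a round's exchange finishes.
T1, T2, T3 = 2, 5, 5

copy_done = T1      # D1 is carried in up front; D2 occupies the other buffer
exch_done = 0.0
for n in range(1, 6):
    start = max(copy_done, exch_done)  # need this round's data and the go-ahead
    copy_done = start + T1             # the next round's carry overlaps T2^n
    exch_done = start + T2 + T3        # operate, then output/interact
    print(f"round {n} finishes at t = {exch_done:.0f} ms")
# After round 1, each further round costs only T2 + T3 = 10 ms instead of 12 ms.
```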
  • the carrying of raw data and the interaction of result data or parameters are thus performed in a time-shared manner, so the actual time the coprocessor 100 spends on data communication with peripheral devices (such as other coprocessors or CPUs) is clearly less than the execution time required when data carrying and data interaction occur simultaneously.
  • this does not mean that the present disclosure eliminates T1; T1 still exists, but runs in parallel within T2.
  • T3 according to the present disclosure is also much smaller than T3 in the prior art.
  • FIG. 2 shows a schematic diagram of the executors in the coprocessor according to the present disclosure.
  • each large dashed box represents an executor.
  • only five executors are shown in the executor network component of Figure 2.
  • in practice, the executor network component contains as many executors as the task topology, for example a neural network, has nodes.
  • Fig. 2 schematically shows the composition of each executor of the present disclosure: a message warehouse, a finite state machine, a processing component, and an output data buffer.
  • each executor appears also to contain an input data buffer, drawn with a dotted line; this is in fact an imaginary component, as explained in detail later.
  • each executor in the data processing path, such as the first executor in Figure 2, is established on the basis of a node of the task topology's neural network; its complete node attributes determine its topological relationship with its upstream and downstream executors, its message warehouse, its finite state machine, its processing method (processing component), and the cache location of the data it generates (its output data buffer).
  • suppose the first executor performs data processing whose task requires two input data: the output data of the second executor and of the fourth executor upstream of it.
  • when the second executor generates data to be output to the first executor, for example the second data, the second executor sends a data-ready message to the message warehouse of the first executor, informing the first executor
  • that the second data is already in the output data buffer of the second executor and is available, so that the first executor can read the second data at any time; the second data then remains in a state of waiting to be read by the first executor.
  • the finite state machine of the first executor modifies its state once the message warehouse obtains the second executor's message.
  • likewise, when the fourth executor generates data to be output to the first executor, for example the fourth data, the fourth executor sends a data-ready message to the message warehouse of the first executor, informing the first executor that the fourth data is already in the output data buffer of the fourth executor and is available, so that the first executor can read the fourth data at any time; the fourth data then remains in a state of waiting to be read by the first executor.
  • the finite state machine of the first executor modifies its state once the message warehouse obtains the fourth executor's message.
  • if the processing component of the first executor generated data in its previous execution of the computing task, such as the first data, that data is cached in the first executor's output data buffer, and the first executor sends its downstream executors, such as the third executor and the fifth executor, a message that the first data may be read.
  • when the third executor and the fifth executor have read the first data and used it, each feeds back a message to the first executor informing it that the first data is used up; the first executor's output data buffer is then vacant, and at this point the finite state machine of the first executor again modifies its state.
  • when the execution trigger condition is thus reached, the processing component is notified to read the second data in the output data buffer of the second executor and the fourth data in the output data buffer of the fourth executor and to execute the specified computing task, thereby generating the executor's output data, such as new first data, which is stored in the output data buffer of the first executor.
  • the finite state machine then returns to its initial state to await the next state-change cycle, and the first executor feeds back to the second executor that its use of the second data is complete.
  • the output data buffer of the second executor is thereby left in a vacant state.
  • likewise, the output data buffer of the fourth executor is left in a vacant state.
  • in this way, each executor is like a worker at a fixed-task station on a data processing path, and together they form a data processing pipeline that requires no other external instructions.
  • although each executor shown in Figure 2 appears to contain an input data buffer, in fact it does not: an executor needs no buffer to store the data it is about to use; it only needs that data to be obtainable in a readable state. When the data an executor will use is not in a specific execution state, the data remains stored in the output data buffer of the upstream executor. For visual display, the input data buffer is drawn inside each executor with a dotted line, but it does not actually exist in the executor; in other words, the output data buffer of the upstream executor is the virtual input data buffer of the downstream executor, which is why dotted lines are used for the input data buffers in Figure 2.
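  • the anatomy just described can be condensed into a short sketch; the class below is an illustrative assumption (its names and method signatures are not from the patent), with the finite state machine reduced to two sets of pending conditions and the virtual input buffer realized by reading upstream output buffers directly:

```python
class Executor:
    def __init__(self, upstream, downstream, op):
        self.upstream = upstream        # executors whose output buffers we read
        self.downstream = downstream    # executors that read our output buffer
        self.op = op                    # the processing component
        self.message_warehouse = []     # every message received, in order
        self.ready_inputs = set()       # FSM: upstream data announced as readable
        self.done_consumers = set(downstream)  # FSM: output buffer starts vacant
        self.output_buffer = None       # the only real buffer this executor owns

    def on_message(self, kind, sender):
        self.message_warehouse.append((kind, sender))
        if kind == "data_ready":        # an upstream executor produced data
            self.ready_inputs.add(sender)
        elif kind == "data_consumed":   # a downstream executor finished reading
            self.done_consumers.add(sender)
        self._maybe_fire()

    def _maybe_fire(self):
        # Execution trigger condition: every input readable, output buffer vacant.
        # (A source executor with no upstream would additionally be gated by the
        # arrival of external data; that gating is omitted in this sketch.)
        if (self.ready_inputs == set(self.upstream)
                and self.done_consumers == set(self.downstream)):
            inputs = [u.output_buffer for u in self.upstream]  # virtual input buffer
            self.output_buffer = self.op(*inputs)
            self.ready_inputs.clear()   # FSM returns to its initial state
            self.done_consumers.clear()
            for u in self.upstream:     # upstream buffers may now be vacated
                u.on_message("data_consumed", self)
            for d in self.downstream:   # downstream may now read our new data
                d.on_message("data_ready", self)
```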
  • FIG. 3 shows a timing diagram when the coprocessor according to the present disclosure performs continuous data processing.
  • in the initial state, two data D1 and D2 are stored at one time into the first output data buffer 121 and the second output data buffer 122.
  • the executor component 110 reads the data D1 in the first output data buffer 121 and executes the first round of operations, after which the second executor 130 outputs the result data R1 or exchanges it with other parallel devices.
  • on the feedback message sent by the second executor 130, the executor component 110 reads the data D2 in the second output data buffer 122 and executes the second round of operations to obtain the result data R2, while the first executor 120, on the carry-data message sent by the second executor 130, carries the external data D3 into the first output data buffer 121 during the second round's operation time, for use in the third round of data processing.
  • and so on: on the feedback message sent by the second executor 130, the executor component 110 reads the data Dn in one of the first output data buffer 121 and the second output data buffer 122 and executes the nth round of operations to obtain the result data Rn, while the first executor 120, on the carry-data message sent by the second executor 130, carries the external data Dn+1 into whichever of the two output data buffers is vacant, for use in the (n+1)th round of data processing.
  • alternatively, the first executor 120 may have only one output data buffer, for example the first output data buffer 121.
  • in that case, the executor component 110 reads the data to be processed from the output data buffer 121 of the first executor 120, executes predetermined operation processing to obtain the operation result data, and after obtaining the operation result data feeds back a message to the first executor 120, informing the first executor 120 to place the output data buffer 121 in a vacant state.
  • after obtaining the message from the executor component 110, the second executor 130 outputs the operation result data of the executor component 110 to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output (or after the parameter interaction is performed), feeds back a message to the executor component 110 and sends the first executor 120 a message that data may be carried.
  • after the first executor 120 receives the carry-data message from the second executor 130, it carries the data to be processed into the vacant output data buffer 121 via the communication channel between the coprocessor and the peripheral device. In this way, the coprocessor gives priority to the parameter interaction in the deep-learning system and only then copies new data.
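  • one way to picture this prioritization is a single shared channel that always serves pending parameter interactions before pending copies; the sketch below is an assumed illustration of that policy, not an API from the patent:

```python
import heapq

EXCHANGE, COPY = 0, 1      # lower number = served first

class SharedChannel:
    def __init__(self):
        self._queue = []
        self._seq = 0      # tie-breaker keeps FIFO order within a class

    def submit(self, priority, job):
        heapq.heappush(self._queue, (priority, self._seq, job))
        self._seq += 1

    def drain(self):
        while self._queue:
            _, _, job = heapq.heappop(self._queue)
            job()          # one transfer at a time occupies the bandwidth

channel = SharedChannel()
channel.submit(COPY, lambda: print("copy next input data"))
channel.submit(EXCHANGE, lambda: print("exchange parameters"))
channel.drain()            # prints the exchange first, then the copy
```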
  • Fig. 4 shows a flowchart of a data processing method in a coprocessor according to the present disclosure.
  • in the initial state, the first executor 120 stores the data D to be processed (for example, data D1 and D2) into the first output data buffer 121 and the second output data buffer 122.
  • the executor component 110 then reads data from one of the first output data buffer 121 and the second output data buffer 122.
  • taking the first output data buffer 121 first: the executor component 110 reads data from the first output data buffer 121 and executes the nth round of data operations to obtain the nth result data R.
  • after the executor component 110 obtains the result data R at step S415, it feeds back a message to the first executor 120 so that, at step S420, the first executor 120 places the first output data buffer 121 in a vacant state, i.e. a state into which data can be written.
  • at the same time, the executor component 110 sends a data-exchange message to the second executor 130, informing the second executor 130 that at step S425 it may perform the data exchange: it outputs the nth result data R or performs data interaction with the external CPU or with parallel coprocessors (such as GPUs).
  • after the data exchange is completed, the second executor 130 on the one hand sends a feedback message to the executor component 110, informing it that the next round of data processing may proceed, and on the other hand sends the first executor 120 a message that data may be carried in from outside.
  • at step S430, after obtaining the feedback message from the second executor 130, the executor component 110 executes the (n+1)th round of operation processing on the data D in the second output data buffer 122 and obtains the (n+1)th result data R; while this round proceeds, the first executor 120, at step S435, carries the next data D to be processed from outside into the vacant first output data buffer 121, for use in the (n+2)th round of operation processing.
  • after the executor component 110 obtains the result data R at step S430, it feeds back a message to the first executor 120 so that, at step S440, the first executor 120 places the second output data buffer 122 in a vacant state, i.e. a state into which data can be written.
  • at the same time, the executor component 110 sends a data-exchange message to the second executor 130, informing the second executor 130 that at step S445 it may perform the data exchange: it outputs the (n+1)th result data R or performs data interaction with the external CPU or with parallel coprocessors (such as GPUs).
  • after this data exchange is completed, the second executor 130 again sends a feedback message to the executor component 110, informing it that the next round of data processing may proceed, and sends the first executor 120 a message that data may be carried in from outside.
  • the executor component 110 then executes the (n+2)th round of operation processing on the data D in the first output data buffer 121 and obtains the (n+2)th result data R; while this round proceeds, the first executor 120, at step S455, carries the next data D to be processed from outside into the vacant second output data buffer 122, for use in the (n+3)th round of operation processing.
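  • the loop of Fig. 4 can be condensed as follows; copy_in, compute, and exchange are assumed placeholders for the PCIe transfer, the round's operation, and the result output/interaction, not APIs from the patent, and the single worker thread lets a buffer refill (S435/S455) overlap the next round's compute:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(rounds, copy_in, compute, exchange):
    with ThreadPoolExecutor(max_workers=1) as pool:
        buffers = [copy_in(), copy_in()]      # initial state: both buffers filled
        pending = [None, None]                # in-flight refill per buffer
        for n in range(rounds):
            b = n % 2
            if pending[b] is not None:        # refill finished during an earlier round
                buffers[b] = pending[b].result()
            result = compute(buffers[b])      # S415/S430: round-n operation
            exchange(result)                  # S425/S445: output or parameter exchange
            # S435/S455: only after the exchange does the refill start, so the
            # copy never shares bandwidth with it; it overlaps the next compute.
            pending[b] = pool.submit(copy_in)

# Example use with trivial stand-ins:
run_pipeline(4, copy_in=lambda: "D", compute=lambda d: d + "*", exchange=print)
```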
  • in addition, the first executor may include not only the two output data buffers 121 and 122 but three or more.
  • for example, four output data buffers can be provided for the first executor 120; this can be set according to actual needs.
  • alternatively, the second executor 130 may itself be part of the executor component 110, with the operation processing and the output and interaction of result data executed one after the other; although the present disclosure separates the two for convenience of description, this does not mean that their separate existence is necessary for realizing the present disclosure.
  • when the coprocessor according to the present disclosure, such as a GPU, is used in the field of big data technology and deep learning, the copying of data and the interaction of data are performed in a time-shared manner, so the simultaneous data communication load is reduced at a fixed bandwidth, thereby increasing the data communication speed.
  • specifically, the present disclosure makes full use of the fact that no bandwidth-occupying data interaction takes place during the data operation, moving the next round's data copy forward so that it occupies the bandwidth during the previous round's operation, thereby achieving time-shared bandwidth occupation for data copying and data interaction.
  • the objective of the present disclosure is thereby achieved.
  • although the executor component 110 is described as one component of the present disclosure, it may include only a single executor, or multiple executors forming a data processing path or a data processing network.
  • this specification has described a coprocessor and a method for accelerating its data processing speed according to embodiments of the present disclosure.
  • by configuring two or more output data buffers for the first executor and using messages to control the time segment in which each output data buffer's data is copied, the utilization of the executor component is greatly improved: it only has to wait for a single data-output or interaction interval and never for the copying of pending data, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
  • the purpose of the present disclosure can also be realized by running a program or a group of programs on any computing device.
  • the computing device may be a well-known general-purpose device; the purpose of the present disclosure can therefore also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product likewise constitutes the present disclosure.
  • the storage medium may be any well-known storage medium or any storage medium developed in the future.
  • each component or each step can be decomposed and/or recombined.
  • These decomposition and/or recombination should be regarded as equivalent solutions of the present disclosure.
  • the steps of the above series of processing can naturally be performed chronologically in the order described, but need not be; some steps can be performed in parallel or independently of one another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A coprocessor (100-1, 100-2), comprising an execution body assembly (110), a first execution body (120) provided with at least two output data caches (1221, 1222), and a second execution body (130). The execution body assembly (110) alternatingly reads data requiring processing from one of the at least two output data caches (1221, 1222) of the first execution body (120), executes predetermined operation processing to obtain operation result data, and feeds back a message to the first execution body (120) after obtaining the operation result data so as to inform the first execution body (120) to set said output data cache among the at least two output data caches (1221, 1222) to a vacant state; the second execution body (130) outputs the operation result data of the execution body assembly (110) by means of a communication channel after obtaining the message from the execution body assembly (110), and after the operation result data is outputted, feeds back a message to the execution body assembly (110) and sends to the first execution body (120) a message that data may be transferred; and after the first execution body (120) obtains the message from the second execution body (130) that data may be transferred, while the execution body assembly (110) reads data requiring processing for the other output data cache among the at least two output data caches (1221, 1222) of the first execution body (120) and executes predetermined operation processing to obtain operation result data after obtaining the message fed back by the second execution body (130), the first execution body (120) transfers the data requiring processing to the output data cache among the at least two output data caches (1221, 1222) that is in the vacant state by means of a communication channel between the coprocessor (100-1, 100-2) and a peripheral device.

Description

Coprocessor and data processing acceleration method therefor

Technical Field

The present disclosure relates to a coprocessor and, more specifically, to a method for accelerating data processing in the coprocessor.

Background

Among existing data processors there are, in addition to the CPU, various coprocessors, such as the GPU (Graphics Processing Unit) and the APU, that share the CPU's data processing work. GPUs, for example, are currently used for big-data computation and deep learning because a GPU is in effect a collection of graphics functions implemented in hardware. A GPU has a highly parallel structure, so it processes graphics data and complex algorithms more efficiently than a CPU. When a CPU executes a computing task it processes one piece of data at a time, with no true parallelism, whereas a GPU has multiple processor cores and can process multiple pieces of data in parallel. Compared with a CPU, a GPU devotes more of its hardware to ALUs (Arithmetic Logic Units) for data processing rather than to data caches and flow control. Such a structure is well suited to large-scale data that is uniform in type and mutually independent, and to a clean computing environment that does not need to be interrupted.
Technical Problem

However, when GPUs are applied to deep learning and big-data computation, data parallelism across multiple GPUs produces a large amount of parameter interaction between them. This parameter interaction occupies inter-device or inter-chip interconnect bandwidth such as PCIe or NvLink. At the same time, in big-data computation and deep learning, the GPUs must also continually occupy the inter-chip interconnect bandwidth to copy data. In existing deep-learning systems, parameter exchange and data copying between multiple devices or GPUs proceed simultaneously, and when they do, the two contend for bandwidth. At a fixed bandwidth, this volume of simultaneous data copying and parameter interaction slows the overall rate of data communication, delays parameter interaction among an array of coprocessors such as GPUs, and thus slows the array's overall data processing. In deep learning, parameter interaction matters more: from the standpoint of advancing the overall task, it deserves a higher priority than data copying. Existing deep-learning systems, however, neither consider the priority of parameter interaction over data copying nor propose any solution. Therefore, with inter-chip interconnect bandwidth unchanged, how to raise the rate of inter-device parameter interaction has become an urgent problem in the field of deep learning; and, on the premise of raising that rate, how to increase the data processing speed of coprocessors such as GPUs without impairing data copying is an equally urgent problem in this field.
Technical Solution

The purpose of the present disclosure is to solve at least one of the above problems and to provide at least the advantages described later. According to one aspect of the present disclosure, a coprocessor is provided that includes an executor component, a first executor having at least two output data buffers, and a second executor. The executor component reads the data to be processed from the first executor's at least two output data buffers in turn and executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the corresponding one of the at least two output data buffers in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, while the executor component, having obtained the message fed back by the second executor, reads the data to be processed from the other of the first executor's at least two output data buffers and executes predetermined operation processing to obtain further operation result data, the first executor carries the data to be processed, via the communication channel between the coprocessor and the peripheral device, to whichever of the at least two output data buffers is vacant.
According to the coprocessor of the present disclosure, the number of the at least two output data buffers is three, four, or five.

According to the coprocessor of the present disclosure, the data carried by the first executor is result data output by other coprocessors.

According to the coprocessor of the present disclosure, the executor component includes at least one executor.

According to the coprocessor of the present disclosure, each executor performs its prescribed operation when its own finite state machine reaches the execution trigger condition.
According to another aspect of the present disclosure, a data processing acceleration method is provided for a coprocessor that includes an executor component, a first executor having at least two output data buffers, and a second executor. The method includes: reading, by the executor component, the first data to be processed from the first of the first executor's at least two output data buffers and executing predetermined operation processing to obtain first operation result data, then feeding back a message to the first executor and sending the second executor a message that the output operation may proceed; placing, by the first executor, the first output data buffer storing the first data in a vacant state on receiving the feedback message from the executor component; outputting, by the second executor, the first operation result data of the executor component through the communication channel on receiving the message from the executor component, then feeding back a message to the executor component and sending the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, while the executor component, having obtained the second executor's feedback, reads the second data to be processed from the second of the at least two output data buffers and executes predetermined operation processing to obtain second operation result data, carrying, by the first executor, the data to be processed as new first data, via the communication channel between the coprocessor and the peripheral device, to the vacant first output data buffer. After obtaining the second operation result data, the executor component feeds back a message to the first executor and sends the second executor a message that the output operation may proceed; the first executor, on receiving the feedback message from the executor component, places the second output data buffer storing the second data in a vacant state; the second executor, on receiving the message from the executor component, outputs the second operation result data to the peripheral device through the communication channel between the coprocessor and the peripheral device, then feeds back a message to the executor component and sends the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, while the first executor carries the data to be processed as new second data, via the communication channel between the coprocessor and the peripheral device, to the vacant second output data buffer, the above steps begin to repeat.

According to the data processing acceleration method for a coprocessor of the present disclosure, the number of the at least two output data buffers is three, four, or five.

According to the data processing acceleration method for a coprocessor of the present disclosure, the data carried by the first executor is result data output by other coprocessors.

According to the data processing acceleration method for a coprocessor of the present disclosure, the executor component includes at least one executor.

According to the data processing acceleration method for a coprocessor of the present disclosure, each executor executes its prescribed operation when its own finite state machine reaches the execution trigger condition.
According to another aspect of the present disclosure, a coprocessor is provided that includes an executor component, a first executor, and a second executor. The executor component reads the data to be processed from the output data buffer of the first executor, performs predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the output data buffer in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, it carries the data to be processed to the vacant output data buffer via the communication channel between the coprocessor and the peripheral device.
Beneficial Effects

According to the coprocessor and its data processing acceleration method of the present disclosure, two or more output data buffers are configured for the first executor, and the message the second executor sends the first executor once each interaction completes triggers the carrying of pending data into each output data buffer while the executor component simultaneously performs data operations. This greatly improves the utilization of the executor component: it only has to wait for a single data-output or interaction interval and never for pending data to be carried in, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
Description of the Drawings

The drawings herein are incorporated into and constitute a part of this specification; they show embodiments consistent with the present disclosure and, together with the specification, serve to explain its principles.

The present disclosure is described in detail below through embodiments with reference to the drawings, in which:

Figure 1 is a schematic structural diagram of a coprocessor according to the present disclosure;

Figure 2 is a schematic diagram of the executors in a coprocessor according to the present disclosure;

Figure 3 is a timing diagram of data processing performed by a coprocessor according to the present disclosure; and

Figure 4 is a flowchart of a data processing method in a coprocessor according to the present disclosure.
本发明的实施方式Embodiments of the invention
下面结合实施例和附图对本公开做进一步的详细说明,以令本领域技术人员参照说明书文字能够据以实施。The present disclosure will be further described in detail below in conjunction with the embodiments and drawings, so that those skilled in the art can implement it with reference to the text of the description.
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。 Here, exemplary embodiments will be described in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。 The terms used in the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more associated listed items.
应当理解,尽管在本公开可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开范围的情况下,在下文中,两个可能位置之一可以被称为第一操作结果也可以被称为第二操作结果,类似地,两个可能位置的另一个可以被称为第二操作结果也可以被称为第一操作结果。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of the present disclosure, in the following, one of the two possible positions may be referred to as the first operation result or the second operation result. Similarly, the other of the two possible positions may be What is called the second operation result may also be called the first operation result. Depending on the context, the word "if" as used herein can be interpreted as "when" or "when" or "in response to determination".
In order to enable those skilled in the art to better understand the present disclosure, the present disclosure is further described in detail below with reference to the accompanying drawings and specific embodiments.
Figure 1 is a schematic structural diagram of a coprocessor according to the present disclosure. As shown in Figure 1, the coprocessor 100-1 includes an executor component 110, a first executor 120, and a second executor 130. The first executor 120 includes at least two output data buffers 121 and 122. Optionally, according to actual application requirements, the first executor 120 may include three or more output data buffers. For convenience of description, only the two output data buffers 121 and 122 are used in the specification of the present disclosure, with the output data buffer 121 referred to as the first output data buffer and the output data buffer 122 referred to as the second output data buffer. Alternatively, the output data buffer 121 may be referred to as the second output data buffer and the output data buffer 122 as the first output data buffer. The first output data buffer 121 and the second output data buffer 122 are used to store the data to be processed, which the first executor 120 carries in from outside via the communication channel PCIe between the coprocessor 100-1 and the peripheral devices.
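To make the component relationships above concrete, here is a minimal C++ sketch of a first executor owning two output data buffers. It is an illustrative reading of the structure, not the disclosed implementation, and every name in it (OutputBuffer, FirstExecutor, copy_in) is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a first executor that owns two output data
// buffers (corresponding to 121 and 122), which the executor
// component reads in turn.
struct OutputBuffer {
    std::vector<float> data;   // data waiting to be processed
    bool vacant = true;        // true once the consumer has used the data
};

struct FirstExecutor {
    std::array<OutputBuffer, 2> out;  // buffers "121" and "122"
    // Carries data in from the host or peer devices over PCIe
    // into a vacant buffer slot.
    void copy_in(std::size_t slot, const std::vector<float>& d) {
        out[slot].data = d;
        out[slot].vacant = false;
    }
};

int main() {
    FirstExecutor fe;
    fe.copy_in(0, {1.0f, 2.0f});  // fill buffer "121" (slot 0)
    fe.copy_in(1, {3.0f, 4.0f});  // fill buffer "122" (slot 1)
    return 0;
}
```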
The coprocessor 100-1 according to the present disclosure is connected to other coprocessors of the same kind 100-2, 100-3, ..., 100-N through inter-device or inter-chip interconnection bandwidth, for example PCIe or NVLink, and these coprocessors can process in parallel the computing tasks distributed by the CPU of the host. During data processing, the coprocessors 100-1, 100-2, 100-3, ..., 100-N therefore exchange a large amount of operation result data and/or parameters with one another, and must repeatedly carry data in from outside via an inter-chip communication channel such as PCIe. A parameter is itself a kind of operation result data. Accordingly, a round of data processing in the coprocessor 100 generally comprises three stages: a stage of copying (COPY) the data to be processed from outside, a data operation processing stage (PROCESS), and a stage of outputting the result data to the outside or exchanging parameters with the outside (EXCHANGE). The times spent in these three stages are denoted T₁, T₂, and T₃, respectively.
Conventionally, the operation time accounts for the larger part of the data processing process, while the time spent on copying and transmitting data is, by comparison, very short. To increase the data processing speed, the usual approach has therefore been to raise the speed at which the operation units execute operations, i.e., the computation speed, thereby reducing the operation time and hence the overall processing time. However, as the computing performance of executor components has improved dramatically, the time spent on operation processing has become an ever smaller fraction of a round of data processing and is now roughly comparable to the time spent on data carrying and data communication. It has thus become very difficult to shorten the data processing time by further raising the computation speed. For example, with the copy time T₁, the operation time T₂, and the result-data output or parameter-exchange time T₃ mentioned above, the total time T of one round of data processing in the coprocessor 100 is generally T = T₁ + T₂ + T₃. Although the copy time T₁ and the parameter-exchange time T₃ are described separately here, in a conventional deep learning system T₁ + T₃ is in practice a single whole. In other words, in conventional deep learning systems, data copying and parameter exchange proceed simultaneously and therefore contend for the inter-chip interconnection bandwidth, so that for a given bandwidth the total time spent is essentially no shorter than the sum of the times that bandwidth would need to handle the data copying and the parameter exchange separately. For example, for a given bandwidth, processing the data copy alone may take 2 milliseconds and the parameter exchange alone 5 milliseconds; with the same inter-chip bandwidth unchanged, performing the same amount of data copying and the same amount of parameter exchange simultaneously will usually take a total of 7 milliseconds or more. T₁ and T₃ are therefore described separately here only for ease of understanding.
Therefore, when the copy processing proceeds simultaneously with the output or exchange of result data, the two operations occupy the fixed-bandwidth PCIe communication channel at the same time, and because more data is then in flight, the total time T₁₃ spent on data communication may be longer than T₁ + T₃. Since raising the computation speed is the harder path, the inventors of the present disclosure instead seek to shorten the total of the data-carrying time and the data output or exchange time spent in each round of operation.
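As a worked illustration of the timing argument above (the 2 ms copy and 5 ms exchange figures are taken from the preceding paragraph, while the 10 ms operation time is an assumed, purely illustrative value):

```latex
\begin{aligned}
T &= T_1 + T_2 + T_3 = 2 + 10 + 5 = 17\ \text{ms per round (copy, operate, exchange in sequence)}\\
T_{13} &\geq T_1 + T_3 = 2 + 5 = 7\ \text{ms (copy and exchange contending for the same PCIe channel)}
\end{aligned}
```

Hiding T₁ inside the T₂ of the previous round can therefore remove at most the full 2 ms from each round, which is precisely what the buffering scheme described below sets out to do.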
To this end, the present disclosure makes it possible to save part of the total time T of each round of operation by providing the first executor 120 with the first output data buffer 121 and the second output data buffer 122. As shown in Figure 1, the first output data buffer 121 and the second output data buffer 122 store the data D to be processed. The executor component 110 reads the data D to be processed from the first output data buffer 121 and the second output data buffer 122 in turn and executes the predetermined operation processing. Specifically, the executor component 110 first reads the first data D1 stored in the first output data buffer 121 and executes the predetermined first round of operation processing to obtain the first round of operation result data R1. The time spent here is T₂¹, where the superscript identifies the operation round and the subscript the stage within a round. As described above, each round of operation is divided into three stages: the first stage takes the copy time T₁, the second stage the operation time T₂, and the third stage the result-data processing time T₃. After the executor component 110 has finished operating on the first data D1, it feeds back a message to the first executor 120, informing the first executor 120 that the executor component 110 is done using the first data D1. Upon receiving the feedback message, the finite state machine of the first executor 120 modifies its state and places the first output data buffer 121 in a vacant state, thereby preparing storage space for the data required by a following round of operation (except in the initial state).
The second executor 130 outputs the first round of operation result data via the PCIe communication channel so as to exchange it with the other coprocessors 100 or with the CPU of the host. The time taken by the second executor 130 for the output or exchange of the result data R1 is T₃, and this must follow the operation time T₂; it is denoted, for example, T₃¹, where the superscript represents the round of data processing and the subscript the stage within a round.
After the second executor 130 completes the output of the operation result data, it immediately feeds back a message to the executor component 110, and the executor component 110 can start the next round of operation immediately on the basis of the received feedback message. At the same time, the second executor 130 immediately sends a message to the first executor 120, so that the finite state machine of the first executor 120 reaches the trigger condition for data carrying; while the executor component 110 performs the next round of operation, the first executor 120 thus carries the data to be processed via the communication channel PCIe into the vacant first output data buffer 121, preparing data for the round following the second round of operation (for example, the third round).
Specifically, immediately after the first-round result data R1 has been output or exchanged, the second executor 130 feeds back to the executor component 110 a message that the second round of operation may be executed; upon receiving this feedback message, the executor component 110 reads the data D2 to be processed from the second output data buffer 122 and performs the second round of operation. The time taken by the executor component 110 to perform the second round of operation is T₂². At the same time, the second executor 130 also sends the first executor 120 a message that data may be carried, informing the first executor 120 that the data-carrying operation may be performed immediately. When the first executor 120 receives the carry-permitted message from the second executor 130, its finite state machine modifies its state, triggering the first executor 120 to perform the data-carrying operation, whereby the data D3 to be processed is carried via the communication channel PCIe into the vacated first output data buffer 121 for storage, so as to prepare data for the operation of the executor component 110 after the second round (for example, the third round).
From the above description it can be seen that the time taken by the first executor 120 to carry the data D3 is T₁³. Thus, while the second round of data operation is in progress, the data needed for the third round is copied, so there is no need to wait, before the third round of operation starts, for the required data to be copied into free storage space. The copy time T₁³ is thereby saved for the third round, shortening the actual data processing time T of the third round from the conventional T₁³ + T₂³ + T₃³ to T₂³ + T₃³.
Likewise, after the second round of operation ends, that is, after the executor component 110 has finished operating on the second data D2, the executor component 110 feeds back a message to the first executor 120, informing it that the use of the second data D2 is complete. Upon receiving this feedback message, the first executor 120 modifies the state of its finite state machine and places the second output data buffer 122 in a vacant state, so as to prepare storage space for the data required by the fourth round of operation. After obtaining the second operation result data R2, the second executor 130 starts the result-data exchange of the second round of operation. When this exchange ends, the second executor 130 sends a feedback message to the executor component 110 and a carry-permitted message to the first executor 120, so that the executor component 110 immediately performs the third round of operation during the period T₂³ while, within the same period T₂³, the first executor 120 carries the data D4 required by the fourth round of operation into the second output data buffer 122. The time taken by this carrying is T₁⁴.
As described above, although the data-carrying time T₁³ required for the third round of operation is actually incurred, it coincides with T₂². The actual data processing time T of the third round of operation is therefore shortened from the conventional T₁³ + T₂³ + T₃³ to T₂³ + T₃³. By analogy, T₁⁴ is parallel to T₂³, T₁⁵ is parallel to T₂⁴, ..., and T₁ⁿ⁺¹ is parallel to T₂ⁿ. Moreover, in this way the carrying of raw data and the exchange of result data or parameters take place in a time-shared manner, so that when the coprocessor 100 communicates with peripheral devices (for example, other coprocessors or the CPU), the actual execution time of this time-shared communication is markedly less than the execution time required when data carrying and data exchange proceed simultaneously. Specifically, on the one hand, compared with the prior art, T₁ is effectively eliminated in the present disclosure (it still occurs, but in parallel within T₂); on the other hand, T₃ according to the present disclosure is also much smaller than T₃ in the prior art.
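The steady-state effect of this overlap can be checked with a small simulation. The following is a minimal C++ sketch, not the disclosed implementation: the stage durations reuse the assumed values from the earlier example, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>

// Sketch of the double-buffered schedule: the copy for round n+1
// (time t1) overlaps the operation stage of round n (time t2), so a
// steady-state round costs max(t2, t1) + t3 instead of t1 + t2 + t3.
int main() {
    const double t1 = 2.0, t2 = 10.0, t3 = 5.0;  // assumed ms per stage
    double busy_until = t1;  // the very first copy cannot be hidden
    for (int round = 1; round <= 5; ++round) {
        double start = busy_until;
        busy_until = start + std::max(t2, t1) + t3;
        std::printf("round %d: PROCESS+EXCHANGE from %.1f to %.1f ms\n",
                    round, start, busy_until);
    }
    return 0;
}
```

Under the assumption T₁ ≤ T₂, the copy is fully hidden and each steady-state round costs T₂ + T₃; only if T₁ exceeded T₂ would part of the copy time reappear in the schedule.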
Figure 2 is a schematic diagram illustrating the principle of an executor in the coprocessor according to the present disclosure. As shown in Figure 2, each large dashed box represents one executor. In the executor network component shown in Figure 2, only five executors are shown; in fact, corresponding to the task topology graph, the executor network component contains as many executors as the neural network has nodes. Figure 2 schematically shows the composition of each executor of the present disclosure, which comprises a message bin, a finite state machine, a processing component, and an output data buffer. As can be seen from Figure 2, each executor appears to contain an input data buffer as well, but it is drawn with a dashed line; it is in fact an imaginary component, as will be explained in detail later. Each executor in the data processing path, for example the first executor in Figure 2, is established on the basis of one node of a neural network in the form of a task topology graph and, on the basis of the full node attributes, acquires its topological relationship with its upstream and downstream executors, its message bin, its finite state machine, its processing method (processing component), and the cache location of the data it generates (output data buffer). Specifically, when the first executor performs data processing, its task requires, for example, two input data, namely the output data of the second executor and of the fourth executor upstream of it. When the second executor generates data to be output to the first executor, for example the second data, the second executor sends a data-ready message to the message bin of the first executor, informing the first executor that the second data is already in the output data buffer of the second executor and is available, so that the first executor can read the second data at any time; the second data then remains waiting to be read by the first executor. The finite state machine of the first executor modifies its state after the message bin receives the message from the second executor. Likewise, when the fourth executor generates data to be output to the first executor, for example the fourth data, the fourth executor sends a data-ready message to the message bin of the first executor, informing the first executor that the fourth data is already in the output data buffer of the fourth executor and is available, so that the first executor can read the fourth data at any time; the fourth data then remains waiting to be read by the first executor. The finite state machine of the first executor modifies its state after the message bin receives the message from the fourth executor. Likewise, if the processing component of the first executor generated data, for example the first data, when it last executed its computing task, that data is cached in its output data buffer, and the first executor sends its downstream executors, for example the third executor and the fifth executor, a message that the first data can be read.
After the third executor and the fifth executor have read and finished using the first data, each feeds back a message to the first executor informing it that the first data has been used, whereupon the output data buffer of the first executor becomes vacant. At this point the finite state machine of the first executor again modifies its state.
In this way, when the state changes of the finite state machine reach a predetermined state, for example when the input data required by the first executor for its operation (for example, the second data and the fourth data) are all available and its output data buffer is vacant, the processing component is told to read the second data from the output data buffer of the second executor and the fourth data from the output data buffer of the fourth executor and to execute the specified computing task, thereby generating the executor's output data, for example new first data, which is stored in the output data buffer of the first executor.
Likewise, when the first executor completes the specified computing task, the finite state machine returns to its initial state and waits for the next state-change cycle; at the same time, the first executor feeds back to the message bin of the second executor a message that the use of the second data is complete, feeds back to the message bin of the fourth executor a message that the use of the fourth data is complete, and sends its downstream executors, for example the third executor and the fifth executor, a message that the first data has been generated, informing them that the first data is now in a readable state.
When the second executor receives the message that the first executor has finished using the second data, the output data buffer of the second executor is placed in a vacant state. Likewise, when the fourth executor receives the message that the first executor has finished using the fourth data, the output data buffer of the fourth executor is placed in a vacant state.
The above process of task execution by the first executor likewise takes place in the other executors. Thus, under the control of the finite state machine in each executor, tasks of the same kind are processed cyclically on the basis of the output results of the upstream executors. Each executor thereby acts like a worker with a fixed task at a post along a data processing path, forming a pipeline for data processing without the need for any other external instructions.
As mentioned above, each executor in Figure 2 is shown as containing an input data buffer, but in fact it contains none, because an executor does not need any buffer of its own to store the data it will use; it only needs the data it requires to be in a readable state. The data each executor will use therefore remains stored in the output data buffer of its upstream executor for as long as the executor is not actually executing. For visual clarity, the input data buffer in each executor is thus drawn with a dashed line; it does not really exist in the executor. In other words, the output data buffer of an upstream executor is the virtual input data buffer of its downstream executor. This is why the input data buffers in Figure 2 are marked with dashed lines.
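The message-driven behavior described above can be summarized in a minimal, single-threaded C++ sketch. It is a simplification, not the disclosed implementation: the finite state machine is reduced to two readiness flags, there is one input instead of two, and all names and the doubling "task" are hypothetical.

```cpp
#include <cstdio>
#include <queue>

// Hypothetical sketch of one executor: a message bin, a finite state
// machine reduced to two readiness flags, a processing step, and an
// output buffer that stays full until downstream consumers release it.
enum class Msg { InputReady, OutputReleased };

struct Executor {
    std::queue<Msg> message_bin;
    bool input_ready = false;    // upstream data is readable
    bool output_vacant = true;   // our last result has been consumed
    int output_buffer = 0;       // generated data lives here

    void post(Msg m) { message_bin.push(m); }

    // Drain the message bin; fire the processing component only when
    // the trigger condition of the finite state machine is met.
    void step(int upstream_value) {
        while (!message_bin.empty()) {
            Msg m = message_bin.front(); message_bin.pop();
            if (m == Msg::InputReady)     input_ready = true;
            if (m == Msg::OutputReleased) output_vacant = true;
        }
        if (input_ready && output_vacant) {
            output_buffer = upstream_value * 2;  // the "specified task"
            input_ready = false;   // would notify upstream: data used
            output_vacant = false; // would notify downstream: data ready
            std::printf("produced %d\n", output_buffer);
        }
    }
};

int main() {
    Executor e;
    e.post(Msg::InputReady);
    e.step(21);  // fires: input ready and output buffer vacant
    e.post(Msg::InputReady);
    e.step(10);  // does not fire: output not yet released downstream
    e.post(Msg::OutputReleased);
    e.step(10);  // fires again
    return 0;
}
```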
Figure 3 is a timing diagram of continuous data processing performed by the coprocessor according to the present disclosure. As shown in Figure 3, in the initial state two data D₁ and D₂ are stored at once into the first output data buffer 121 and the second output data buffer 122. After execution starts, in the first round of data processing step L₁ the executor component 110 reads the data D₁ in the first output data buffer 121 and performs the first round of operation, after which the second executor 130 outputs the result data R1 or exchanges it with other parallel devices. Immediately afterwards, in the second round of data processing step L₂, the executor component 110, upon the feedback message sent by the second executor 130, reads the data D₂ in the second output data buffer 122 and performs the second round of operation to obtain the result data R2, while the first executor 120, upon the carry-permitted message sent by the second executor 130, carries the external data D₃ into the first output data buffer 121 for use in the third round of data processing. The cycle continues in this way: in the n-th round of data processing step Lₙ, the executor component 110, upon the feedback message sent by the second executor 130, reads the data D from one of the first output data buffer 121 and the second output data buffer 122 and performs the n-th round of operation to obtain the result data R, while the first executor 120, upon the carry-permitted message sent by the second executor 130, carries the external data Dₙ₊₁ into whichever of the first output data buffer 121 and the second output data buffer 122 is vacant, for use in the (n+1)-th round of data processing.
Although the specific content of the present disclosure has been described above with reference to Figures 1-3, an alternative that gives priority to parameter exchange may also be provided, in which the first executor 120 has only one output data buffer, for example the first output data buffer 121. Specifically, the executor component 110 reads the data to be processed from the output data buffer 121 of the first executor 120, executes the predetermined operation processing to obtain the operation result data, and after obtaining the operation result data feeds back a message to the first executor 120 instructing it to place the output data buffer 121 in a vacant state. The second executor 130, upon receiving the message from the executor component 110, outputs the operation result data of the executor component 110 to the peripheral devices through the communication channel between the coprocessor and the peripheral devices, and after the operation result data has been output (or after the parameter exchange has completed) feeds back a message to the executor component 110 and sends the first executor 120 a carry-permitted message. Upon receiving the carry-permitted message from the second executor 130, the first executor 120 carries the data to be processed via the communication channel between the coprocessor and the peripheral devices into the vacant output data buffer 121. In this way, in a deep learning system the coprocessor gives priority to parameter exchange and performs the copying of new data afterwards.
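A minimal C++ sketch of this single-buffer alternative follows; the three stages are plain placeholder calls and all names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical sketch of the single-buffer variant: the exchange
// always completes before the next copy starts, so the two never
// share the PCIe channel, at the cost of not hiding the copy time.
void copy_in(int round)  { std::printf("COPY     D%d\n", round); }
void process(int round)  { std::printf("PROCESS  D%d -> R%d\n", round, round); }
void exchange(int round) { std::printf("EXCHANGE R%d\n", round); }

int main() {
    for (int round = 1; round <= 3; ++round) {
        copy_in(round);    // buffer 121 is vacant here
        process(round);    // the executor component runs the operation
        exchange(round);   // the second executor outputs / exchanges params
    }
    return 0;
}
```

The point of the variant is ordering rather than overlap: exchange and copy are strictly serialized on the channel, so they never contend for bandwidth, but T₁ remains on the critical path of every round.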
Figure 4 is a flowchart of a data processing method in the coprocessor according to the present disclosure. As shown in Figure 4, at step S410 the coprocessor is in the initial state, and the first executor 120 stores at once the data D to be processed (for example, data D₁ and D₂) into the first output data buffer 121 and the second output data buffer 122. Alternatively, data may at first be stored in only one of the output data buffers while the other remains vacant. Subsequently, at step S415, the executor component 110 reads data from one of the first output data buffer 121 and the second output data buffer 122. In the initial state, which of the first output data buffer 121 and the second output data buffer 122 is read first for the operation has no bearing on the implementation of the technical solution of the present disclosure. For convenience of presentation, the first output data buffer 121 is described first: the executor component 110 reads data from the first output data buffer 121 and performs the n-th round of data operation, obtaining the n-th result data R. After obtaining the result data R at step S415, the executor component 110, on the one hand, feeds back a message to the first executor 120 so that at step S420 the first executor 120 places the first output data buffer 121 in a state in which data can be written to it; on the other hand, the executor component 110 simultaneously sends the second executor 130 a message that data exchange may proceed, informing the second executor 130 that at step S425 it may output the n-th result data R or exchange data with an external CPU or a parallel coprocessor (for example, a GPU). After step S425 ends, the second executor 130, on the one hand, sends a feedback message to the executor component 110 informing it that the next round of data processing may proceed and, on the other hand, sends the first executor 120 a message that data may be carried in from outside. Subsequently, at step S430, the executor component 110, having received the feedback message from the second executor 130, performs the (n+1)-th round of operation processing on the data D in the second output data buffer 122 and obtains the (n+1)-th result data R. While the (n+1)-th round of operation processing is in progress, the first executor 120 at step S435 carries the next data D to be processed from outside into the vacant first output data buffer 121, in readiness for the (n+2)-th round of operation processing.
After obtaining the result data R at step S430, the executor component 110, on the one hand, feeds back a message to the first executor 120 so that at step S440 the first executor 120 places the second output data buffer 122 in a state in which data can be written to it; on the other hand, the executor component 110 simultaneously sends the second executor 130 a message that data exchange may proceed, informing the second executor 130 that at step S445 it may output the (n+1)-th result data R or exchange data with an external CPU or a parallel coprocessor (for example, a GPU).
After step S445 ends, the second executor 130, on the one hand, sends a feedback message to the executor component 110 informing it that the next round of data processing may proceed and, on the other hand, sends the first executor 120 a message that data may be carried in from outside. Subsequently, at step S450, the executor component 110, having received the feedback message from the second executor 130, performs the (n+2)-th round of operation processing on the data D in the first output data buffer 121 and obtains the (n+2)-th result data R. While the (n+2)-th round of operation processing is in progress, the first executor 120 at step S455 carries the next data D to be processed from outside into the vacant second output data buffer 122, in readiness for the (n+3)-th round of operation processing.
Subsequently, at step S460, the counter is adjusted by n = n + 2, and the method returns to step S420 to repeat steps S420-S460.
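The flow of steps S410-S460 amounts to a ping-pong loop over the two buffers. Below is a minimal C++ sketch of that control flow, under the simplifying assumption that the compute, exchange, and copy actions are synchronous placeholder calls rather than concurrently running executors; all names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical placeholders for the actions in Figure 4.
int  process(int buffer, int n) { std::printf("round %d: PROCESS buffer %d\n", n, buffer); return n; }
void exchange(int r)            { std::printf("round %d: EXCHANGE R%d\n", r, r); }
void copy_in(int buffer, int n) { std::printf("         COPY D%d -> buffer %d\n", n, buffer); }

int main() {
    const int buf[2] = {121, 122};
    // S410: prefill both buffers with D1 and D2
    copy_in(buf[0], 1);
    copy_in(buf[1], 2);
    for (int n = 1; n <= 6; ++n) {
        int cur = buf[(n - 1) % 2];  // the buffer holding D_n
        int r = process(cur, n);     // S415/S430/S450: round n operation
        exchange(r);                 // S425/S445: output or exchange R_n
        copy_in(cur, n + 2);         // S435/S455: refill the vacated buffer
        // (in the disclosed scheme this copy overlaps round n+1's PROCESS)
    }
    return 0;
}
```

The sketch only fixes the order in which the buffers are consumed and refilled; in the disclosed scheme, the final copy of each iteration runs in parallel with the process stage of the next iteration rather than after the exchange.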
Although the coprocessor of the present disclosure has been described in detail above, it should be pointed out that the first executor may include not merely the two output data buffers 121 and 122 but three or more. For example, when the executor component 110 needs to read more than two data in one round of operation, four output data buffers may be provided for the first executor 120. This can be configured according to actual needs.
Although the executor component 110 and the second executor 130 are described as two independent entities in the description of the present disclosure, the second executor 130 may alternatively itself be part of the executor component 110, with the operation processing performed by the one and the output and exchange of result data performed by the other executed in immediate succession as a whole. Therefore, although the present disclosure separates the two for convenience of description, this does not mean that their separate existence is an arrangement necessary for realizing the present disclosure.
When a coprocessor according to the present disclosure, for example a GPU, is used in the fields of big data and deep learning, there is a great deal of data carrying and data exchange; with the communication bandwidth between the GPU and the outside fixed, it is therefore very important to reduce the amount of data in flight at any moment so as to raise the effective data transmission speed. When a coprocessor according to the present disclosure is employed in big data computation and deep learning, the copying of data and the exchange of data take place in a time-shared manner, so that for a fixed bandwidth the instantaneous communication load is reduced and the speed of data communication is correspondingly increased. At the same time, because the copying of data takes place while the previous round of operation is in progress, the present disclosure makes full use of the fact that no data exchange occupies the bandwidth during the operation stage, moving the copy for the next round of data forward into the operation period of the previous round, thereby achieving time-shared occupation of the bandwidth by data copying and data exchange. The object of the present disclosure is thus finally achieved.
More importantly, in order to achieve the object of the present disclosure, two or more output data buffers are provided; in combination with the control process described above, this realizes streaming processing of data and greatly improves the data processing speed.
Although the executor component 110 is referred to in the present disclosure as a component, it may contain only a single executor, or multiple executors may make up a data processing path or a data processing network.
Thus far, this specification has described a coprocessor and a method for accelerating data processing in the coprocessor according to embodiments of the present disclosure. By configuring two or more output data buffers for one executor component and controlling, by means of control signals, the time period in which each output data buffer copies data, the coprocessor and method according to the present disclosure can greatly improve the utilization of the executor component: the executor component only needs to wait for one data output or exchange interval and never has to wait for the copying of the data to be processed, which shortens the executor component's waiting time, raises its utilization of time, and thereby also raises the efficiency of the coprocessor.
The basic principles of the present disclosure have been described above in conjunction with specific embodiments. It should be pointed out, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and apparatus of the present disclosure can be implemented in hardware, firmware, software, or a combination thereof in any computing device (including processors, storage media, and the like) or network of computing devices, which those of ordinary skill in the art can accomplish with their basic programming skills after reading the description of the present disclosure.
Therefore, the object of the present disclosure can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present disclosure can therefore also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is to say, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. Obviously, the storage medium may be any well-known storage medium or any storage medium developed in the future.
It should also be pointed out that, in the apparatus and method of the present disclosure, the components or steps can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure. Moreover, the steps of the series of processing described above may naturally be performed chronologically in the order described, but need not necessarily be; some steps may be performed in parallel or independently of one another.
The specific implementations described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (10)

  1. A coprocessor, comprising an executor component, a first executor having at least two output data buffers, and a second executor, wherein:
    the executor component reads data to be processed, in turn, from one of the at least two output data buffers of the first executor and executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor so as to inform the first executor to place said one of the at least two output data buffers in a vacant state;
    the second executor, after obtaining a message from the executor component, outputs the operation result data of the executor component to a peripheral device through a communication channel between the coprocessor and the peripheral device, and after the operation result data has been output, feeds back a message to the executor component and sends the first executor a message that data may be carried; and
    the first executor, after obtaining the carry-permitted message from the second executor, carries data to be processed via the communication channel between the coprocessor and the peripheral device into whichever of the at least two output data buffers is vacant, while the executor component, after obtaining the message fed back by the second executor, reads data to be processed from the other of the at least two output data buffers of the first executor and executes the predetermined operation processing to obtain further operation result data.
  2. The coprocessor according to claim 1, wherein the number of the at least two output data buffers is three, four, or five.
  3. The coprocessor according to claim 1, wherein the data carried by the first executor is operation result data output by other coprocessors.
  4. The coprocessor according to claim 1, wherein the executor component comprises at least one executor.
  5. The coprocessor according to any one of claims 1-4, wherein each executor performs its prescribed operation when its own finite state machine reaches an execution trigger condition.
  6. A data processing acceleration method for a coprocessor, the coprocessor comprising an executor component, a first executor having at least two output data buffers, and a second executor, the method comprising:
    reading, by the executor component, first data to be processed from a first output data buffer of the at least two output data buffers of the first executor, executing predetermined operation processing to obtain first operation result data, and, after the first operation result data is obtained, feeding back a message to the first executor and sending the second executor a message that an operation may be performed;
    placing, by the first executor after it obtains the feedback message from the executor component, the first output data buffer storing the first data in a vacant state;
    outputting, by the second executor after it obtains the message from the executor component, the first operation result data of the executor component to a peripheral device through a communication channel between the coprocessor and the peripheral device, and, after the first operation result data has been output, feeding back a message to the executor component and sending the first executor a message that data may be carried;
    after the carry-permitted message from the second executor is obtained, carrying, by the first executor via the communication channel between the coprocessor and the peripheral device, data to be processed as new first data into the vacant first output data buffer of the at least two output data buffers, while the executor component, having obtained the message fed back by the second executor, reads second data to be processed from a second output data buffer of the at least two output data buffers of the first executor and executes predetermined operation processing to obtain second operation result data;
    feeding back, by the executor component after it obtains the second operation result data, a message to the first executor and sending the second executor a message that an operation may be performed;
    placing, by the first executor after it obtains the feedback message from the executor component, the second output data buffer storing the second data in a vacant state;
    outputting, by the second executor after it obtains the message from the executor component, the second operation result data of the executor component through the communication channel, and, after the second operation result data has been output, feeding back a message to the executor component and sending the first executor a message that data may be carried; and
    after the carry-permitted message from the second executor is obtained, while the first executor carries data to be processed as new second data via the communication channel between the coprocessor and the peripheral device into the vacant second output data buffer of the at least two output data buffers, beginning to repeat the above steps.
  7. The data processing acceleration method for a coprocessor according to claim 6, wherein the number of the at least two output data buffers is three, four, or five.
  8. The data processing acceleration method for a coprocessor according to claim 6, wherein the data carried by the first executor is operation result data output by other coprocessors.
  9. The data processing acceleration method for a coprocessor according to claim 6, wherein the executor component comprises at least one executor.
  10. The data processing acceleration method for a coprocessor according to any one of claims 6-9, wherein each executor performs its prescribed operation when its own finite state machine reaches an execution trigger condition.
PCT/CN2020/093840 2019-07-15 2020-06-02 Coprocessor and data processing acceleration method therefor WO2021008257A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910633024.3A CN110188067B (en) 2019-07-15 2019-07-15 Coprocessor and data processing acceleration method thereof
CN201910633024.3 2019-07-15

Publications (1)

Publication Number Publication Date
WO2021008257A1 (en) 2021-01-21

Family

ID=67725699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093840 WO2021008257A1 (en) 2019-07-15 2020-06-02 Coprocessor and data processing acceleration method therefor

Country Status (2)

Country Link
CN (1) CN110188067B (en)
WO (1) WO2021008257A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188067B (en) * 2019-07-15 2023-04-25 北京一流科技有限公司 Coprocessor and data processing acceleration method thereof
CN112785483B (en) * 2019-11-07 2024-01-05 深南电路股份有限公司 Method and equipment for accelerating data processing
CN110955529B (en) * 2020-02-13 2020-10-02 北京一流科技有限公司 Memory resource static deployment system and method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011034189A (en) * 2009-07-30 2011-02-17 Renesas Electronics Corp Stream processor and task management method thereof
CN102542525B (en) * 2010-12-13 2014-02-12 联想(北京)有限公司 Information processing equipment and information processing method
CN107544937A (en) * 2016-06-27 2018-01-05 深圳市中兴微电子技术有限公司 A kind of coprocessor, method for writing data and processor

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130132771A1 (en) * 2011-11-18 2013-05-23 Nokia Corporation Method and apparatus for providing information consistency in distributed computing environments
US20140310259A1 (en) * 2013-04-15 2014-10-16 Vmware, Inc. Dynamic Load Balancing During Distributed Query Processing Using Query Operator Motion
CN107783721A (en) * 2016-08-25 2018-03-09 华为技术有限公司 The processing method and physical machine of a kind of data
CN107908471A (en) * 2017-09-26 2018-04-13 聚好看科技股份有限公司 A kind of tasks in parallel processing method and processing system
CN108404415A (en) * 2018-03-22 2018-08-17 网易(杭州)网络有限公司 The treating method and apparatus of data
CN110188067A (en) * 2019-07-15 2019-08-30 北京一流科技有限公司 Coprocessor and its data processing accelerated method

Also Published As

Publication number Publication date
CN110188067A (en) 2019-08-30
CN110188067B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US11782710B2 (en) Execution or write mask generation for data selection in a multi-threaded, self-scheduling reconfigurable computing fabric
WO2021008257A1 (en) Coprocessor and data processing acceleration method therefor
JP7426979B2 (en) host proxy on gateway
US11675734B2 (en) Loop thread order execution control of a multi-threaded, self-scheduling reconfigurable computing fabric
US20210243080A1 (en) Efficient Loop Execution for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric
WO2021008259A1 (en) Data processing system for heterogeneous architecture and method therefor
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
US11586571B2 (en) Multi-threaded, self-scheduling reconfigurable computing fabric
WO2021008260A1 (en) Data executor and data processing method thereof
WO2021008258A1 (en) Data flow acceleration member in data processing path of coprocessor and method thereof
JP2005235228A (en) Method and apparatus for task management in multiprocessor system
CN106844017A (en) The method and apparatus that event is processed for Website server
WO2020163315A1 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
TW200402657A (en) Registers for data transfers
JP2022545697A (en) sync network
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
US8631086B2 (en) Preventing messaging queue deadlocks in a DMA environment
CN112639738A (en) Data passing through gateway
CN107180010A (en) Heterogeneous computing system and method
WO2021159926A1 (en) Static deployment system and method for memory resources
CN110347450B (en) Multi-stream parallel control system and method thereof
CN110245024B (en) Dynamic allocation system and method for static storage blocks
WO2016008317A1 (en) Data processing method and central node
CN111475205B (en) Coarse-grained reconfigurable array structure design method based on data flow decoupling
JP7406539B2 (en) streaming engine

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20840250; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20840250; Country of ref document: EP; Kind code of ref document: A1)