WO2021008257A1 - Coprocessor and data processing acceleration method therefor


Info

Publication number
WO2021008257A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
executor
message
component
executive
Prior art date
Application number
PCT/CN2020/093840
Other languages
French (fr)
Chinese (zh)
Inventor
袁进辉
成诚
Original Assignee
北京一流科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京一流科技有限公司
Publication of WO2021008257A1 publication Critical patent/WO2021008257A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to a coprocessor and, more specifically, to a method for accelerating data processing in the coprocessor.
  • GPU (Graphics Processing Unit)
  • APU (Arithmetic Processing Unit)
  • GPUs are currently used for big-data computation and deep learning because a GPU is in effect a collection of graphics functions implemented in hardware.
  • a GPU has a highly parallel structure, so it processes graphics data and complex algorithms more efficiently than a CPU.
  • when a CPU executes a computing task, it processes one piece of data at a time, with no true parallelism, whereas a GPU has multiple processor cores and can process multiple pieces of data in parallel.
  • compared with a CPU, a GPU devotes more of its hardware to ALUs for data processing rather than to data caches and flow control.
  • ALU (Arithmetic Logic Unit)
  • such a structure is well suited to large-scale data that is uniform in type and mutually independent, and to a clean computing environment that does not need to be interrupted.
  • a coprocessor, which includes an executor component, a first executor having at least two output data buffers, and a second executor. The executor component reads the data to be processed from the first executor's at least two output data buffers in turn, executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the corresponding one of the at least two output data buffers in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output, feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, while the executor component reads the data to be processed from the other of the at least two output data buffers and executes predetermined operation processing to obtain further operation result data, the first executor carries the data to be processed, via the communication channel between the coprocessor and the peripheral device, to whichever of the at least two output data buffers is vacant.
  • the number of the at least two output data buffers is three, four or five.
  • the data carried by the first executor is result data output by other coprocessors.
  • the executor component includes at least one executor.
  • each executor performs its prescribed operation when its own finite state machine reaches the execution trigger condition.
  • a data processing acceleration method for a coprocessor that includes an executor component, a first executor having at least two output data buffers, and a second executor.
  • the method includes: reading, by the executor component, the first data to be processed from the first of the first executor's at least two output data buffers and executing predetermined operation processing to obtain first operation result data, then feeding back a message to the first executor and sending the second executor a message that the output operation may proceed; placing, by the first executor, the first output data buffer storing the first data in a vacant state on receiving the feedback message from the executor component; outputting, by the second executor, the first operation result data of the executor component through the communication channel on receiving the message from the executor component, then feeding back a message to the executor component and sending the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, carrying, by the first executor, the data to be processed as new first data to the vacant first output data buffer while the executor component reads the second data from the second output data buffer and obtains second operation result data.
  • the second operation result data is likewise output to the peripheral device through the communication channel between the coprocessor and the peripheral device, after which a message is fed back to the executor component and a carry-data message is sent to the first executor; and, while the first executor carries the data to be processed as new second data to the vacant second output data buffer via that channel, the above steps begin to repeat.
  • the number of the at least two output data buffers is three, four or five.
  • the data carried by the first executor is result data output by other coprocessors.
  • the executor component includes at least one executor.
  • each executor executes its prescribed operation when its own finite state machine reaches the execution trigger condition.
  • a coprocessor including an executor component, a first executor, and a second executor, wherein the executor component reads the data to be processed from the output data buffer of the first executor, performs predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the output data buffer in a vacant state.
  • after obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output, feeds back a message to the executor component and sends the first executor a message that data may be carried; after the first executor obtains the carry-data message from the second executor, it carries the data to be processed to the vacant output data buffer via the communication channel between the coprocessor and the peripheral device.
  • two or more output data buffers are configured for the first executor, and the message the second executor sends the first executor once each interaction completes triggers the carrying of pending data into each output data buffer while the executor component simultaneously performs data operations. This greatly improves the utilization of the executor component: it only has to wait for a single data-output or interaction interval and never for pending data to be carried in, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
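  • the following is a minimal runnable sketch of that pipeline, assuming a three-thread model (executor component, first executor with two buffers, second executor), Python semaphores standing in for the messages, and illustrative stage times; none of the names or timings come from the patent itself:

```python
import queue
import threading
import time

T1, T2, T3 = 0.002, 0.005, 0.005   # copy, operate, output/exchange (seconds, assumed)
ROUNDS = 6

buf_full = [threading.Semaphore(0), threading.Semaphore(0)]  # buffer holds data
buf_free = [threading.Semaphore(1), threading.Semaphore(1)]  # buffer is vacant
carry_msg = threading.Semaphore(2)  # "data may be carried" (2: initial state fills both)
proc_go = threading.Semaphore(1)    # second executor's feedback to the component
result_q = queue.Queue()

def first_executor():
    # Carries external data into whichever buffer is vacant, once per carry message.
    for n in range(ROUNDS):
        carry_msg.acquire()
        b = n % 2
        buf_free[b].acquire()
        time.sleep(T1)               # transfer over the PCIe channel
        buf_full[b].release()

def executor_component():
    # Alternately reads the two buffers and performs the operation.
    for n in range(ROUNDS):
        proc_go.acquire()            # wait for the second executor's feedback
        b = n % 2
        buf_full[b].acquire()        # wait for data in this round's buffer
        time.sleep(T2)               # the operation itself
        buf_free[b].release()        # feedback: the buffer may be set vacant
        result_q.put(f"R{n + 1}")    # hand the result to the second executor

def second_executor():
    # Outputs each result, then signals both other actors.
    for _ in range(ROUNDS):
        result_q.get()
        time.sleep(T3)               # output / parameter interaction
        proc_go.release()            # component may start the next round
        carry_msg.release()          # first executor may carry the next data

threads = [threading.Thread(target=f)
           for f in (first_executor, executor_component, second_executor)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - t0
print(f"{ROUNDS} rounds in {elapsed * 1e3:.1f} ms "
      f"(strictly serial would be about {ROUNDS * (T1 + T2 + T3) * 1e3:.1f} ms)")
```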
  • Figure 1 shows a schematic structural diagram of a coprocessor according to the present disclosure
  • Figure 2 shows a schematic diagram of the executive body in the coprocessor according to the present disclosure
  • Figure 3 shows a timing diagram of data processing performed by the coprocessor according to the present disclosure.
  • Figure 4 shows a flowchart of a method for data processing in a coprocessor according to the present disclosure.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms; the terms merely distinguish information of the same type from one another.
  • for example, one of two possible positions may be called the first operation result or the second operation result, and likewise the other may be called the second operation result or the first operation result.
  • depending on the context, the word "if" as used herein can be interpreted as "when", "while", or "in response to determining".
  • FIG. 1 shows a schematic structural diagram of a coprocessor according to the present disclosure.
  • the coprocessor 100-1 includes an executor component 110, a first executor 120, and a second executor 130.
  • the first executor 120 includes at least two output data buffers, 121 and 122.
  • the first executor 120 may include three or more output data buffers.
  • for convenience of description, the output data buffer 121 is referred to as the first output data buffer and the output data buffer 122 as the second output data buffer.
  • the naming could equally be reversed, with buffer 121 called the second output data buffer and buffer 122 the first.
  • the first output data buffer 121 and the second output data buffer 122 store the data to be processed, which the first executor 120 carries in from outside via the communication channel PCIe between the coprocessor 100-1 and the peripheral device.
  • the coprocessor 100-1 is connected, through inter-device or inter-chip interconnect bandwidth such as PCIe or NvLink, to other coprocessors of the same kind, 100-2, 100-3, ..., 100-N, and together they can process, in parallel, computing tasks distributed by the host CPU. Consequently, during data processing the coprocessors 100-1, 100-2, 100-3, ..., 100-N exchange a large amount of operation result data and/or parameters with one another and must repeatedly carry data in from outside through an inter-chip interconnect communication channel such as PCIe.
  • a parameter is also a kind of operation result data.
  • during data processing, the coprocessor 100 generally passes through three stages: copying the data to be processed from outside (COPY), performing the data operation (PROCESS), and outputting result data to, or exchanging parameters with, the outside (EXCHANGE).
  • the times usually spent in these three stages are T1, T2, and T3 respectively.
  • the operation time traditionally dominated data processing, with data copying and transmission taking comparatively little time, so efforts to increase processing speed focused on raising the operation unit's computing speed. As the computing performance of executor components has risen sharply, the share of total time spent on the operation itself has shrunk until it is roughly comparable to the time spent on data carrying or data communication, and it has become very difficult to shorten data processing further by raising computation speed alone.
  • the total time T of one round of data processing in the coprocessor 100 is generally T = T1 + T2 + T3.
  • in a conventional deep-learning system, T1 + T3 is effectively a single whole:
  • data copying and parameter interaction proceed simultaneously and occupy the same inter-chip interconnect bandwidth,
  • so at a fixed bandwidth the total time spent is essentially no shorter than the sum of the times that bandwidth would need to handle the copying and the interaction separately; for example, copying alone might take 2 ms and parameter interaction 5 ms, yet performing both simultaneously over the same bandwidth usually takes 7 ms or more.
  • T1 and T3 are described separately here for ease of understanding.
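  • the gain at stake can be seen with a back-of-the-envelope comparison; the timings below are assumptions for illustration (the 2 ms / 5 ms figures echo the example above), not values from the patent:

```python
# Assumed per-round times (ms): T1 = copy, T2 = operate, T3 = output/exchange.
T1, T2, T3, rounds = 2, 5, 5, 100

serial = rounds * (T1 + T2 + T3)      # copy, operate, exchange strictly in sequence
overlapped = T1 + rounds * (T2 + T3)  # each round's copy hidden under an earlier T2
print(serial, overlapped)             # 1200 vs 1002
```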
  • the inventors of the present disclosure therefore attempt to shorten the combined time each round spends on data carrying and on data output or interaction.
  • the present disclosure saves part of the total round time T by giving the first executor 120 both a first output data buffer 121 and a second output data buffer 122.
  • the first output data buffer 121 and the second output data buffer 122 store data D to be processed.
  • the executor component 110 reads the data D to be processed from the first output data buffer 121 and the second output data buffer 122 in turn, and executes predetermined operation processing.
  • the executor component 110 first reads the first data D1 to be processed, stored in the first output data buffer 121, and executes a predetermined first round of operation processing to obtain the first-round operation result data R1.
  • each round of operation divides into three stages: the first stage is the copy time T1, the second the operation time T2, and the third the result-data processing time T3.
  • after the executor component 110 has performed the operation on the first data D1, it feeds back a message to the first executor 120, informing the first executor 120 that the executor component 110 has finished using the first data D1.
  • after receiving the feedback message, the finite state machine of the first executor 120 modifies its state and places the first output data buffer 121 in a vacant state, thereby preparing storage space for the data required by a later round of operations (except in the initial state).
  • the second executor 130 outputs the first-round operation result data R1 via the PCIe communication channel so as to interact with the other coprocessors 100 or with the host CPU.
  • the time taken by the second executor 130 for this output or interaction of the result data R1 is T3, and it must follow the operation time T2; it is written here, for example, as T3^1, where the digit after T denotes the stage within a round and the superscript denotes the round of data processing.
  • as soon as the second executor 130 completes the output of the operation result data, it immediately feeds back a message to the executor component 110, which can immediately proceed to the next round of operations on the basis of that feedback. At the same time, the second executor 130 immediately sends a message to the first executor 120 so that the finite state machine of the first executor 120 reaches the trigger condition for data carrying; thus, while the executor component 110 performs its next round of operations, the first executor 120 carries the data to be processed into the vacant first output data buffer 121 via the communication channel PCIe, preparing data for the round after the second round (for example, the third round).
  • specifically, once the first-round result data R1 has been output or exchanged, the second executor 130 immediately feeds back to the executor component 110 a message that the second round of operations may proceed; on receiving this feedback,
  • the executor component 110 reads the data D2 to be processed from the second output data buffer 122 and performs the second round of operations.
  • the time taken by the executor component 110 to perform the second round is T2^2.
  • at the same time, the second executor 130 also sends the first executor 120 a message that data may be carried, informing the first executor 120 that a data-carrying operation can begin immediately.
  • after the first executor 120 obtains the carry-data message from the second executor 130, its finite state machine modifies its state, triggering the first executor 120 to carry the data D3 to be processed, via the communication channel PCIe, into the vacant first output data buffer 121 for storage, thereby preparing data for the operation of the executor component 110 following the second round (for example, the third round).
  • the time taken by the first executor 120 to carry data D3 is T1^3.
  • by the time the third round starts, its required data has already been copied in, so the third round need not wait for data to be copied into free storage space.
  • the third round thus saves the copy time T1^3, shortening its actual data processing time from the conventional T1^3 + T2^3 + T3^3 to T2^3 + T3^3.
  • after the second round of operations, the executor component 110 feeds back a message to the first executor 120, notifying the first executor 120 that it has completed its use of the second data D2.
  • the first executor 120 then modifies the state of its finite state machine and places the second output data buffer 122 in a vacant state, preparing storage space for the data required by the fourth round of operations.
  • at the same time, the second executor 130 starts the result-data interaction for the second round of operations.
  • when that interaction completes, the second executor 130 feeds back a message to the executor component 110 and sends a carry-data message to the first executor 120, so that the executor component 110 immediately executes the third round of operations within the time period T2^3
  • while the first executor 120 simultaneously carries the data D4 required by the fourth round into the second output data buffer 122 within that same period T2^3.
  • the time taken for this carrying is T1^4.
  • thus T1^4 is parallel to T2^3,
  • T1^5 is parallel to T2^4,
  • and in general T1^(n+1) is parallel to T2^n.
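  • under the same assumed timings as before, a small recurrence makes the steady state visible; this is an illustrative sketch of the overlap rule, not patent text:

```python
# Finish times per round (ms) under the overlap rule T1^(n+1) || T2^n,
# with assumed stage times. "exch_done" is when a round's exchange finishes.
T1, T2, T3 = 2, 5, 5

copy_done = T1      # D1 is carried in up front; D2 occupies the other buffer
exch_done = 0.0
for n in range(1, 6):
    start = max(copy_done, exch_done)  # need this round's data and the go-ahead
    copy_done = start + T1             # the next round's carry overlaps T2^n
    exch_done = start + T2 + T3        # operate, then output/interact
    print(f"round {n} finishes at t = {exch_done:.0f} ms")
# After round 1, each further round costs only T2 + T3 = 10 ms instead of 12 ms.
```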
  • the carrying of raw data and the interaction of result data or parameters are thus performed in a time-shared manner, so the actual time the coprocessor 100 spends on data communication with peripheral devices (such as other coprocessors or CPUs) is clearly less than the execution time required when data carrying and data interaction occur simultaneously.
  • this does not mean that the present disclosure eliminates T1; T1 still exists, but runs in parallel within T2.
  • T3 according to the present disclosure is also much smaller than T3 in the prior art.
  • FIG. 2 shows a schematic diagram of the executors in the coprocessor according to the present disclosure.
  • each large dashed box represents an executor.
  • only five executors are shown in the executor network component of Figure 2.
  • in practice, the executor network component contains as many executors as the task topology, for example a neural network, has nodes.
  • Fig. 2 schematically shows the composition of each executor of the present disclosure: a message warehouse, a finite state machine, a processing component, and an output data buffer.
  • each executor appears also to contain an input data buffer, drawn with a dotted line; this is in fact an imaginary component, as explained in detail later.
  • each executor in the data processing path, such as the first executor in Figure 2, is established on the basis of a node of the task topology's neural network; its complete node attributes determine its topological relationship with its upstream and downstream executors, its message warehouse, its finite state machine, its processing method (processing component), and the cache location of the data it generates (its output data buffer).
  • suppose the first executor performs data processing whose task requires two input data: the output data of the second executor and of the fourth executor upstream of it.
  • when the second executor generates data to be output to the first executor, for example the second data, the second executor sends a data-ready message to the message warehouse of the first executor, informing the first executor
  • that the second data is already in the output data buffer of the second executor and is available, so that the first executor can read the second data at any time; the second data then remains in a state of waiting to be read by the first executor.
  • the finite state machine of the first executor modifies its state once the message warehouse obtains the second executor's message.
  • likewise, when the fourth executor generates data to be output to the first executor, for example the fourth data, the fourth executor sends a data-ready message to the message warehouse of the first executor, informing the first executor that the fourth data is already in the output data buffer of the fourth executor and is available, so that the first executor can read the fourth data at any time; the fourth data then remains in a state of waiting to be read by the first executor.
  • the finite state machine of the first executor modifies its state once the message warehouse obtains the fourth executor's message.
  • if the processing component of the first executor generated data in its previous execution of the computing task, such as the first data, that data is cached in the first executor's output data buffer, and the first executor sends its downstream executors, such as the third executor and the fifth executor, a message that the first data may be read.
  • when the third executor and the fifth executor have read the first data and used it, each feeds back a message to the first executor informing it that the first data is used up; the first executor's output data buffer is then vacant, and at this point the finite state machine of the first executor again modifies its state.
  • when the execution trigger condition is thus reached, the processing component is notified to read the second data in the output data buffer of the second executor and the fourth data in the output data buffer of the fourth executor and to execute the specified computing task, thereby generating the executor's output data, such as new first data, which is stored in the output data buffer of the first executor.
  • the finite state machine then returns to its initial state to await the next state-change cycle, and the first executor feeds back to the second executor that its use of the second data is complete.
  • the output data buffer of the second executor is thereby left in a vacant state.
  • likewise, the output data buffer of the fourth executor is left in a vacant state.
  • in this way, each executor is like a worker at a fixed-task station on a data processing path, and together they form a data processing pipeline that requires no other external instructions.
  • although each executor shown in Figure 2 appears to contain an input data buffer, in fact it does not: an executor needs no buffer to store the data it is about to use; it only needs that data to be obtainable in a readable state. When the data an executor will use is not in a specific execution state, the data remains stored in the output data buffer of the upstream executor. For visual display, the input data buffer is drawn inside each executor with a dotted line, but it does not actually exist in the executor; in other words, the output data buffer of the upstream executor is the virtual input data buffer of the downstream executor, which is why dotted lines are used for the input data buffers in Figure 2.
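  • the anatomy just described can be condensed into a short sketch; the class below is an illustrative assumption (its names and method signatures are not from the patent), with the finite state machine reduced to two sets of pending conditions and the virtual input buffer realized by reading upstream output buffers directly:

```python
class Executor:
    def __init__(self, upstream, downstream, op):
        self.upstream = upstream        # executors whose output buffers we read
        self.downstream = downstream    # executors that read our output buffer
        self.op = op                    # the processing component
        self.message_warehouse = []     # every message received, in order
        self.ready_inputs = set()       # FSM: upstream data announced as readable
        self.done_consumers = set(downstream)  # FSM: output buffer starts vacant
        self.output_buffer = None       # the only real buffer this executor owns

    def on_message(self, kind, sender):
        self.message_warehouse.append((kind, sender))
        if kind == "data_ready":        # an upstream executor produced data
            self.ready_inputs.add(sender)
        elif kind == "data_consumed":   # a downstream executor finished reading
            self.done_consumers.add(sender)
        self._maybe_fire()

    def _maybe_fire(self):
        # Execution trigger condition: every input readable, output buffer vacant.
        # (A source executor with no upstream would additionally be gated by the
        # arrival of external data; that gating is omitted in this sketch.)
        if (self.ready_inputs == set(self.upstream)
                and self.done_consumers == set(self.downstream)):
            inputs = [u.output_buffer for u in self.upstream]  # virtual input buffer
            self.output_buffer = self.op(*inputs)
            self.ready_inputs.clear()   # FSM returns to its initial state
            self.done_consumers.clear()
            for u in self.upstream:     # upstream buffers may now be vacated
                u.on_message("data_consumed", self)
            for d in self.downstream:   # downstream may now read our new data
                d.on_message("data_ready", self)
```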
  • FIG. 3 shows a timing diagram when the coprocessor according to the present disclosure performs continuous data processing.
  • in the initial state, two data D1 and D2 are stored at one time into the first output data buffer 121 and the second output data buffer 122.
  • the executor component 110 reads the data D1 in the first output data buffer 121 and executes the first round of operations, after which the second executor 130 outputs the result data R1 or exchanges it with other parallel devices.
  • on the feedback message sent by the second executor 130, the executor component 110 reads the data D2 in the second output data buffer 122 and executes the second round of operations to obtain the result data R2, while the first executor 120, on the carry-data message sent by the second executor 130, carries the external data D3 into the first output data buffer 121 during the second round's operation time, for use in the third round of data processing.
  • and so on: on the feedback message sent by the second executor 130, the executor component 110 reads the data Dn in one of the first output data buffer 121 and the second output data buffer 122 and executes the nth round of operations to obtain the result data Rn, while the first executor 120, on the carry-data message sent by the second executor 130, carries the external data Dn+1 into whichever of the two output data buffers is vacant, for use in the (n+1)th round of data processing.
  • alternatively, the first executor 120 may have only one output data buffer, for example the first output data buffer 121.
  • in that case, the executor component 110 reads the data to be processed from the output data buffer 121 of the first executor 120, executes predetermined operation processing to obtain the operation result data, and after obtaining the operation result data feeds back a message to the first executor 120, informing the first executor 120 to place the output data buffer 121 in a vacant state.
  • after obtaining the message from the executor component 110, the second executor 130 outputs the operation result data of the executor component 110 to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output (or after the parameter interaction is performed), feeds back a message to the executor component 110 and sends the first executor 120 a message that data may be carried.
  • after the first executor 120 receives the carry-data message from the second executor 130, it carries the data to be processed into the vacant output data buffer 121 via the communication channel between the coprocessor and the peripheral device. In this way, the coprocessor gives priority to the parameter interaction in the deep-learning system and only then copies new data.
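  • one way to picture this prioritization is a single shared channel that always serves pending parameter interactions before pending copies; the sketch below is an assumed illustration of that policy, not an API from the patent:

```python
import heapq

EXCHANGE, COPY = 0, 1      # lower number = served first

class SharedChannel:
    def __init__(self):
        self._queue = []
        self._seq = 0      # tie-breaker keeps FIFO order within a class

    def submit(self, priority, job):
        heapq.heappush(self._queue, (priority, self._seq, job))
        self._seq += 1

    def drain(self):
        while self._queue:
            _, _, job = heapq.heappop(self._queue)
            job()          # one transfer at a time occupies the bandwidth

channel = SharedChannel()
channel.submit(COPY, lambda: print("copy next input data"))
channel.submit(EXCHANGE, lambda: print("exchange parameters"))
channel.drain()            # prints the exchange first, then the copy
```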
  • Fig. 4 shows a flowchart of a data processing method in a coprocessor according to the present disclosure.
  • in the initial state, the first executor 120 stores the data D to be processed (for example, data D1 and D2) into the first output data buffer 121 and the second output data buffer 122.
  • the executor component 110 then reads data from one of the first output data buffer 121 and the second output data buffer 122.
  • taking the first output data buffer 121 first: the executor component 110 reads data from the first output data buffer 121 and executes the nth round of data operations to obtain the nth result data R.
  • after the executor component 110 obtains the result data R at step S415, it feeds back a message to the first executor 120 so that, at step S420, the first executor 120 places the first output data buffer 121 in a vacant state, i.e. a state into which data can be written.
  • at the same time, the executor component 110 sends a data-exchange message to the second executor 130, informing the second executor 130 that at step S425 it may perform the data exchange: it outputs the nth result data R or performs data interaction with the external CPU or with parallel coprocessors (such as GPUs).
  • after the data exchange is completed, the second executor 130 on the one hand sends a feedback message to the executor component 110, informing it that the next round of data processing may proceed, and on the other hand sends the first executor 120 a message that data may be carried in from outside.
  • at step S430, after obtaining the feedback message from the second executor 130, the executor component 110 executes the (n+1)th round of operation processing on the data D in the second output data buffer 122 and obtains the (n+1)th result data R; while this round proceeds, the first executor 120, at step S435, carries the next data D to be processed from outside into the vacant first output data buffer 121, for use in the (n+2)th round of operation processing.
  • after the executor component 110 obtains the result data R at step S430, it feeds back a message to the first executor 120 so that, at step S440, the first executor 120 places the second output data buffer 122 in a vacant state, i.e. a state into which data can be written.
  • at the same time, the executor component 110 sends a data-exchange message to the second executor 130, informing the second executor 130 that at step S445 it may perform the data exchange: it outputs the (n+1)th result data R or performs data interaction with the external CPU or with parallel coprocessors (such as GPUs).
  • after this data exchange is completed, the second executor 130 again sends a feedback message to the executor component 110, informing it that the next round of data processing may proceed, and sends the first executor 120 a message that data may be carried in from outside.
  • the executor component 110 then executes the (n+2)th round of operation processing on the data D in the first output data buffer 121 and obtains the (n+2)th result data R; while this round proceeds, the first executor 120, at step S455, carries the next data D to be processed from outside into the vacant second output data buffer 122, for use in the (n+3)th round of operation processing.
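  • the loop of Fig. 4 can be condensed as follows; copy_in, compute, and exchange are assumed placeholders for the PCIe transfer, the round's operation, and the result output/interaction, not APIs from the patent, and the single worker thread lets a buffer refill (S435/S455) overlap the next round's compute:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(rounds, copy_in, compute, exchange):
    with ThreadPoolExecutor(max_workers=1) as pool:
        buffers = [copy_in(), copy_in()]      # initial state: both buffers filled
        pending = [None, None]                # in-flight refill per buffer
        for n in range(rounds):
            b = n % 2
            if pending[b] is not None:        # refill finished during an earlier round
                buffers[b] = pending[b].result()
            result = compute(buffers[b])      # S415/S430: round-n operation
            exchange(result)                  # S425/S445: output or parameter exchange
            # S435/S455: only after the exchange does the refill start, so the
            # copy never shares bandwidth with it; it overlaps the next compute.
            pending[b] = pool.submit(copy_in)

# Example use with trivial stand-ins:
run_pipeline(4, copy_in=lambda: "D", compute=lambda d: d + "*", exchange=print)
```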
  • in addition, the first executor may include not only the two output data buffers 121 and 122 but three or more.
  • for example, four output data buffers can be provided for the first executor 120; this can be set according to actual needs.
  • alternatively, the second executor 130 may itself be part of the executor component 110, with the operation processing and the output and interaction of result data executed one after the other; although the present disclosure separates the two for convenience of description, this does not mean that their separate existence is necessary for realizing the present disclosure.
  • when the coprocessor according to the present disclosure, such as a GPU, is used in the field of big data technology and deep learning, the copying of data and the interaction of data are performed in a time-shared manner, so the simultaneous data communication load is reduced at a fixed bandwidth, thereby increasing the data communication speed.
  • specifically, the present disclosure makes full use of the fact that no bandwidth-occupying data interaction takes place during the data operation, moving the next round's data copy forward so that it occupies the bandwidth during the previous round's operation, thereby achieving time-shared bandwidth occupation for data copying and data interaction.
  • the objective of the present disclosure is thereby achieved.
  • although the executor component 110 is described as one component of the present disclosure, it may include only a single executor, or multiple executors forming a data processing path or a data processing network.
  • this specification has described a coprocessor and a method for accelerating its data processing speed according to embodiments of the present disclosure.
  • by configuring two or more output data buffers for the first executor and using messages to control the time segment in which each output data buffer's data is copied, the utilization of the executor component is greatly improved: it only has to wait for a single data-output or interaction interval and never for the copying of pending data, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
  • the purpose of the present disclosure can also be realized by running a program or a group of programs on any computing device.
  • the computing device may be a well-known general-purpose device; the purpose of the present disclosure can therefore also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product likewise constitutes the present disclosure.
  • the storage medium may be any well-known storage medium or any storage medium developed in the future.
  • each component or each step can be decomposed and/or recombined.
  • These decomposition and/or recombination should be regarded as equivalent solutions of the present disclosure.
  • the steps of the above series of processing can naturally be performed chronologically in the order described, but need not be; some steps can be performed in parallel or independently of one another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A coprocessor (100-1, 100-2), comprising an execution body assembly (110), a first execution body (120) provided with at least two output data caches (1221, 1222), and a second execution body (130). The execution body assembly (110) alternatingly reads data requiring processing from one of the at least two output data caches (1221, 1222) of the first execution body (120), executes predetermined operation processing to obtain operation result data, and feeds back a message to the first execution body (120) after obtaining the operation result data so as to inform the first execution body (120) to set said output data cache among the at least two output data caches (1221, 1222) to a vacant state; the second execution body (130) outputs the operation result data of the execution body assembly (110) by means of a communication channel after obtaining the message from the execution body assembly (110), and after the operation result data is outputted, feeds back a message to the execution body assembly (110) and sends to the first execution body (120) a message that data may be transferred; and after the first execution body (120) obtains the message from the second execution body (130) that data may be transferred, while the execution body assembly (110) reads data requiring processing for the other output data cache among the at least two output data caches (1221, 1222) of the first execution body (120) and executes predetermined operation processing to obtain operation result data after obtaining the message fed back by the second execution body (130), the first execution body (120) transfers the data requiring processing to the output data cache among the at least two output data caches (1221, 1222) that is in the vacant state by means of a communication channel between the coprocessor (100-1, 100-2) and a peripheral device.

Description

Coprocessor and data processing acceleration method therefor

Technical Field

The present disclosure relates to a coprocessor and, more specifically, to a method for accelerating data processing in the coprocessor.

Background

Among existing data processors there are, in addition to the CPU, various coprocessors, such as the GPU (Graphics Processing Unit) and the APU, that share the CPU's data processing work. GPUs, for example, are currently used for big-data computation and deep learning because a GPU is in effect a collection of graphics functions implemented in hardware. A GPU has a highly parallel structure, so it processes graphics data and complex algorithms more efficiently than a CPU. When a CPU executes a computing task it processes one piece of data at a time, with no true parallelism, whereas a GPU has multiple processor cores and can process multiple pieces of data in parallel. Compared with a CPU, a GPU devotes more of its hardware to ALUs (Arithmetic Logic Units) for data processing rather than to data caches and flow control. Such a structure is well suited to large-scale data that is uniform in type and mutually independent, and to a clean computing environment that does not need to be interrupted.
Technical Problem

However, when GPUs are applied to deep learning and big-data computation, data parallelism across multiple GPUs produces a large amount of parameter interaction between them. This parameter interaction occupies inter-device or inter-chip interconnect bandwidth such as PCIe or NvLink. At the same time, in big-data computation and deep learning, the GPUs must also continually occupy the inter-chip interconnect bandwidth to copy data. In existing deep-learning systems, parameter exchange and data copying between multiple devices or GPUs proceed simultaneously, and when they do, the two contend for bandwidth. At a fixed bandwidth, this volume of simultaneous data copying and parameter interaction slows the overall rate of data communication, delays parameter interaction among an array of coprocessors such as GPUs, and thus slows the array's overall data processing. In deep learning, parameter interaction matters more: from the standpoint of advancing the overall task, it deserves a higher priority than data copying. Existing deep-learning systems, however, neither consider the priority of parameter interaction over data copying nor propose any solution. Therefore, with inter-chip interconnect bandwidth unchanged, how to raise the rate of inter-device parameter interaction has become an urgent problem in the field of deep learning; and, on the premise of raising that rate, how to increase the data processing speed of coprocessors such as GPUs without impairing data copying is an equally urgent problem in this field.
Technical Solution

The purpose of the present disclosure is to solve at least one of the above problems and to provide at least the advantages described later. According to one aspect of the present disclosure, a coprocessor is provided that includes an executor component, a first executor having at least two output data buffers, and a second executor. The executor component reads the data to be processed from the first executor's at least two output data buffers in turn and executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the corresponding one of the at least two output data buffers in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, while the executor component, having obtained the message fed back by the second executor, reads the data to be processed from the other of the first executor's at least two output data buffers and executes predetermined operation processing to obtain further operation result data, the first executor carries the data to be processed, via the communication channel between the coprocessor and the peripheral device, to whichever of the at least two output data buffers is vacant.
According to the coprocessor of the present disclosure, the number of the at least two output data buffers is three, four, or five.

According to the coprocessor of the present disclosure, the data carried by the first executor is result data output by other coprocessors.

According to the coprocessor of the present disclosure, the executor component includes at least one executor.

According to the coprocessor of the present disclosure, each executor performs its prescribed operation when its own finite state machine reaches the execution trigger condition.
According to another aspect of the present disclosure, a data processing acceleration method is provided for a coprocessor that includes an executor component, a first executor having at least two output data buffers, and a second executor. The method includes: reading, by the executor component, the first data to be processed from the first of the first executor's at least two output data buffers and executing predetermined operation processing to obtain first operation result data, then feeding back a message to the first executor and sending the second executor a message that the output operation may proceed; placing, by the first executor, the first output data buffer storing the first data in a vacant state on receiving the feedback message from the executor component; outputting, by the second executor, the first operation result data of the executor component through the communication channel on receiving the message from the executor component, then feeding back a message to the executor component and sending the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, while the executor component, having obtained the second executor's feedback, reads the second data to be processed from the second of the at least two output data buffers and executes predetermined operation processing to obtain second operation result data, carrying, by the first executor, the data to be processed as new first data, via the communication channel between the coprocessor and the peripheral device, to the vacant first output data buffer. After obtaining the second operation result data, the executor component feeds back a message to the first executor and sends the second executor a message that the output operation may proceed; the first executor, on receiving the feedback message from the executor component, places the second output data buffer storing the second data in a vacant state; the second executor, on receiving the message from the executor component, outputs the second operation result data to the peripheral device through the communication channel between the coprocessor and the peripheral device, then feeds back a message to the executor component and sends the first executor a message that data may be carried; and, on receiving the carry-data message from the second executor, while the first executor carries the data to be processed as new second data, via the communication channel between the coprocessor and the peripheral device, to the vacant second output data buffer, the above steps begin to repeat.

According to the data processing acceleration method for a coprocessor of the present disclosure, the number of the at least two output data buffers is three, four, or five.

According to the data processing acceleration method for a coprocessor of the present disclosure, the data carried by the first executor is result data output by other coprocessors.

According to the data processing acceleration method for a coprocessor of the present disclosure, the executor component includes at least one executor.

According to the data processing acceleration method for a coprocessor of the present disclosure, each executor executes its prescribed operation when its own finite state machine reaches the execution trigger condition.
According to another aspect of the present disclosure, a coprocessor is provided that includes an executor component, a first executor, and a second executor. The executor component reads the data to be processed from the output data buffer of the first executor, performs predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor to inform it to place the output data buffer in a vacant state. After obtaining the message from the executor component, the second executor outputs the operation result data of the executor component to the peripheral device through the communication channel between the coprocessor and the peripheral device, and after the operation result data is output feeds back a message to the executor component and sends the first executor a message that data may be carried. After the first executor obtains the carry-data message from the second executor, it carries the data to be processed to the vacant output data buffer via the communication channel between the coprocessor and the peripheral device.
Beneficial Effects

According to the coprocessor and its data processing acceleration method of the present disclosure, two or more output data buffers are configured for the first executor, and the message the second executor sends the first executor once each interaction completes triggers the carrying of pending data into each output data buffer while the executor component simultaneously performs data operations. This greatly improves the utilization of the executor component: it only has to wait for a single data-output or interaction interval and never for pending data to be carried in, which shortens the executor component's waiting time, raises its time-utilization efficiency, and thereby also improves the efficiency of the coprocessor.
Description of the Drawings

The drawings herein are incorporated into and constitute a part of this specification; they show embodiments consistent with the present disclosure and, together with the specification, serve to explain its principles.

The present disclosure is described in detail below through embodiments with reference to the drawings, in which:

Figure 1 is a schematic structural diagram of a coprocessor according to the present disclosure;

Figure 2 is a schematic diagram of the executors in a coprocessor according to the present disclosure;

Figure 3 is a timing diagram of data processing performed by a coprocessor according to the present disclosure; and

Figure 4 is a flowchart of a data processing method in a coprocessor according to the present disclosure.
本发明的实施方式Embodiments of the invention
下面结合实施例和附图对本公开做进一步的详细说明,以令本领域技术人员参照说明书文字能够据以实施。The present disclosure will be further described in detail below in conjunction with the embodiments and drawings, so that those skilled in the art can implement it with reference to the text of the description.
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。 Here, exemplary embodiments will be described in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。 The terms used in the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more associated listed items.
应当理解,尽管在本公开可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开范围的情况下,在下文中,两个可能位置之一可以被称为第一操作结果也可以被称为第二操作结果,类似地,两个可能位置的另一个可以被称为第二操作结果也可以被称为第一操作结果。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of the present disclosure, in the following, one of the two possible positions may be referred to as the first operation result or the second operation result. Similarly, the other of the two possible positions may be What is called the second operation result may also be called the first operation result. Depending on the context, the word "if" as used herein can be interpreted as "when" or "when" or "in response to determination".
In order to enable those skilled in the art to better understand the present disclosure, the present disclosure is further described in detail below with reference to the accompanying drawings and specific embodiments.
Figure 1 is a schematic structural diagram of a coprocessor according to the present disclosure. As shown in Figure 1, the coprocessor 100-1 includes an executor component 110, a first executor 120, and a second executor 130. The first executor 120 includes at least two output data buffers 121 and 122. Optionally, according to actual application requirements, the first executor 120 may include three or more output data buffers. For convenience of description, only the two output data buffers 121 and 122 are used in the specification of the present disclosure, with the output data buffer 121 referred to as the first output data buffer and the output data buffer 122 referred to as the second output data buffer. Alternatively, the output data buffer 121 may be referred to as the second output data buffer and the output data buffer 122 as the first output data buffer. The first output data buffer 121 and the second output data buffer 122 are used to store the data to be processed, which the first executor 120 carries in from outside via the communication channel PCIe between the coprocessor 100-1 and the peripheral devices.
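To make the component relationships above concrete, here is a minimal C++ sketch of a first executor owning two output data buffers. It is an illustrative reading of the structure, not the disclosed implementation, and every name in it (OutputBuffer, FirstExecutor, copy_in) is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a first executor that owns two output data
// buffers (corresponding to 121 and 122), which the executor
// component reads in turn.
struct OutputBuffer {
    std::vector<float> data;   // data waiting to be processed
    bool vacant = true;        // true once the consumer has used the data
};

struct FirstExecutor {
    std::array<OutputBuffer, 2> out;  // buffers "121" and "122"
    // Carries data in from the host or peer devices over PCIe
    // into a vacant buffer slot.
    void copy_in(std::size_t slot, const std::vector<float>& d) {
        out[slot].data = d;
        out[slot].vacant = false;
    }
};

int main() {
    FirstExecutor fe;
    fe.copy_in(0, {1.0f, 2.0f});  // fill buffer "121" (slot 0)
    fe.copy_in(1, {3.0f, 4.0f});  // fill buffer "122" (slot 1)
    return 0;
}
```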
The coprocessor 100-1 according to the present disclosure is connected to other coprocessors of the same kind 100-2, 100-3, ..., 100-N through inter-device or inter-chip interconnection bandwidth, for example PCIe or NVLink, and these coprocessors can process in parallel the computing tasks distributed by the CPU of the host. During data processing, the coprocessors 100-1, 100-2, 100-3, ..., 100-N therefore exchange a large amount of operation result data and/or parameters with one another, and must repeatedly carry data in from outside via an inter-chip communication channel such as PCIe. A parameter is itself a kind of operation result data. Accordingly, a round of data processing in the coprocessor 100 generally comprises three stages: a stage of copying (COPY) the data to be processed from outside, a data operation processing stage (PROCESS), and a stage of outputting the result data to the outside or exchanging parameters with the outside (EXCHANGE). The times spent in these three stages are denoted T₁, T₂, and T₃, respectively.
Conventionally, the operation time accounts for the larger part of the data processing process, while the time spent on copying and transmitting data is, by comparison, very short. To increase the data processing speed, the usual approach has therefore been to raise the speed at which the operation units execute operations, i.e., the computation speed, thereby reducing the operation time and hence the overall processing time. However, as the computing performance of executor components has improved dramatically, the time spent on operation processing has become an ever smaller fraction of a round of data processing and is now roughly comparable to the time spent on data carrying and data communication. It has thus become very difficult to shorten the data processing time by further raising the computation speed. For example, with the copy time T₁, the operation time T₂, and the result-data output or parameter-exchange time T₃ mentioned above, the total time T of one round of data processing in the coprocessor 100 is generally T = T₁ + T₂ + T₃. Although the copy time T₁ and the parameter-exchange time T₃ are described separately here, in a conventional deep learning system T₁ + T₃ is in practice a single whole. In other words, in conventional deep learning systems, data copying and parameter exchange proceed simultaneously and therefore contend for the inter-chip interconnection bandwidth, so that for a given bandwidth the total time spent is essentially no shorter than the sum of the times that bandwidth would need to handle the data copying and the parameter exchange separately. For example, for a given bandwidth, processing the data copy alone may take 2 milliseconds and the parameter exchange alone 5 milliseconds; with the same inter-chip bandwidth unchanged, performing the same amount of data copying and the same amount of parameter exchange simultaneously will usually take a total of 7 milliseconds or more. T₁ and T₃ are therefore described separately here only for ease of understanding.
Therefore, when the copy processing proceeds simultaneously with the output or exchange of result data, the two operations occupy the fixed-bandwidth PCIe communication channel at the same time, and because more data is then in flight, the total time T₁₃ spent on data communication may be longer than T₁ + T₃. Since raising the computation speed is the harder path, the inventors of the present disclosure instead seek to shorten the total of the data-carrying time and the data output or exchange time spent in each round of operation.
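As a worked illustration of the timing argument above (the 2 ms copy and 5 ms exchange figures are taken from the preceding paragraph, while the 10 ms operation time is an assumed, purely illustrative value):

```latex
\begin{aligned}
T &= T_1 + T_2 + T_3 = 2 + 10 + 5 = 17\ \text{ms per round (copy, operate, exchange in sequence)}\\
T_{13} &\geq T_1 + T_3 = 2 + 5 = 7\ \text{ms (copy and exchange contending for the same PCIe channel)}
\end{aligned}
```

Hiding T₁ inside the T₂ of the previous round can therefore remove at most the full 2 ms from each round, which is precisely what the buffering scheme described below sets out to do.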
To this end, the present disclosure makes it possible to save part of the total time T of each round of operation by providing the first executor 120 with the first output data buffer 121 and the second output data buffer 122. As shown in Figure 1, the first output data buffer 121 and the second output data buffer 122 store the data D to be processed. The executor component 110 reads the data D to be processed from the first output data buffer 121 and the second output data buffer 122 in turn and executes the predetermined operation processing. Specifically, the executor component 110 first reads the first data D1 stored in the first output data buffer 121 and executes the predetermined first round of operation processing to obtain the first round of operation result data R1. The time spent here is T₂¹, where the superscript identifies the operation round and the subscript the stage within a round. As described above, each round of operation is divided into three stages: the first stage takes the copy time T₁, the second stage the operation time T₂, and the third stage the result-data processing time T₃. After the executor component 110 has finished operating on the first data D1, it feeds back a message to the first executor 120, informing the first executor 120 that the executor component 110 is done using the first data D1. Upon receiving the feedback message, the finite state machine of the first executor 120 modifies its state and places the first output data buffer 121 in a vacant state, thereby preparing storage space for the data required by a following round of operation (except in the initial state).
The second executor 130 outputs the first round of operation result data via the PCIe communication channel so as to exchange it with the other coprocessors 100 or with the CPU of the host. The time taken by the second executor 130 for the output or exchange of the result data R1 is T₃, and this must follow the operation time T₂; it is denoted, for example, T₃¹, where the superscript represents the round of data processing and the subscript the stage within a round.
After the second executor 130 completes the output of the operation result data, it immediately feeds back a message to the executor component 110, and the executor component 110 can start the next round of operation immediately on the basis of the received feedback message. At the same time, the second executor 130 immediately sends a message to the first executor 120, so that the finite state machine of the first executor 120 reaches the trigger condition for data carrying; while the executor component 110 performs the next round of operation, the first executor 120 thus carries the data to be processed via the communication channel PCIe into the vacant first output data buffer 121, preparing data for the round following the second round of operation (for example, the third round).
Specifically, immediately after the first-round result data R1 has been output or exchanged, the second executor 130 feeds back to the executor component 110 a message that the second round of operation may be executed; upon receiving this feedback message, the executor component 110 reads the data D2 to be processed from the second output data buffer 122 and performs the second round of operation. The time taken by the executor component 110 to perform the second round of operation is T₂². At the same time, the second executor 130 also sends the first executor 120 a message that data may be carried, informing the first executor 120 that the data-carrying operation may be performed immediately. When the first executor 120 receives the carry-permitted message from the second executor 130, its finite state machine modifies its state, triggering the first executor 120 to perform the data-carrying operation, whereby the data D3 to be processed is carried via the communication channel PCIe into the vacated first output data buffer 121 for storage, so as to prepare data for the operation of the executor component 110 after the second round (for example, the third round).
From the above description it can be seen that the time taken by the first executor 120 to carry the data D3 is T₁³. Thus, while the second round of data operation is in progress, the data needed for the third round is copied, so there is no need to wait, before the third round of operation starts, for the required data to be copied into free storage space. The copy time T₁³ is thereby saved for the third round, shortening the actual data processing time T of the third round from the conventional T₁³ + T₂³ + T₃³ to T₂³ + T₃³.
Likewise, after the second round of operation ends, that is, after the executor component 110 has finished operating on the second data D2, the executor component 110 feeds back a message to the first executor 120, informing it that the use of the second data D2 is complete. Upon receiving this feedback message, the first executor 120 modifies the state of its finite state machine and places the second output data buffer 122 in a vacant state, so as to prepare storage space for the data required by the fourth round of operation. After obtaining the second operation result data R2, the second executor 130 starts the result-data exchange of the second round of operation. When this exchange ends, the second executor 130 sends a feedback message to the executor component 110 and a carry-permitted message to the first executor 120, so that the executor component 110 immediately performs the third round of operation during the period T₂³ while, within the same period T₂³, the first executor 120 carries the data D4 required by the fourth round of operation into the second output data buffer 122. The time taken by this carrying is T₁⁴.
As described above, although the data-carrying time T₁³ required for the third round of operation is actually incurred, it coincides with T₂². The actual data processing time T of the third round of operation is therefore shortened from the conventional T₁³ + T₂³ + T₃³ to T₂³ + T₃³. By analogy, T₁⁴ is parallel to T₂³, T₁⁵ is parallel to T₂⁴, ..., and T₁ⁿ⁺¹ is parallel to T₂ⁿ. Moreover, in this way the carrying of raw data and the exchange of result data or parameters take place in a time-shared manner, so that when the coprocessor 100 communicates with peripheral devices (for example, other coprocessors or the CPU), the actual execution time of this time-shared communication is markedly less than the execution time required when data carrying and data exchange proceed simultaneously. Specifically, on the one hand, compared with the prior art, T₁ is effectively eliminated in the present disclosure (it still occurs, but in parallel within T₂); on the other hand, T₃ according to the present disclosure is also much smaller than T₃ in the prior art.
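The steady-state effect of this overlap can be checked with a small simulation. The following is a minimal C++ sketch, not the disclosed implementation: the stage durations reuse the assumed values from the earlier example, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>

// Sketch of the double-buffered schedule: the copy for round n+1
// (time t1) overlaps the operation stage of round n (time t2), so a
// steady-state round costs max(t2, t1) + t3 instead of t1 + t2 + t3.
int main() {
    const double t1 = 2.0, t2 = 10.0, t3 = 5.0;  // assumed ms per stage
    double busy_until = t1;  // the very first copy cannot be hidden
    for (int round = 1; round <= 5; ++round) {
        double start = busy_until;
        busy_until = start + std::max(t2, t1) + t3;
        std::printf("round %d: PROCESS+EXCHANGE from %.1f to %.1f ms\n",
                    round, start, busy_until);
    }
    return 0;
}
```

Under the assumption T₁ ≤ T₂, the copy is fully hidden and each steady-state round costs T₂ + T₃; only if T₁ exceeded T₂ would part of the copy time reappear in the schedule.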
Figure 2 is a schematic diagram illustrating the principle of an executor in the coprocessor according to the present disclosure. As shown in Figure 2, each large dashed box represents one executor. In the executor network component shown in Figure 2, only five executors are shown; in fact, corresponding to the task topology graph, the executor network component contains as many executors as the neural network has nodes. Figure 2 schematically shows the composition of each executor of the present disclosure, which comprises a message bin, a finite state machine, a processing component, and an output data buffer. As can be seen from Figure 2, each executor appears to contain an input data buffer as well, but it is drawn with a dashed line; it is in fact an imaginary component, as will be explained in detail later. Each executor in the data processing path, for example the first executor in Figure 2, is established on the basis of one node of a neural network in the form of a task topology graph and, on the basis of the full node attributes, acquires its topological relationship with its upstream and downstream executors, its message bin, its finite state machine, its processing method (processing component), and the cache location of the data it generates (output data buffer). Specifically, when the first executor performs data processing, its task requires, for example, two input data, namely the output data of the second executor and of the fourth executor upstream of it. When the second executor generates data to be output to the first executor, for example the second data, the second executor sends a data-ready message to the message bin of the first executor, informing the first executor that the second data is already in the output data buffer of the second executor and is available, so that the first executor can read the second data at any time; the second data then remains waiting to be read by the first executor. The finite state machine of the first executor modifies its state after the message bin receives the message from the second executor. Likewise, when the fourth executor generates data to be output to the first executor, for example the fourth data, the fourth executor sends a data-ready message to the message bin of the first executor, informing the first executor that the fourth data is already in the output data buffer of the fourth executor and is available, so that the first executor can read the fourth data at any time; the fourth data then remains waiting to be read by the first executor. The finite state machine of the first executor modifies its state after the message bin receives the message from the fourth executor. Likewise, if the processing component of the first executor generated data, for example the first data, when it last executed its computing task, that data is cached in its output data buffer, and the first executor sends its downstream executors, for example the third executor and the fifth executor, a message that the first data can be read.
After the third executor and the fifth executor have read and finished using the first data, each feeds back a message to the first executor informing it that the first data has been used, whereupon the output data buffer of the first executor becomes vacant. At this point the finite state machine of the first executor again modifies its state.
In this way, when the state changes of the finite state machine reach a predetermined state, for example when the input data required by the first executor for its operation (for example, the second data and the fourth data) are all available and its output data buffer is vacant, the processing component is told to read the second data from the output data buffer of the second executor and the fourth data from the output data buffer of the fourth executor and to execute the specified computing task, thereby generating the executor's output data, for example new first data, which is stored in the output data buffer of the first executor.
Likewise, when the first executor completes the specified computing task, the finite state machine returns to its initial state and waits for the next state-change cycle; at the same time, the first executor feeds back to the message bin of the second executor a message that the use of the second data is complete, feeds back to the message bin of the fourth executor a message that the use of the fourth data is complete, and sends its downstream executors, for example the third executor and the fifth executor, a message that the first data has been generated, informing them that the first data is now in a readable state.
When the second executor receives the message that the first executor has finished using the second data, the output data buffer of the second executor is placed in a vacant state. Likewise, when the fourth executor receives the message that the first executor has finished using the fourth data, the output data buffer of the fourth executor is placed in a vacant state.
The above process of task execution by the first executor likewise takes place in the other executors. Thus, under the control of the finite state machine in each executor, tasks of the same kind are processed cyclically on the basis of the output results of the upstream executors. Each executor thereby acts like a worker with a fixed task at a post along a data processing path, forming a pipeline for data processing without the need for any other external instructions.
As mentioned above, each executor in Figure 2 is shown as containing an input data buffer, but in fact it contains none, because an executor does not need any buffer of its own to store the data it will use; it only needs the data it requires to be in a readable state. The data each executor will use therefore remains stored in the output data buffer of its upstream executor for as long as the executor is not actually executing. For visual clarity, the input data buffer in each executor is thus drawn with a dashed line; it does not really exist in the executor. In other words, the output data buffer of an upstream executor is the virtual input data buffer of its downstream executor. This is why the input data buffers in Figure 2 are marked with dashed lines.
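The message-driven behavior described above can be summarized in a minimal, single-threaded C++ sketch. It is a simplification, not the disclosed implementation: the finite state machine is reduced to two readiness flags, there is one input instead of two, and all names and the doubling "task" are hypothetical.

```cpp
#include <cstdio>
#include <queue>

// Hypothetical sketch of one executor: a message bin, a finite state
// machine reduced to two readiness flags, a processing step, and an
// output buffer that stays full until downstream consumers release it.
enum class Msg { InputReady, OutputReleased };

struct Executor {
    std::queue<Msg> message_bin;
    bool input_ready = false;    // upstream data is readable
    bool output_vacant = true;   // our last result has been consumed
    int output_buffer = 0;       // generated data lives here

    void post(Msg m) { message_bin.push(m); }

    // Drain the message bin; fire the processing component only when
    // the trigger condition of the finite state machine is met.
    void step(int upstream_value) {
        while (!message_bin.empty()) {
            Msg m = message_bin.front(); message_bin.pop();
            if (m == Msg::InputReady)     input_ready = true;
            if (m == Msg::OutputReleased) output_vacant = true;
        }
        if (input_ready && output_vacant) {
            output_buffer = upstream_value * 2;  // the "specified task"
            input_ready = false;   // would notify upstream: data used
            output_vacant = false; // would notify downstream: data ready
            std::printf("produced %d\n", output_buffer);
        }
    }
};

int main() {
    Executor e;
    e.post(Msg::InputReady);
    e.step(21);  // fires: input ready and output buffer vacant
    e.post(Msg::InputReady);
    e.step(10);  // does not fire: output not yet released downstream
    e.post(Msg::OutputReleased);
    e.step(10);  // fires again
    return 0;
}
```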
Figure 3 is a timing diagram of continuous data processing performed by the coprocessor according to the present disclosure. As shown in Figure 3, in the initial state two data D₁ and D₂ are stored at once into the first output data buffer 121 and the second output data buffer 122. After execution starts, in the first round of data processing step L₁ the executor component 110 reads the data D₁ in the first output data buffer 121 and performs the first round of operation, after which the second executor 130 outputs the result data R1 or exchanges it with other parallel devices. Immediately afterwards, in the second round of data processing step L₂, the executor component 110, upon the feedback message sent by the second executor 130, reads the data D₂ in the second output data buffer 122 and performs the second round of operation to obtain the result data R2, while the first executor 120, upon the carry-permitted message sent by the second executor 130, carries the external data D₃ into the first output data buffer 121 for use in the third round of data processing. The cycle continues in this way: in the n-th round of data processing step Lₙ, the executor component 110, upon the feedback message sent by the second executor 130, reads the data D from one of the first output data buffer 121 and the second output data buffer 122 and performs the n-th round of operation to obtain the result data R, while the first executor 120, upon the carry-permitted message sent by the second executor 130, carries the external data Dₙ₊₁ into whichever of the first output data buffer 121 and the second output data buffer 122 is vacant, for use in the (n+1)-th round of data processing.
Although the specific content of the present disclosure has been described above with reference to Figures 1-3, an alternative that gives priority to parameter exchange may also be provided, in which the first executor 120 has only one output data buffer, for example the first output data buffer 121. Specifically, the executor component 110 reads the data to be processed from the output data buffer 121 of the first executor 120, executes the predetermined operation processing to obtain the operation result data, and after obtaining the operation result data feeds back a message to the first executor 120 instructing it to place the output data buffer 121 in a vacant state. The second executor 130, upon receiving the message from the executor component 110, outputs the operation result data of the executor component 110 to the peripheral devices through the communication channel between the coprocessor and the peripheral devices, and after the operation result data has been output (or after the parameter exchange has completed) feeds back a message to the executor component 110 and sends the first executor 120 a carry-permitted message. Upon receiving the carry-permitted message from the second executor 130, the first executor 120 carries the data to be processed via the communication channel between the coprocessor and the peripheral devices into the vacant output data buffer 121. In this way, in a deep learning system the coprocessor gives priority to parameter exchange and performs the copying of new data afterwards.
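A minimal C++ sketch of this single-buffer alternative follows; the three stages are plain placeholder calls and all names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical sketch of the single-buffer variant: the exchange
// always completes before the next copy starts, so the two never
// share the PCIe channel, at the cost of not hiding the copy time.
void copy_in(int round)  { std::printf("COPY     D%d\n", round); }
void process(int round)  { std::printf("PROCESS  D%d -> R%d\n", round, round); }
void exchange(int round) { std::printf("EXCHANGE R%d\n", round); }

int main() {
    for (int round = 1; round <= 3; ++round) {
        copy_in(round);    // buffer 121 is vacant here
        process(round);    // the executor component runs the operation
        exchange(round);   // the second executor outputs / exchanges params
    }
    return 0;
}
```

The point of the variant is ordering rather than overlap: exchange and copy are strictly serialized on the channel, so they never contend for bandwidth, but T₁ remains on the critical path of every round.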
Figure 4 is a flowchart of a data processing method in the coprocessor according to the present disclosure. As shown in Figure 4, at step S410 the coprocessor is in the initial state, and the first executor 120 stores at once the data D to be processed (for example, data D₁ and D₂) into the first output data buffer 121 and the second output data buffer 122. Alternatively, data may at first be stored in only one of the output data buffers while the other remains vacant. Subsequently, at step S415, the executor component 110 reads data from one of the first output data buffer 121 and the second output data buffer 122. In the initial state, which of the first output data buffer 121 and the second output data buffer 122 is read first for the operation has no bearing on the implementation of the technical solution of the present disclosure. For convenience of presentation, the first output data buffer 121 is described first: the executor component 110 reads data from the first output data buffer 121 and performs the n-th round of data operation, obtaining the n-th result data R. After obtaining the result data R at step S415, the executor component 110, on the one hand, feeds back a message to the first executor 120 so that at step S420 the first executor 120 places the first output data buffer 121 in a state in which data can be written to it; on the other hand, the executor component 110 simultaneously sends the second executor 130 a message that data exchange may proceed, informing the second executor 130 that at step S425 it may output the n-th result data R or exchange data with an external CPU or a parallel coprocessor (for example, a GPU). After step S425 ends, the second executor 130, on the one hand, sends a feedback message to the executor component 110 informing it that the next round of data processing may proceed and, on the other hand, sends the first executor 120 a message that data may be carried in from outside. Subsequently, at step S430, the executor component 110, having received the feedback message from the second executor 130, performs the (n+1)-th round of operation processing on the data D in the second output data buffer 122 and obtains the (n+1)-th result data R. While the (n+1)-th round of operation processing is in progress, the first executor 120 at step S435 carries the next data D to be processed from outside into the vacant first output data buffer 121, in readiness for the (n+2)-th round of operation processing.
After obtaining the result data R at step S430, the executor component 110, on the one hand, feeds back a message to the first executor 120 so that at step S440 the first executor 120 places the second output data buffer 122 in a state in which data can be written to it; on the other hand, the executor component 110 simultaneously sends the second executor 130 a message that data exchange may proceed, informing the second executor 130 that at step S445 it may output the (n+1)-th result data R or exchange data with an external CPU or a parallel coprocessor (for example, a GPU).
After step S445 ends, the second executor 130, on the one hand, sends a feedback message to the executor component 110 informing it that the next round of data processing may proceed and, on the other hand, sends the first executor 120 a message that data may be carried in from outside. Subsequently, at step S450, the executor component 110, having received the feedback message from the second executor 130, performs the (n+2)-th round of operation processing on the data D in the first output data buffer 121 and obtains the (n+2)-th result data R. While the (n+2)-th round of operation processing is in progress, the first executor 120 at step S455 carries the next data D to be processed from outside into the vacant second output data buffer 122, in readiness for the (n+3)-th round of operation processing.
Subsequently, at step S460, the counter is adjusted by n = n + 2, and the method returns to step S420 to repeat steps S420-S460.
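The flow of steps S410-S460 amounts to a ping-pong loop over the two buffers. Below is a minimal C++ sketch of that control flow, under the simplifying assumption that the compute, exchange, and copy actions are synchronous placeholder calls rather than concurrently running executors; all names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical placeholders for the actions in Figure 4.
int  process(int buffer, int n) { std::printf("round %d: PROCESS buffer %d\n", n, buffer); return n; }
void exchange(int r)            { std::printf("round %d: EXCHANGE R%d\n", r, r); }
void copy_in(int buffer, int n) { std::printf("         COPY D%d -> buffer %d\n", n, buffer); }

int main() {
    const int buf[2] = {121, 122};
    // S410: prefill both buffers with D1 and D2
    copy_in(buf[0], 1);
    copy_in(buf[1], 2);
    for (int n = 1; n <= 6; ++n) {
        int cur = buf[(n - 1) % 2];  // the buffer holding D_n
        int r = process(cur, n);     // S415/S430/S450: round n operation
        exchange(r);                 // S425/S445: output or exchange R_n
        copy_in(cur, n + 2);         // S435/S455: refill the vacated buffer
        // (in the disclosed scheme this copy overlaps round n+1's PROCESS)
    }
    return 0;
}
```

The sketch only fixes the order in which the buffers are consumed and refilled; in the disclosed scheme, the final copy of each iteration runs in parallel with the process stage of the next iteration rather than after the exchange.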
Although the coprocessor of the present disclosure has been described in detail above, it should be pointed out that the first executor may include not merely the two output data buffers 121 and 122 but three or more. For example, when the executor component 110 needs to read more than two data in one round of operation, four output data buffers may be provided for the first executor 120. This can be configured according to actual needs.
Although the executor component 110 and the second executor 130 are described as two independent entities in the description of the present disclosure, the second executor 130 may alternatively itself be part of the executor component 110, with the operation processing performed by the one and the output and exchange of result data performed by the other executed in immediate succession as a whole. Therefore, although the present disclosure separates the two for convenience of description, this does not mean that their separate existence is an arrangement necessary for realizing the present disclosure.
When a coprocessor according to the present disclosure, for example a GPU, is used in the fields of big data and deep learning, there is a great deal of data carrying and data exchange; with the communication bandwidth between the GPU and the outside fixed, it is therefore very important to reduce the amount of data in flight at any moment so as to raise the effective data transmission speed. When a coprocessor according to the present disclosure is employed in big data computation and deep learning, the copying of data and the exchange of data take place in a time-shared manner, so that for a fixed bandwidth the instantaneous communication load is reduced and the speed of data communication is correspondingly increased. At the same time, because the copying of data takes place while the previous round of operation is in progress, the present disclosure makes full use of the fact that no data exchange occupies the bandwidth during the operation stage, moving the copy for the next round of data forward into the operation period of the previous round, thereby achieving time-shared occupation of the bandwidth by data copying and data exchange. The object of the present disclosure is thus finally achieved.
More importantly, in order to achieve the object of the present disclosure, two or more output data buffers are provided; in combination with the control process described above, this realizes streaming processing of data and greatly improves the data processing speed.
Although the executor component 110 is referred to in the present disclosure as a component, it may contain only a single executor, or multiple executors may make up a data processing path or a data processing network.
Thus far, this specification has described a coprocessor and a method for accelerating data processing in the coprocessor according to embodiments of the present disclosure. By configuring two or more output data buffers for one executor component and controlling, by means of control signals, the time period in which each output data buffer copies data, the coprocessor and method according to the present disclosure can greatly improve the utilization of the executor component: the executor component only needs to wait for one data output or exchange interval and never has to wait for the copying of the data to be processed, which shortens the executor component's waiting time, raises its utilization of time, and thereby also raises the efficiency of the coprocessor.
The basic principles of the present disclosure have been described above in conjunction with specific embodiments. It should be pointed out, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and apparatus of the present disclosure can be implemented in hardware, firmware, software, or a combination thereof in any computing device (including processors, storage media, and the like) or network of computing devices, which those of ordinary skill in the art can accomplish with their basic programming skills after reading the description of the present disclosure.
Therefore, the object of the present disclosure can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present disclosure can therefore also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is to say, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. Obviously, the storage medium may be any well-known storage medium or any storage medium developed in the future.
It should also be pointed out that, in the apparatus and method of the present disclosure, the components or steps can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure. Moreover, the steps of the series of processing described above may naturally be performed chronologically in the order described, but need not necessarily be; some steps may be performed in parallel or independently of one another.
The specific implementations described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (10)

  1. A coprocessor, comprising an executor component, a first executor having at least two output data buffers, and a second executor, wherein:
    the executor component reads data to be processed, in turn, from one of the at least two output data buffers of the first executor and executes predetermined operation processing to obtain operation result data, and after obtaining the operation result data feeds back a message to the first executor so as to inform the first executor to place said one of the at least two output data buffers in a vacant state;
    the second executor, after obtaining a message from the executor component, outputs the operation result data of the executor component to a peripheral device through a communication channel between the coprocessor and the peripheral device, and after the operation result data has been output, feeds back a message to the executor component and sends the first executor a message that data may be carried; and
    the first executor, after obtaining the carry-permitted message from the second executor, carries data to be processed via the communication channel between the coprocessor and the peripheral device into whichever of the at least two output data buffers is vacant, while the executor component, after obtaining the message fed back by the second executor, reads data to be processed from the other of the at least two output data buffers of the first executor and executes the predetermined operation processing to obtain further operation result data.
  2. The coprocessor according to claim 1, wherein the number of the at least two output data buffers is three, four, or five.
  3. The coprocessor according to claim 1, wherein the data carried by the first executor is operation result data output by other coprocessors.
  4. The coprocessor according to claim 1, wherein the executor component comprises at least one executor.
  5. The coprocessor according to any one of claims 1-4, wherein each executor performs its prescribed operation when its own finite state machine reaches an execution trigger condition.
  6. A data processing acceleration method for a coprocessor, the coprocessor comprising an executor component, a first executor having at least two output data buffers, and a second executor, the method comprising:
    reading, by the executor component, first data to be processed from a first output data buffer of the at least two output data buffers of the first executor, executing predetermined operation processing to obtain first operation result data, and, after the first operation result data is obtained, feeding back a message to the first executor and sending the second executor a message that an operation may be performed;
    placing, by the first executor after it obtains the feedback message from the executor component, the first output data buffer storing the first data in a vacant state;
    outputting, by the second executor after it obtains the message from the executor component, the first operation result data of the executor component to a peripheral device through a communication channel between the coprocessor and the peripheral device, and, after the first operation result data has been output, feeding back a message to the executor component and sending the first executor a message that data may be carried;
    after the carry-permitted message from the second executor is obtained, carrying, by the first executor via the communication channel between the coprocessor and the peripheral device, data to be processed as new first data into the vacant first output data buffer of the at least two output data buffers, while the executor component, having obtained the message fed back by the second executor, reads second data to be processed from a second output data buffer of the at least two output data buffers of the first executor and executes predetermined operation processing to obtain second operation result data;
    feeding back, by the executor component after it obtains the second operation result data, a message to the first executor and sending the second executor a message that an operation may be performed;
    placing, by the first executor after it obtains the feedback message from the executor component, the second output data buffer storing the second data in a vacant state;
    outputting, by the second executor after it obtains the message from the executor component, the second operation result data of the executor component through the communication channel, and, after the second operation result data has been output, feeding back a message to the executor component and sending the first executor a message that data may be carried; and
    after the carry-permitted message from the second executor is obtained, while the first executor carries data to be processed as new second data via the communication channel between the coprocessor and the peripheral device into the vacant second output data buffer of the at least two output data buffers, beginning to repeat the above steps.
  7. The data processing acceleration method for a coprocessor according to claim 6, wherein the number of the at least two output data buffers is three, four, or five.
  8. The data processing acceleration method for a coprocessor according to claim 6, wherein the data carried by the first executor is operation result data output by other coprocessors.
  9. The data processing acceleration method for a coprocessor according to claim 6, wherein the executor component comprises at least one executor.
  10. The data processing acceleration method for a coprocessor according to any one of claims 6-9, wherein each executor performs its prescribed operation when its own finite state machine reaches an execution trigger condition.
PCT/CN2020/093840 2019-07-15 2020-06-02 Coprocessor and data processing acceleration method therefor WO2021008257A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910633024.3A CN110188067B (en) 2019-07-15 2019-07-15 Coprocessor and data processing acceleration method thereof
CN201910633024.3 2019-07-15

Publications (1)

Publication Number Publication Date
WO2021008257A1 (en) 2021-01-21

Family

ID=67725699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093840 WO2021008257A1 (en) 2019-07-15 2020-06-02 Coprocessor and data processing acceleration method therefor

Country Status (2)

Country Link
CN (1) CN110188067B (en)
WO (1) WO2021008257A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188067B (en) * 2019-07-15 2023-04-25 北京一流科技有限公司 Coprocessor and data processing acceleration method thereof
CN112785483B (en) * 2019-11-07 2024-01-05 深南电路股份有限公司 Method and equipment for accelerating data processing
CN110955529B (en) * 2020-02-13 2020-10-02 北京一流科技有限公司 Memory resource static deployment system and method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011034189A (en) * 2009-07-30 2011-02-17 Renesas Electronics Corp Stream processor and task management method thereof
CN102542525B (en) * 2010-12-13 2014-02-12 联想(北京)有限公司 Information processing equipment and information processing method
CN107544937A (en) * 2016-06-27 2018-01-05 深圳市中兴微电子技术有限公司 A kind of coprocessor, method for writing data and processor

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130132771A1 (en) * 2011-11-18 2013-05-23 Nokia Corporation Method and apparatus for providing information consistency in distributed computing environments
US20140310259A1 (en) * 2013-04-15 2014-10-16 Vmware, Inc. Dynamic Load Balancing During Distributed Query Processing Using Query Operator Motion
CN107783721A (en) * 2016-08-25 2018-03-09 华为技术有限公司 The processing method and physical machine of a kind of data
CN107908471A (en) * 2017-09-26 2018-04-13 聚好看科技股份有限公司 A kind of tasks in parallel processing method and processing system
CN108404415A (en) * 2018-03-22 2018-08-17 网易(杭州)网络有限公司 The treating method and apparatus of data
CN110188067A (en) * 2019-07-15 2019-08-30 北京一流科技有限公司 Coprocessor and its data processing accelerated method

Also Published As

Publication number Publication date
CN110188067A (en) 2019-08-30
CN110188067B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US11782710B2 (en) Execution or write mask generation for data selection in a multi-threaded, self-scheduling reconfigurable computing fabric
WO2021008257A1 (en) Coprocessor and data processing acceleration method therefor
JP7426979B2 (en) host proxy on gateway
US11675734B2 (en) Loop thread order execution control of a multi-threaded, self-scheduling reconfigurable computing fabric
US20210243080A1 (en) Efficient Loop Execution for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric
WO2021008259A1 (en) Data processing system for heterogeneous architecture and method therefor
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
US11586571B2 (en) Multi-threaded, self-scheduling reconfigurable computing fabric
WO2021008260A1 (en) Data executor and data processing method thereof
WO2021008258A1 (en) Data flow acceleration member in data processing path of coprocessor and method thereof
JP2005235228A (en) Method and apparatus for task management in multiprocessor system
CN106844017A (en) The method and apparatus that event is processed for Website server
WO2020163315A1 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
TW200402657A (en) Registers for data transfers
JP2022545697A (en) sync network
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
US8631086B2 (en) Preventing messaging queue deadlocks in a DMA environment
CN112639738A (en) Data passing through gateway
CN107180010A (en) Heterogeneous computing system and method
WO2021159926A1 (en) Static deployment system and method for memory resources
CN110347450B (en) Multi-stream parallel control system and method thereof
CN110245024B (en) Dynamic allocation system and method for static storage blocks
WO2016008317A1 (en) Data processing method and central node
CN111475205B (en) Coarse-grained reconfigurable array structure design method based on data flow decoupling
JP7406539B2 (en) streaming engine

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20840250; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20840250; Country of ref document: EP; Kind code of ref document: A1)