CN112817638A - Data processing device and method - Google Patents

Publication number
CN112817638A
CN112817638A CN201911128755.9A CN201911128755A CN112817638A CN 112817638 A CN112817638 A CN 112817638A CN 201911128755 A CN201911128755 A CN 201911128755A CN 112817638 A CN112817638 A CN 112817638A
Authority
CN
China
Prior art keywords
calculation
output register
module
calculation result
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911128755.9A
Other languages
Chinese (zh)
Inventor
不公告发明人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simm Computing Technology Co ltd
Original Assignee
Beijing Simm Computing Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a data processing device and method. The data processing device comprises: a processing module (10) for sending instructions, the instructions comprising calculation instructions and transmission instructions; a calculation module (20) for performing a calculation based on the calculation instruction sent by the processing module and generating a calculation result; and a Tx transmitter (30) for transmitting, based on the transmission instruction, the calculation result generated by the calculation module. In the data processing device provided by the embodiment of the invention, the processing module sends the calculation instruction and the transmission instruction, and the Tx transmitter, which is connected with the calculation module, sends the calculation result generated by the calculation module. The processing module therefore does not need to participate at all times in the calculation process or in the transmission of the calculation result and can process other data in the meantime, which improves the efficiency of the processing module.

Description

Data processing device and method
Technical Field
The present invention relates to the field of processing core structures, and in particular, to a data processing apparatus and method.
Background
With the development of science and technology, the human society is rapidly entering the intelligent era. The important characteristics of the intelligent era are that people obtain more and more data, the quantity of the obtained data is larger and larger, and the requirement on the speed of processing the data is higher and higher.
Chips are the cornerstone of data processing, which fundamentally determines the ability of people to process data. From the application field, the chip mainly has two routes: one is a general chip route, such as a Central Processing Unit (CPU), which provides great flexibility but is less computationally efficient in Processing domain-specific algorithms; the other is a special chip route, such as a Tensor Processing Unit (TPU), which can exert higher effective computing power in some specific fields, but has poorer or even no Processing capability in the more versatile and general fields.
Because the data of the intelligent era is diverse and huge in quantity, chips are required to be extremely flexible, so that they can process algorithms that differ between fields and change rapidly, and to have extremely high processing capacity, so that they can rapidly process extremely large and sharply increasing volumes of data.
Disclosure of Invention
(I) Objects of the invention
An object of the present invention is to provide a data processing apparatus and method, the data processing apparatus being provided with a processing module, a calculation module, and a Tx transmitter, the processing module transmitting a calculation instruction and a transmission instruction, the calculation module performing a calculation based on the calculation instruction and generating a calculation result, and the Tx transmitter transmitting, based on the transmission instruction, the calculation result generated by the calculation module.
According to the data processing device provided by the invention, the processing module sends the calculation instruction and the transmission instruction, and the Tx transmitter, which is connected with the calculation module, sends the calculation result generated by the calculation module. The processing module does not need to participate at all times in the calculation process or in the transmission of the calculation result, so it can process other data efficiently, and the efficiency of the processing module is improved.
(II) technical scheme
To solve the above problem, a first aspect of the present invention provides a data processing apparatus comprising: the processing module is used for sending instructions, and the instructions comprise calculation instructions and transmission instructions; the calculation module is used for executing calculation based on the calculation instruction and generating a calculation result; a Tx transmitter connected to the calculation module for transmitting the calculation result generated by the calculation module based on the transmission instruction.
According to the data processing device provided by the embodiment of the invention, the processing module sends the calculation instruction and the transmission instruction, and the Tx transmitter, which is connected with the calculation module, sends the calculation result generated by the calculation module. The processing module does not need to participate at all times in the calculation process or in the transmission of the calculation result, so it can process other data efficiently, and the efficiency of the processing module is improved.
Furthermore, the calculation module is also used for caching the calculation result; the Tx transmitter is used for transmitting the calculation result cached in the calculation module based on the transmission instruction.
Further, the calculation module comprises an execution unit and an output register set; the execution unit is configured to perform a calculation based on the calculation instruction and generate the calculation result; and the output register set is configured to cache the calculation result and comprises at least one output register.
Further, the output register set is further configured to send a first signal when the output register set is capable of receiving the calculation result, and the execution unit sends the calculation result to the output register set based on the first signal.
Further, the output register set is further configured to send a second signal when the calculation result cannot be received in the output register set, and the execution unit suspends sending the calculation result based on the second signal.
Further, the output register set is further configured to transmit a third signal when the number of the calculation results stored in the output register set is lower than a first preset value, and the Tx transmitter suspends fetching from the output register set based on the third signal.
Further, the output register set is further configured to send a fourth signal when the number of the calculation results stored in the output register set is higher than a second preset value, and the Tx transmitter fetches data from the output register set based on the fourth signal.
Further, the calculation module comprises one or more multiplier-adders; the number of output registers in the output register group is greater than or equal to the number of multiplier-adders.
Furthermore, the number of output registers in the output register group and the number of multiplier-adders arranged in the computing module are in a multiple relation.
More preferably, the number of output registers provided in the output register group is at least 2 times the number of multiplier-adders provided in the execution unit.
Further, the processing module is configured to send the calculation instruction and the transmission instruction in no particular order.
The second aspect of the present invention also provides a core structure, including the data processing apparatus and the storage module provided in the first aspect; the storage module is used for storing instructions and data, and the instructions comprise the calculation instructions and the transmission instructions.
Further, the memory module includes a plurality of memory cells; the plurality of memory cells includes a first memory cell, a second memory cell, and a third memory cell; the first storage unit is used for storing the instruction and/or storing data for the processing module to read and write; the second storage unit is used for storing the calculation result generated by the calculation module and/or storing data for reading and writing by the calculation module; the third storage unit is used for storing data received from the outside.
The third aspect of the present invention also provides a chip comprising one or more of the core structures provided in the second aspect.
The fourth aspect of the invention also provides a card comprising one or more chips of the third aspect.
The fifth aspect of the present invention also provides an electronic device comprising one or more cards of the fourth aspect.
The sixth aspect of the present invention further provides a data processing method, including that a processing module sends a calculation instruction and a transmission instruction; the calculation module executes calculation based on the calculation instruction to generate a calculation result; the Tx transmitter is connected with the calculation module and transmits the calculation result generated by the calculation module based on the transmission instruction.
(III) advantageous effects
The technical scheme of the invention has the following beneficial technical effects:
(1) According to the data processing device provided by the invention, the processing module sends the calculation instruction and the transmission instruction, the calculation module generates the calculation result, and the Tx transmitter, which is connected with the calculation module, sends the calculation result generated by the calculation module. The processing module does not need to participate at all times in the calculation process or in the transmission of the calculation result, so it can process other data efficiently, and the efficiency of the processing module is improved.
(2) According to the data processing device provided by the embodiment of the invention, the processing module sends the calculation instruction and the transmission instruction in no particular order, and the transmission instruction instructs the Tx transmitter to send the calculation result generated by the calculation module, so that the calculation performed by the calculation module and the sending of the calculation result by the Tx transmitter can proceed in parallel, and the utilization rate of the calculation module is improved.
(3) The data processing device provided by the embodiment of the invention does not need a processing module to split the task, thereby reducing the difficulty of programming and compiling.
Drawings
FIG. 1 is a schematic diagram of a data processing apparatus;
FIG. 2 is a schematic diagram of another data processing apparatus;
FIG. 3 is a timing chart of data processing performed by the other data processing apparatus shown in FIG. 2;
FIG. 4 is a schematic block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a core architecture according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a core architecture according to an embodiment of the present invention;
FIG. 9 is a flow chart of a data processing method according to an embodiment of the invention.
Reference numerals:
1: a core; 10: a processing module; 20: a calculation module; 21: an execution unit; 22: an output register set; 30: a Tx transmitter; 40: an Rx receiver; 100: a data processing device; 200: a storage module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
In a multi-core (many-core) architecture, multiple cores often participate in the execution of a task at the same time, in which case data must be transmitted between the cores. This data transmission may interfere with the computation of the cores and thereby reduce the effective performance of the whole chip, for example its data processing efficiency.
FIG. 1 is a schematic diagram of a data processing apparatus.
As shown in fig. 1, the data Processing apparatus includes a central Processing Unit CPU, a data transceiver module TR, a Processing Unit (PU), and a Memory module Memory (Mem for short), wherein the central Processing Unit CPU, the transceiver module TR, and the Processing Unit PU share the same Memory module Mem. Under the control of the CPU, the PU performs calculation, and the TR performs data transmission and reception. When the calculation of one task is completed, result data generated by the calculation of the task is transmitted, and the specific process is as follows:
the CPU reads out a calculation task from the memory module Mem, then sends a calculation instruction to the PU, and the PU starts calculation; after the PU completes the calculation, writing the calculation result into a memory module Mem, and sending the information of the completion of the calculation to a CPU; the CPU sends a transmission instruction to the TR based on the information of the completion of the calculation, the TR reads the calculation result from the memory module Mem, and then data transmission is started; and after the TR finishes the data transmission, sending a message of finishing the data transmission to the CPU.
In such a data processing apparatus, there are generally the following drawbacks:
1. Because data calculation and transmission of calculation results are processed serially, the calculation speed of the chip is low and its effective performance is reduced.
2. The CPU needs to control the calculation and the transmission of the calculation result based on the operating states of the PU and the TR, that is, based on the information that the PU has completed the calculation and the information that the TR has completed the transmission, the CPU instructs the PU to calculate or the TR to transmit, so that the data processing efficiency of the CPU is low and the CPU cannot be fully utilized.
3. The calculation result needs to be written into Mem by the PU first, and then read out from Mem by the TR before being sent out, which consumes considerable time and power, resulting in low data processing efficiency.
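The cost of the serial flow described above can be illustrated with a small timing model. This is only a sketch under stated assumptions: the function names, the fixed per-task calculation and transmission times, and the idealized two-stage pipeline used for comparison are all hypothetical and are not taken from the patent.

```python
def serial_total_time(compute_time: float, transmit_time: float, tasks: int = 1) -> float:
    """Elapsed time when every task is calculated first and transmitted afterwards."""
    return tasks * (compute_time + transmit_time)


def overlapped_total_time(compute_time: float, transmit_time: float, tasks: int = 1) -> float:
    """Idealized elapsed time if calculation and transmission fully overlapped.

    Assumption: a two-stage pipeline with constant stage times, so after the
    first task the slower stage determines the rate.
    """
    if tasks <= 0:
        return 0.0
    return compute_time + transmit_time + (tasks - 1) * max(compute_time, transmit_time)


if __name__ == "__main__":
    print(serial_total_time(3.0, 2.0, tasks=4))
    print(overlapped_total_time(3.0, 2.0, tasks=4))
```

The gap between the two values grows with the number of tasks, which is the inefficiency that drawback 1 refers to.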
Fig. 2 is a schematic configuration diagram of another data processing apparatus.
As shown in fig. 2, the data processing apparatus includes a central processing unit CPU, a transmitting and receiving module TR, an arithmetic unit PU, and a memory module Mem, wherein the memory module Mem includes a first memory unit M0 and a second memory unit M1. In the data processing device, two storage units are arranged, so that the PU and the TR can read and write data simultaneously, and parallel processing of calculating data and transmitting a calculation result is realized. Completing the calculation of a task and transmitting result data generated by the calculation of the task, wherein the process comprises the following steps:
First, the CPU reads out a calculation task from the memory module Mem and splits the task into a plurality of subtasks; the CPU then sends a calculation instruction to the PU, and the PU starts the first calculation and writes the calculation result into the first storage unit M0; after the PU completes the first calculation, the PU sends information that the calculation is complete to the CPU; the CPU gives a second calculation instruction to the PU, and the PU starts the second calculation and stores the calculation result in the second storage unit M1, while at the same time the CPU gives a transmission instruction to the TR, instructing the TR to read and transmit the data stored in the first storage unit M0; while the TR transmits the data stored in M0, the PU completes the second calculation, writes the calculation result into the second storage unit M1, and sends a signal that the second calculation is complete to the CPU; the CPU then instructs the PU to perform the third calculation, and the PU starts the third calculation and writes its result into M0, while at the same time the CPU instructs the TR to transmit the result data of the second calculation stored in the second storage unit M1; and so on until the whole calculation task is completed. Fig. 3 is a timing chart of data processing performed by the other data processing apparatus shown in fig. 2. As shown in fig. 3, in this data processing apparatus the CPU splits the task; fig. 3 takes splitting the task 2N times as an example.
Generally, such a data processing apparatus has the following disadvantages:
1. In order to enable the PU and the TR to process in parallel, the task needs to be divided, and the PU needs to write each calculation result into the corresponding storage unit depending on which calculation it is and when it occurs, which makes writing and compiling the program complicated.
2. The CPU needs to control the calculation and the transmission of the calculation result based on the running states of the PU and the TR, that is, based on the information that the PU has completed the calculation and the information that the TR has completed the transmission, the CPU instructs the PU to calculate again or the TR to transmit, so that the data processing efficiency of the CPU is low and the CPU cannot be fully utilized.
3. The calculation result needs to be written into Mem by the PU and read out from Mem by the TR, which consumes time and power, resulting in low data processing efficiency.
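The alternating use of M0 and M1 in the scheme of fig. 2 can be sketched as a simple schedule. The function name and the step-by-step representation are hypothetical and serve only as an illustration; each step pairs the storage unit the PU writes to with the storage unit the TR reads from.

```python
from typing import List, Optional, Tuple


def ping_pong_schedule(num_subtasks: int) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return, per step, (unit written by the PU, unit read by the TR).

    None means that the PU or the TR is idle in that step.
    """
    units = ("M0", "M1")
    steps = []
    for step in range(num_subtasks + 1):
        # The PU writes subtask `step` into M0, M1, M0, ... in turn.
        pu_target = units[step % 2] if step < num_subtasks else None
        # The TR sends the result of the previous subtask from the other unit.
        tr_source = units[(step - 1) % 2] if step > 0 else None
        steps.append((pu_target, tr_source))
    return steps


if __name__ == "__main__":
    for pu_target, tr_source in ping_pong_schedule(3):
        print("PU writes:", pu_target, "| TR reads:", tr_source)
```

In this sketch every step boundary corresponds to a point at which the CPU has to issue new instructions, and the program has to track which storage unit belongs to which subtask, which matches disadvantages 1 and 2 above.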
The following describes in detail a data processing apparatus according to an embodiment of the present application. In the description of the present invention, it should be noted that the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 4 is a schematic configuration diagram of a data processing apparatus according to a first embodiment of the present invention.
As shown in fig. 4, the data processing apparatus includes a processing module 10, a calculation module 20, and a Tx transmitter 30.
Alternatively, the processing module 10 may be a central processing unit CPU, and the computing module 20 may be one or more data arithmetic units PU.
The processing module 10 is configured to send an instruction, where the instruction includes a calculation instruction and a transmission instruction.
And the calculation module 20 is configured to perform calculation based on the calculation instruction sent by the processing module 10, and generate a calculation result.
A Tx transmitter 30 connected to the calculation module 20 for transmitting the calculation result generated by the calculation module 20 based on a transmission instruction.
In one embodiment, the calculation module 20 is further configured to cache the calculation result; the Tx transmitter 30 is configured to transmit, based on the transmission instruction, the calculation result cached in the calculation module 20.
In the data processing apparatus provided in the above embodiment of the present invention, the processing module sends the calculation instruction and the transmission instruction, the Tx transmitter is connected to the calculation module, and sends the calculation result generated by the calculation module, and the processing module 10 does not need to participate in the calculation process at all times, so that the processing module can further process other data efficiently, and the efficiency of the processing module is improved.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
As shown in fig. 5, the calculation module 20 includes an execution unit 21 and an output register group 22 connected to each other.
And an execution unit 21, configured to perform a calculation based on the calculation instruction sent by the processing module 10 and generate a calculation result.
And an output register set 22, configured to cache the calculation result, where the output register set 22 includes at least one output register.
In one embodiment, the output register group 22 is further configured to send a first signal when the output register group 22 is capable of receiving the calculation result. The execution unit 21 sends the calculation result to the output register group 22 based on the first signal.
In the above embodiment, the output register group 22 being capable of receiving the calculation result may mean that there is free storage space in the output register group 22, or that the storage space in the output register group 22 is sufficient to accommodate the amount of data obtained by the execution unit 21 executing one calculation.
In one embodiment, the processing module 10 issues computational instructions to the execution unit 21 of the computation module 20. When receiving the calculation instruction sent by the processing module 10, the execution unit 21 starts to perform calculation to obtain a calculation result, and when receiving the first signal sent by the output register set, the execution unit 21 buffers the calculation result in the output register set 22.
In a preferred embodiment, when the execution unit 21 receives the first signal sent by the output register set 22, it will determine whether the first signal indicates that there is a storage space in the output register set 22; if so, the execution unit 21 sends the calculation result to the output register group 22.
In a preferred embodiment, when the execution unit 21 receives the first signal sent by the output register set 22, it will determine whether the first signal indicates that the execution unit 21 can store the generated calculation result in the output register set 22; if so, the execution unit 21 sends the calculation result to the output register group 22.
In an embodiment, if the computation amount corresponding to the calculation instruction sent by the processing module 10 is large, a large number of calculation results are generated, and the execution unit 21 continuously writes the calculation results into the output register group 22, which may lead to a situation in which the output register group 22 cannot receive further calculation results. For example, the buffer space in the output register group 22 is fully occupied, or the remaining buffer space of the output register group 22 is not enough to store the amount of data generated by the execution unit 21 in one calculation.
In this case, the output register group 22 is also used to send a second signal to the execution unit 21, and the execution unit 21 enters a waiting state based on the second signal, i.e. the execution unit 21 suspends sending the calculation result based on the second signal; at this time, the calculation is suspended if the execution unit 21 cannot transfer the already calculated data to the output register group 22. When the execution unit 21 receives the first signal sent by the output register group 22 again, the execution unit 21 sends the result that has already been calculated to the output register group 22 and continues to perform calculation based on the calculation instruction.
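The write-side handshake described above can be sketched as follows. The function names, and the choice to count storage space in abstract units, are hypothetical; the sketch uses the second reading given above, in which the first signal is sent only when the remaining space can hold the data from one calculation.

```python
from typing import Tuple


def first_signal(capacity: int, used: int, result_size: int = 1) -> bool:
    """True when the free space can hold the data produced by one calculation."""
    return capacity - used >= result_size


def second_signal(capacity: int, used: int, result_size: int = 1) -> bool:
    """True when the output register group cannot receive the calculation result."""
    return not first_signal(capacity, used, result_size)


def execution_unit_step(capacity: int, used: int, result_size: int = 1) -> Tuple[int, bool]:
    """Try to send one calculation result; return (new used space, whether it was sent)."""
    if first_signal(capacity, used, result_size):
        return used + result_size, True
    # Second signal: the execution unit waits and keeps its result.
    return used, False


if __name__ == "__main__":
    print(execution_unit_step(capacity=4, used=2, result_size=2))
    print(execution_unit_step(capacity=4, used=3, result_size=2))
```

In the second call one unit of space is still free, but it is not enough for a two-unit result, so the execution unit has to wait.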
In an alternative embodiment, when there are no calculation results in the output register group 22 or the number of calculation results is lower than the first preset value, the output register group generates a third signal, and the Tx transmitter 30 enters a waiting state based on the third signal, i.e., the Tx transmitter 30 suspends taking data from the output register group 22 based on the third signal.
In an alternative embodiment, if there are calculation results in the output register group 22 or the number of calculation results in the output register group 22 is higher than the second preset value, the output register group 22 sends a fourth signal, and the Tx transmitter 30 fetches data from the output register group 22 based on the fourth signal. After receiving the fourth signal, Tx transmitter 30 may determine whether the fourth signal indicates that Tx transmitter 30 may retrieve the calculation result from the buffer of output register set 22, and if so, Tx transmitter 30 retrieves the calculation result from the buffer of output register set 22 and transmits it.
It is understood that the first preset value may be the same as or different from the second preset value, and the invention is not limited thereto.
It should be noted that the number of the calculation results mentioned herein may be a data amount obtained by one-time calculation by the execution unit, or may be a data amount obtained by multiple calculations, and the data amount obtained by one-time calculation may be a calculation result of the whole task, or may be a calculation result of a part of tasks, and is not limited herein. The first preset value refers to the size of the number of calculation results transmitted at one time, for example: the first preset value is 8 bits, and then the output register set sends out a third signal when the number of the calculation results in the output register set is less than 8 bits. Of course, the second preset value is also the same, and the preset value may be set according to actual needs without being limited in this embodiment.
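The read-side thresholds can be sketched in the same way. The function names are hypothetical, and the behavior when neither signal is sent (the Tx transmitter keeps its previous state) is an assumption of this sketch, since the text does not specify it.

```python
def third_signal(stored: int, first_preset: int) -> bool:
    """Sent when the stored amount is lower than the first preset value."""
    return stored < first_preset


def fourth_signal(stored: int, second_preset: int) -> bool:
    """Sent when the stored amount is higher than the second preset value."""
    return stored > second_preset


def tx_is_fetching(stored: int, first_preset: int, second_preset: int,
                   currently_fetching: bool) -> bool:
    """Return whether the Tx transmitter fetches data after evaluating the signals."""
    if third_signal(stored, first_preset):
        return False   # third signal: suspend fetching
    if fourth_signal(stored, second_preset):
        return True    # fourth signal: fetch data
    # Assumption: with no signal, the previous state is kept.
    return currently_fetching


if __name__ == "__main__":
    # Example from the text: with a first preset value of 8 bits,
    # 7 stored bits trigger the third signal.
    print(tx_is_fetching(7, first_preset=8, second_preset=8, currently_fetching=True))
```

Choosing a second preset value larger than the first would give the Tx transmitter a hysteresis band, but as noted above the two values may also be equal.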
Further, the execution unit 21 includes one or more multiplier-adders. The output register group 22 includes one or more output registers, and the number of the output registers provided in the output register group 22 is greater than or equal to the number of the multiplier-adders.
Preferably, the number of output registers provided in the output register group 22 and the number of multiplier-adders provided in the execution unit 21 are in a multiple relationship.
More preferably, the number of output registers provided in the output register group 22 is at least 2 times the number of multiplier-adders provided in the execution unit 21.
The number of output registers set in the output register group 22 is at least 2 times the number of multiplier-adders set in the execution unit 21, for example, the output register group 22 includes a first output register for storing the calculation result generated last time by the execution unit 21 and a second output register for storing the calculation result generated this time by the execution unit 21, and the output register group 22 can provide more storage space, so that the execution unit 21 can execute the calculation, the output register group 22 stores the calculation result, and the Tx transmitter fetches data from the output register group in parallel, thereby improving the data processing efficiency.
For another example, 2 multiplier-adders are arranged in the execution unit 21, and each multiplier-adder outputs a 2-byte calculation result every 2s, so that the 2 multiplier-adders generate 4 bytes of calculation results in total every 2s. In this case, 4 output registers may be arranged in the output register group: 2 output registers store the 4 bytes of calculation results generated by the execution unit 21 in the previous 2s, and the other 2 output registers store the 4 bytes of data generated by the execution unit 21 in the next 2s. The output registers can thus continuously store the data generated by the execution unit 21, and the execution unit 21 is efficiently utilized for data calculation. In the present embodiment, one output register stores the output result of one calculation by one multiplier-adder, i.e., one output register stores a 2-byte calculation result.
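The sizing in this example can be checked with a few lines of arithmetic. The function names and the `banks` parameter are hypothetical; `banks=2` corresponds to the "at least 2 times" relationship, with one set of registers for the previous results and one for the current results.

```python
def bytes_per_interval(num_multiplier_adders: int, bytes_per_result: int) -> int:
    """Data produced by all multiplier-adders in one output interval."""
    return num_multiplier_adders * bytes_per_result


def output_registers_needed(num_multiplier_adders: int, banks: int = 2) -> int:
    """Number of output registers when each register holds one multiplier-adder result."""
    return num_multiplier_adders * banks


if __name__ == "__main__":
    produced = bytes_per_interval(2, 2)       # 2 multiplier-adders, 2 bytes each
    registers = output_registers_needed(2)    # two sets of 2 registers
    capacity = registers * 2                  # each register stores 2 bytes
    print(produced, registers, capacity)
```

With these numbers the output register group holds exactly two intervals' worth of results, so one interval can be drained by the Tx transmitter while the next is being written.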
In one embodiment, processing module 10 is configured to send calculation instructions to calculation module 20 and send transmission instructions to Tx transmitter 30, non-sequentially.
In this embodiment, after receiving the transmission instruction sent by the processing module 10, the Tx transmitter 30 acquires information that data needs to be transmitted according to the setting of the processing module 10. Optionally, the information that needs to transmit data may be parameters such as the type of data, the size of data that needs to be sent, and the destination address to which the data is sent.
When Tx transmitter 30 receives the transmission instruction, and when Tx transmitter 30 receives the fourth signal from output register set 22, Tx transmitter 30 fetches the data from output register set 22 and transmits the data to the destination address.
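The information carried by a transmission instruction, and the condition under which the Tx transmitter fetches data, can be sketched as below. The class name, field names, and field types are hypothetical; the text only lists the type of data, the size of the data to be sent, and the destination address as optional parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransmissionInstruction:
    data_type: str      # type of the data to be sent (hypothetical encoding)
    size_bytes: int     # amount of data to send
    destination: int    # destination address


def tx_should_fetch(instruction: Optional[TransmissionInstruction],
                    fourth_signal: bool) -> bool:
    """The Tx transmitter fetches only after it has received a transmission
    instruction and while the output register group sends the fourth signal."""
    return instruction is not None and fourth_signal


if __name__ == "__main__":
    instr = TransmissionInstruction(data_type="int16", size_bytes=64, destination=0x10)
    print(tx_should_fetch(instr, fourth_signal=True))
    print(tx_should_fetch(None, fourth_signal=True))
```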
The embodiment of the present invention takes the computing module 20 as PU, the processing module 10 as CPU, the Execution Unit 21 as Execution Unit (EU), and the output register set 22 as Rout Array as an example, and details the data processing procedure of the data processing apparatus as follows:
The CPU extracts the calculation instruction and the transmission instruction, issues the calculation instruction to the EU in the PU, and sends the transmission instruction to the Tx transmitter 30, in no particular order.
After receiving the calculation instruction, the EU starts to calculate and generates a calculation result.
When the Rout Array is able to receive the calculation result, it sends a first signal to the EU.
When the EU receives the first signal W_rdy sent by the Rout Array, the EU determines whether the first signal W_rdy indicates that the Rout Array can store the calculation result. If so, the calculation result is stored in the Rout Array.
When there is a calculation result or the number of calculation results is higher than the second preset value in the Rout Array, a fourth signal is sent to the Tx transmitter, and the Tx transmitter 30 takes out the calculation result from the buffer of the output register set 22 and sends it out.
When the calculation result cannot be received in Rout Array, a second signal is sent to EU. And after receiving the second signal, the EU suspends sending the calculation result to the Rout Array.
When there are no calculation results in the Rout Array or the number of calculation results is lower than the first preset value, the Rout Array transmits a third signal to the Tx transmitter 30, and the Tx transmitter 30 temporarily buffers data from the Rout Array.
And so on, until the Tx transmitter 30 sends out all the calculation results corresponding to the transmission instruction issued by the CPU this time.
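The procedure above can be traced with a small cycle-by-cycle model. This is a minimal sketch under stated assumptions: the function `simulate`, the parameter `tx_every` (the Tx transmitter drains one result every `tx_every` cycles), and the one-result-per-cycle EU are all hypothetical simplifications chosen for illustration; the patent does not specify timing.

```python
from collections import deque


def simulate(results, depth, tx_every=2):
    """Trace EU -> Rout Array -> Tx with the four handshake conditions.

    results  : values the EU will produce, in order (assumed known here)
    depth    : number of results the Rout Array can hold (must be >= 1)
    tx_every : Tx fetches at most one result every `tx_every` cycles
    Returns (sent, eu_stalls).
    """
    if depth < 1 or tx_every < 1:
        raise ValueError("depth and tx_every must be at least 1")
    rout = deque()            # Rout Array contents, oldest first
    pending = deque(results)  # results the EU has not yet stored
    sent = []                 # what the Tx transmitter has sent out
    eu_stalls = 0             # cycles in which the EU had to wait
    cycle = 0
    while len(sent) < len(results):
        # Tx side: fetch only when the Rout Array holds a result
        # (fourth signal); otherwise fetching is suspended (third signal).
        if rout and cycle % tx_every == 0:
            sent.append(rout.popleft())
        # EU side: store only when the Rout Array can receive
        # (first signal); otherwise suspend (second signal).
        if pending:
            if len(rout) < depth:
                rout.append(pending.popleft())
            else:
                eu_stalls += 1
        cycle += 1
    return sent, eu_stalls


sent, stalls = simulate([1, 2, 3, 4, 5, 6], depth=2, tx_every=2)
print(sent, stalls)
```

In this model the results always leave in the order they were produced, and the EU only waits when the Tx transmitter drains more slowly than the EU fills; no step in the loop involves the CPU after the two instructions have been issued.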
In the data processing apparatus of the above embodiment, the CPU issues the calculation task and the transmission task to the EU and the Tx transmitter in no particular order, and the task does not need to be split. The calculation results generated by the EU are stored in the Rout Array, and the Tx transmitter fetches them from the Rout Array. On the one hand, the calculation results do not need to be stored in the Mem and do not occupy it, which reduces power consumption; on the other hand, calculation by the EU and transmission of the calculation results by the Tx transmitter proceed in parallel, so the CPU does not need to participate constantly, and the efficiency of the CPU is improved.
Further, as shown in fig. 6, the output register group 22 is, for example, a first-in first-out (FIFO) output register group, where W_rdy denotes the first signal, Full denotes the second signal, Empty denotes the third signal, and R_rdy denotes the fourth signal. The symbols between the FIFO and the Tx transmitter and between the FIFO and the EU denote inverters. Wt denotes the line on which the EU writes data, and Rd denotes the line on which the Tx transmitter reads data.
In one embodiment, Full indicates that the FIFO is full and has no free storage space, i.e., the Rout Array cannot receive a calculation result; in this case Full may be set to 1. Empty indicates that the FIFO is empty and holds no calculation result; in this case Empty may be set to 1.
The CPU sends the calculation instruction to the EU and the transmission instruction to the Tx transmitter in no particular order.
The EU performs the calculation based on the calculation instruction and generates a calculation result. If the FIFO has no free storage space, the FIFO outputs Full = 1, which becomes W_rdy = 0 after passing through the inverter. On receiving W_rdy = 0, the EU knows that the FIFO is full and enters a waiting state. When the EU receives the first signal from the FIFO, i.e., W_rdy = 1, the FIFO has free storage space, and the EU drives Wt to write the calculation result into the FIFO.
If the FIFO is empty and holds no calculation result, the FIFO outputs Empty = 1, which becomes R_rdy = 0 after passing through the inverter. In this case the Tx transmitter enters a waiting state even if it has already received the transmission instruction from the CPU. Once the FIFO is no longer empty, the Tx transmitter receives R_rdy = 1, indicating that data can be transmitted; if the Tx transmitter has received the transmission instruction from the CPU, it drives Rd to read the data from the FIFO and then transmits it.
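The relationship between Full, Empty, W_rdy and R_rdy can be sketched as a small software model. The class name `RoutFifo`, the `depth` parameter, and the method names are assumptions made for this illustration; the patent describes hardware signals, and this is only a behavioural approximation of them.

```python
from collections import deque


class RoutFifo:
    """Toy model of the FIFO output register group of fig. 6 (hypothetical)."""

    def __init__(self, depth: int):
        self.depth = depth      # number of results the FIFO can hold
        self._slots = deque()   # stored calculation results, oldest first

    @property
    def full(self) -> int:
        """Full = 1 when there is no free storage space."""
        return int(len(self._slots) == self.depth)

    @property
    def empty(self) -> int:
        """Empty = 1 when no calculation result is stored."""
        return int(len(self._slots) == 0)

    @property
    def w_rdy(self) -> int:
        """W_rdy is Full passed through an inverter."""
        return 1 - self.full

    @property
    def r_rdy(self) -> int:
        """R_rdy is Empty passed through an inverter."""
        return 1 - self.empty

    def write(self, result) -> bool:
        """EU side (Wt): store a result only when W_rdy = 1."""
        if not self.w_rdy:
            return False        # EU must wait
        self._slots.append(result)
        return True

    def read(self):
        """Tx side (Rd): fetch the oldest result only when R_rdy = 1."""
        if not self.r_rdy:
            return None         # Tx transmitter must wait
        return self._slots.popleft()


fifo = RoutFifo(depth=2)
fifo.write("r0")
fifo.write("r1")
print(fifo.full, fifo.w_rdy)   # FIFO is full, so the EU waits
print(fifo.read())             # Tx fetches the oldest result first
print(fifo.full, fifo.w_rdy)   # space is free again
```

Returning `False` or `None` stands in for the waiting state; in hardware the EU and the Tx transmitter would simply hold until the corresponding ready signal becomes 1.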
In this embodiment, the output register group and the execution unit in the data processing apparatus can operate in parallel during calculation. The processing module (CPU) sends the calculation instruction to the calculation module and the transmission instruction to the Tx transmitter in no particular order, so data calculation and transmission can proceed in parallel. The calculation results generated by the execution unit are stored in the output register group, and the Tx transmitter fetches the data from the output register group and then transmits it, so the CPU does not need to participate constantly, can process more other data, and its effective performance is improved.
FIG. 7 is a schematic diagram of a core architecture according to an embodiment of the present invention.
The core 1 of the embodiment shown in fig. 7 includes the data processing apparatus 100 provided in the above embodiment and the storage module 200. In the data processing apparatus 100, the calculation module 20 is a PU, the processing module 10 is a CPU, the execution unit 21 is an Execution Unit (EU), and the output register group 22 is a Rout Array. W_rdy denotes the first signal, and R_rdy denotes the fourth signal.
The storage module 200 is used for storing instructions and data, the instructions including the calculation instruction sent by the CPU to the execution unit 21 and the transmission instruction sent to the Tx transmitter 30.
In the above embodiment, the memory module 200 may be provided with one memory unit.
In the embodiment shown in fig. 7, the data processing procedure of the core structure is as follows:
The CPU fetches the calculation instruction and the transmission instruction from the storage module 200, issues the calculation instruction to the EU in the PU, and sends the transmission instruction to the Tx transmitter 30, in no particular order. After receiving the calculation instruction, the EU starts the calculation and generates calculation results. When the EU receives the first signal W_rdy sent by the Rout Array, it stores the calculation result in the Rout Array. When the Rout Array holds a calculation result, or when the number of calculation results it holds is higher than the second preset value, the Rout Array sends the fourth signal R_rdy. The Tx transmitter then fetches the calculation result from the Rout Array and transmits it.
In the core structure provided by the above embodiment, the EU stores the calculation results in the Rout Array, and the Tx transmitter fetches them from the Rout Array and then transmits them. The task therefore does not need to be split, calculation and transmission can proceed in parallel even with a single storage unit, and the complexity of programming and compiling is reduced.
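As a software analogy for this flow, the sketch below runs a stand-in for the EU and a stand-in for the Tx transmitter as two threads that share a bounded queue playing the role of the Rout Array. Everything here is an assumption for illustration: the function name `run_core`, the multiply-add operand triples, and the use of Python's `queue.Queue`, whose blocking `put` and `get` loosely mirror waiting on W_rdy and R_rdy. It is not the hardware described in the patent.

```python
import queue
import threading


def run_core(operands, rout_depth=4):
    """Run a toy EU and a toy Tx transmitter in parallel (hypothetical).

    operands   : list of (a, b, c) triples; the EU computes a * b + c
    rout_depth : capacity of the bounded queue standing in for Rout Array
    Returns the list of results in the order the Tx side sent them.
    """
    rout = queue.Queue(maxsize=rout_depth)
    sent = []

    def eu():
        # "Calculation instruction": one multiply-add per operand triple.
        for a, b, c in operands:
            rout.put(a * b + c)        # blocks while the queue is full

    def tx():
        # "Transmission instruction": the amount of data is known up front.
        for _ in range(len(operands)):
            sent.append(rout.get())    # blocks while the queue is empty

    # The "CPU" only starts both workers; the start order does not matter.
    workers = [threading.Thread(target=tx), threading.Thread(target=eu)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sent


print(run_core([(1, 2, 3), (4, 5, 6), (7, 8, 9)], rout_depth=2))
```

Here the Tx worker is deliberately started before the EU worker to show that issuing the transmission instruction first is harmless: the Tx side simply waits until the first result appears.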
FIG. 8 is a schematic diagram of a core architecture according to an embodiment of the present invention.
As shown in fig. 8, the core 1 includes the data processing apparatus 100 provided in the above embodiment and the storage module 200. The storage module 200 includes one or more storage units. The data processing apparatus 100 further includes an Rx receiver 40.
Optionally, the Rx receiver 40 and the Tx transmitter 30 may together be called a TR data transceiver module.
The plurality of storage units includes a first storage unit 210, a second storage unit 220, and a third storage unit 230.
The first storage unit 210 is used for storing the calculation instruction sent by the CPU to the execution unit 21, and may also be used for storing data for reading and writing by the CPU.
The second storage unit 220 is configured to store a calculation result generated by the calculation module 20, and may also be used to store data for reading and writing data by the calculation module 20.
The third storage unit 230 is used for storing data received by the Rx receiver 40 from outside. The Tx transmitter 30 may also read data from the third storage unit 230 and then transmit it.
In the present embodiment, the output register group 22 and the execution unit 21 in the data processing apparatus 100 can operate in parallel during calculation, and the processing module (CPU) sends the calculation instruction to the calculation module and the transmission instruction to the Tx transmitter in no particular order, so data calculation and transmission can proceed in parallel. In addition, when the storage module 200 is provided with three storage units, the TR data transceiver module, the calculation module, and the processing module (CPU) in the data processing apparatus can read the data they need fully in parallel during data processing. Therefore, in the present embodiment, the core has a high processing speed and good timeliness.
An embodiment of the present invention further provides a chip including one or more of the core structures of the above embodiments.
An embodiment of the present invention further provides a board card including one or more chips of the above embodiments.
An embodiment of the present invention further provides an electronic device including one or more board cards of the above embodiments.
Fig. 9 is a flowchart illustrating a data processing method according to an embodiment of the present invention.
As shown in fig. 9, the method includes steps S101 to S103:
Step S101: The processing module sends instructions, the instructions including a calculation instruction and a transmission instruction.
In a preferred embodiment, the processing module sends the calculation instruction to the calculation module and the transmission instruction to the Tx transmitter in no particular order.
Step S102: The calculation module performs a calculation based on the calculation instruction sent by the processing module and generates a calculation result.
In one embodiment, the calculation module includes an execution unit and an output register group.
The step in which the calculation module performs the calculation based on the calculation instruction sent by the processing module and generates the calculation result includes:
The execution unit performs the calculation based on the calculation instruction sent by the processing module and generates the calculation result.
The output register group buffers the calculation result, and includes at least one output register.
In one embodiment, when the output register group is able to receive a calculation result, it sends a first signal to the execution unit.
Step S103: The Tx transmitter, which is connected to the output register group, transmits the calculation result generated by the calculation module based on the transmission instruction.
In an optional embodiment, the steps further comprise:
When the execution unit receives the first signal sent by the output register group, the execution unit determines whether the first signal indicates that the output register group can store the calculation result. If so, the calculation result is stored in the output register group.
When the output register group holds a calculation result, or when the number of calculation results it holds is higher than a second preset value, the output register group sends a fourth signal to the Tx transmitter, and the Tx transmitter fetches the calculation result buffered in the output register group and transmits it.
When the output register group cannot receive a calculation result, it sends a second signal to the execution unit. After receiving the second signal, the execution unit suspends sending calculation results to the output register group.
When the output register group holds no calculation result, or when the number of calculation results it holds is lower than a first preset value, the output register group sends a third signal to the Tx transmitter, and the Tx transmitter suspends fetching data from the output register group.
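The four conditions above can be written as one function of the current fill level. This is a sketch under assumptions: the function name `rout_signals`, the `capacity` parameter, and the dictionary keys are illustrative, and the text does not say what the Tx transmitter does when the fill level lies between the two preset values, so the function only reports which signals are asserted.

```python
def rout_signals(count, capacity, first_preset, second_preset):
    """Report which of the four signals the output register group asserts.

    count         : number of calculation results currently stored
    capacity      : maximum number of results that can be stored (assumed)
    first_preset  : below this level the third signal is sent
    second_preset : above this level the fourth signal is sent
    """
    if not 0 <= count <= capacity:
        raise ValueError("count must be between 0 and capacity")
    can_receive = count < capacity
    return {
        "first": can_receive,                         # EU may store a result
        "second": not can_receive,                    # EU suspends storing
        "third": count == 0 or count < first_preset,  # Tx suspends fetching
        "fourth": count > second_preset,              # Tx fetches and sends
    }


# With second_preset = 0 the fourth signal simply means "a result exists".
print(rout_signals(count=0, capacity=4, first_preset=1, second_preset=0))
print(rout_signals(count=4, capacity=4, first_preset=1, second_preset=0))
```

Setting `first_preset` to 1 and `second_preset` to 0 reproduces the simple Empty / not-Empty behaviour; larger preset values would let the Tx transmitter wait for several results before fetching, which is one possible reading of the "preset value" variants.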
The technical solution of the present invention has the following beneficial technical effects:
(1) In the data processing method provided by the present invention, the processing module sends the calculation instruction and the transmission instruction, the calculation module generates the calculation result, and the Tx transmitter, which is connected to the calculation module, sends the calculation result generated by the calculation module. The processing module does not need to participate constantly in the calculation process or in the transmission of the calculation result, so it can process other data, and the efficiency of the processing module is improved.
(2) In the data processing method provided by the embodiment of the present invention, the processing module sends the calculation instruction and the transmission instruction in no particular order, and the transmission instruction instructs the Tx transmitter to send the calculation result generated by the calculation module. Calculation by the calculation module and transmission of the calculation result by the Tx transmitter can therefore proceed in parallel, which improves the utilization of the calculation module.
(3) The data processing apparatus provided by the embodiment of the present invention does not require the processing module to split the task, which reduces the difficulty of programming and compiling.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (10)

1. A data processing apparatus, comprising:
a processing module (10) for sending instructions, the instructions comprising: a calculation instruction and a transmission instruction;
a calculation module (20) for performing a calculation based on the calculation instruction and generating a calculation result;
a Tx transmitter (30) connected with the calculation module and used for transmitting the calculation result generated by the calculation module (20) based on the transmission instruction.
2. The data processing apparatus of claim 1,
the computing module (20) is further configured to cache the computing result;
the Tx transmitter (30) is used for transmitting the calculation result buffered in the calculation module (20) based on the transmission instruction.
3. The data processing device of claim 1 or 2, wherein the calculation module (20) comprises:
an execution unit (21) for performing a calculation based on the calculation instruction and generating the calculation result;
and the output register group (22) is used for caching the calculation result, and the output register group (22) at least comprises one output register.
4. The data processing apparatus of claim 3,
the output register set (22) is further used for sending a first signal when the output register set (22) can receive the calculation result, and the execution unit (21) sends the calculation result to the output register set (22) based on the first signal; and/or
The output register group (22) is further used for sending a second signal when the calculation result cannot be received in the output register group (22), and the execution unit (21) suspends sending the calculation result based on the second signal.
5. The data processing apparatus of claim 3 or 4,
the output register set (22) further configured to transmit a third signal when the number of the calculation results stored in the output register set (22) is lower than a first preset value, the Tx transmitter (30) suspending fetching from the output register set (22) based on the third signal; and/or
The output register group (22) is further configured to transmit a fourth signal when the number of the calculation results stored in the output register group (22) is higher than a second preset value, and the Tx transmitter (30) fetches data from the output register group (22) based on the fourth signal.
6. The data processing apparatus according to any one of claims 3 to 5,
the execution unit (21) comprises one or more multiplier-adders;
the number of the output registers in the output register group (22) is larger than or equal to the number of the multiplier-adders; preferably, the number of output registers arranged in the output register group (22) and the number of multiplier-adders arranged in the execution unit (21) are in a multiple relation; more preferably, the number of output registers provided in the output register group (22) is at least 2 times the number of multipliers and adders provided in the execution unit (21).
7. The data processing apparatus according to any one of claims 1 to 6,
the processing module (10) is used for sending the calculation instruction and the transmission instruction in a non-sequential manner.
8. A core architecture comprising a data processing apparatus according to any one of claims 1 to 7 and a memory module;
the storage module is used for storing instructions and data, and the instructions comprise the calculation instructions and the transmission instructions.
9. A chip comprising one or more core structures according to claim 8.
10. A data processing method, comprising:
the processing module (10) sends a calculation instruction and a transmission instruction;
the calculation module (20) executes calculation based on the calculation instruction to generate a calculation result;
a Tx transmitter (30) connected with the calculation module and transmitting the calculation result generated by the calculation module (20) based on the transmission instruction.
CN201911128755.9A 2019-11-18 2019-11-18 Data processing device and method Pending CN112817638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911128755.9A CN112817638A (en) 2019-11-18 2019-11-18 Data processing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911128755.9A CN112817638A (en) 2019-11-18 2019-11-18 Data processing device and method

Publications (1)

Publication Number Publication Date
CN112817638A true CN112817638A (en) 2021-05-18

Family

ID=75852943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911128755.9A Pending CN112817638A (en) 2019-11-18 2019-11-18 Data processing device and method

Country Status (1)

Country Link
CN (1) CN112817638A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1115794A (en) * 1997-06-20 1999-01-22 Hitachi Ltd Parallel data processor
CN1881201A (en) * 2006-04-10 2006-12-20 姜咏江 Core design of PU-MU-CHL structured computer
CN101303643A (en) * 2008-06-06 2008-11-12 清华大学 Arithmetic logic unit using asynchronous circuit to implement
CN202694323U (en) * 2012-07-20 2013-01-23 中国地质大学(武汉) Parallel cellular automaton processing system
CN103118096A (en) * 2013-01-25 2013-05-22 浪潮电子信息产业股份有限公司 Battle field real-time information sharing method based on cloud computing
US20140331032A1 (en) * 2013-05-03 2014-11-06 Ashraf Ahmed Streaming memory transpose operations
CN104657330A (en) * 2015-03-05 2015-05-27 浪潮电子信息产业股份有限公司 High-performance heterogeneous computing platform based on x86 architecture processor and FPGA
CN108984235A (en) * 2018-06-29 2018-12-11 郑州云海信息技术有限公司 A kind of method and relevant apparatus of data processing
CN109542515A (en) * 2017-10-30 2019-03-29 上海寒武纪信息科技有限公司 Arithmetic unit and method
CN109739556A (en) * 2018-12-13 2019-05-10 北京空间飞行器总体设计部 A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chen Hai et al.: "Design of an ATmega128 serial communication driver based on a real-time operating system", China Medical Equipment, vol. 13, no. 11, 2 December 2016 (2016-12-02), pages 98 - 103 *

Similar Documents

Publication Publication Date Title
CN102906726B (en) Association process accelerated method, Apparatus and system
CN111126589B (en) Neural network data processing device and method and electronic equipment
KR100840140B1 (en) System and method for organizing data transfers with memory hub memory modules
CN100562892C (en) Image processing engine and comprise the image processing system of image processing engine
KR102409024B1 (en) Multi-core interconnect in a network processor
US20130145124A1 (en) System and method for performing shaped memory access operations
TW200907699A (en) Resource management in multi-processor system
CN114546914B (en) Processing device and system for performing data processing on multiple channel information
CN114399035A (en) Method for transferring data, direct memory access device and computer system
EP3951605B1 (en) Data transmission device and method, and readable storage medium
CN114546277A (en) Device, method, processing device and computer system for accessing data
CN114564234A (en) Processing apparatus, method and system for performing data processing on a plurality of channels
US8819305B2 (en) Directly providing data messages to a protocol layer
CN114661353A (en) Data handling device and processor supporting multithreading
CN109558226A (en) A kind of DSP multi-core parallel concurrent calculating dispatching method based on internuclear interruption
CN115994115B (en) Chip control method, chip set and electronic equipment
CN108234147A (en) DMA broadcast data transmission method based on host counting in GPDSP
CN112817638A (en) Data processing device and method
CN109741237B (en) Large-scale image data processing system and method
CN108595369B (en) Arithmetic parallel computing device and method
CN109522125A (en) A kind of accelerated method, device and the processor of matrix product transposition
CN115237349A (en) Data read-write control method, control device, computer storage medium and electronic equipment
KR100639146B1 (en) Data processing system having a cartesian controller
CN116991593B (en) Operation instruction processing method, device, equipment and storage medium
CN113568665B (en) Data processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 201, No. 6 Fengtong Heng Street, Huangpu District, Guangzhou City, Guangdong Province

Applicant after: Guangzhou Ximu Semiconductor Technology Co.,Ltd.

Address before: Building 202-24, No. 6, Courtyard 1, Gaolizhang Road, Haidian District, Beijing

Applicant before: Beijing SIMM Computing Technology Co.,Ltd.

Country or region before: China
