WO2021174446A1 - Data processing apparatus and data processing method - Google Patents

Data processing apparatus and data processing method

Info

Publication number
WO2021174446A1
Authority
WO
WIPO (PCT)
Prior art keywords
program
processing
signal
processing core
scheduler
Application number
PCT/CN2020/077804
Inventor
王维伟
罗飞
Original Assignee
北京希姆计算科技有限公司
Application filed by 北京希姆计算科技有限公司
Priority to PCT/CN2020/077804
Priority to CN202080096325.8A (CN115151892A)
Publication of WO2021174446A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/24 Loading of the microprogram

Definitions

  • the chip is the cornerstone of data processing, and it fundamentally determines people's ability to process data. From the perspective of application fields, there are two main routes for chips. One is the general-purpose route, such as the central processing unit (CPU): these chips provide great flexibility, but their effective computing power is relatively low when processing algorithms in specific fields. The other is the dedicated route, such as the Tensor Processing Unit (TPU): these chips deliver higher effective computing power in some specific fields, but in more general, flexible and changeable fields their processing capability is relatively poor, or they cannot handle the task at all.
  • due to the wide variety and huge amount of data in the intelligent era, the chip is required to have both extremely high flexibility, so that it can process rapidly changing algorithms from different fields, and extremely strong processing capability, so that it can quickly process an extremely large and rapidly growing volume of data.
  • the object of the present invention is to provide a data processing device and a data processing method.
  • the data processing device is provided with a synchronization scheduler and a direct memory access controller; through the synchronization scheduler, the direct memory access controller is instructed to read the program corresponding to each processing core from an external storage unit and send it to the corresponding processing core.
  • the data processing device provided by the embodiment of the present invention does not require the processing cores to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip.
  • the programs executed by each processing core can be the same or different, and the synchronization scheduler responds to the program updates of multiple processing cores, so tasks can be assigned flexibly, the computing power of each processing core is fully used, and the computing power of the device is further enhanced.
  • the first aspect of the present invention provides a data processing device, which includes: at least two processing cores; a synchronization scheduler, used to generate and send a configuration signal in response to the program update signal of each processing core connected to the synchronization scheduler; and a direct memory access controller, used to read the program corresponding to each processing core from the external storage unit based on the configuration signal and send it to the corresponding processing core.
  • the data processing device provided by the embodiment of the present invention is provided with a synchronization scheduler and a direct memory access controller; the synchronization scheduler instructs the direct memory access controller to read a program from an external storage unit and send it to the corresponding processing core.
  • the data processing device provided by the embodiments of the present invention does not require the processing cores to fetch data from an external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip.
  • the programs executed by each processing core can be the same or different, and the synchronization scheduler responds to the program updates of multiple processing cores, so tasks can be distributed flexibly, the computing power of each processing core is fully used, and the computing power of the chip is further improved.
  • the synchronization scheduler is further configured to send, in response to the program update signal, a synchronization operation signal to each processing core connected to the synchronization scheduler; the synchronization operation signal is sent after the synchronization scheduler has received a predetermined number of the program update signals.
  • the synchronous operation signal is used to instruct each processing core to start executing their respective programs at the same time.
  • the program includes a plurality of program segments.
  • the processing core is configured to send the program update signal to the synchronization scheduler after each program segment is executed.
  • the direct memory access controller is used to read the program segments corresponding to each processing core from the external storage unit and send them to the processing cores corresponding to the program segments; the direct memory access controller is also used to send the configuration completion signal to the synchronization scheduler after sending the program segments to the corresponding processing cores.
  • the synchronization scheduler being further configured to send a synchronization operation signal to each processing core connected to the synchronization scheduler in response to the program update signal includes: the synchronization scheduler is also configured to send the synchronization operation signal to each processing core connected to the synchronization scheduler in response to the program update signal and the configuration completion signal.
  • the program includes an operation instruction and a program update instruction; the processing core is configured to execute the program update instruction after the operation instruction is completed, and to send the program update signal based on the program update instruction.
  • the synchronous scheduler includes a counter; the counter is used to record the number of program update signals received; the synchronous scheduler is also used to prepare to send the synchronous operation signal in response to the number of the program update signals being equal to the predetermined number.
  • the synchronization scheduler being further configured to prepare to send the synchronization operation signal in response to the number of the program update signals being equal to the predetermined number includes: the synchronization scheduler is further configured to send a configuration signal in response to the number of the program update signals being equal to the predetermined number, and to send the synchronization operation signal after receiving the configuration completion signal sent by the direct memory access controller.
  • the synchronization scheduler receiving a predetermined number of the program update signals includes: the synchronization scheduler receives a predetermined number of the program update signals sent by all the processing cores connected to the synchronization scheduler; or the synchronization scheduler receives the predetermined number of the program update signals sent by each of the processing cores connected to the synchronization scheduler.
  • the synchronous scheduler being configured to generate and send configuration signals in response to the program update signal of each of the processing cores includes: the synchronous scheduler is configured to send the configuration signal after receiving the program update signal sent by each processing core connected to the synchronous scheduler.
  • the at least two processing cores include a first processing core and a second processing core; the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core is the input of the program executed by the second processing core.
  • a chip including one or more data processing devices provided in the first aspect.
  • a card board which includes one or more chips provided in the second aspect.
  • an electronic device including one or more cards provided in the third aspect.
  • a data processing method comprising: a processing core executes a program; a synchronous scheduler generates and sends a configuration signal in response to the program update signal of each processing core connected to the synchronous scheduler; and a direct memory access controller, based on the configuration signal, reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.
  • a computer storage medium having a computer program stored on the computer storage medium, and when the program is executed by a processor, the data processing method of the fifth aspect is implemented.
  • an electronic device including a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the data processing method of the fifth aspect.
  • a computer program product which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
  • Fig. 1 is a schematic diagram of the structure of a chip provided by the prior art;
  • Fig. 2 is a schematic structural diagram of a chip provided by another prior art;
  • Fig. 3 is a schematic structural diagram of a data processing device according to the present invention;
  • Fig. 4 is a schematic structural diagram of another data processing device according to the present invention;
  • Fig. 5 is a schematic diagram of the structure of a neural network provided according to the present invention;
  • Fig. 6 is a schematic diagram of the operation of the data processing device provided by the present invention applied to the neural network;
  • Fig. 7 is a schematic diagram of scheduling processing cores by a synchronous scheduler according to the present invention;
  • Fig. 8 is a schematic flowchart of a data processing method provided according to the present invention.
  • multi-core or many-core architecture chips are often used.
  • the processing cores in a chip with a multi-core or many-core architecture have a certain ability to process data independently, and also have a relatively large internal storage space.
  • the larger storage space is generally used to store its own programs, data, and weights.
  • the computing power of each core depends on many factors, such as task scheduling and distribution, chip architecture, core structure, and core circuit. Among them, task scheduling and allocation is a very critical factor. If task scheduling and allocation are reasonable, the effective computing power of each core can be fully utilized, otherwise the effective computing power of each core is low.
  • FIG. 1 is a schematic diagram of the structure of a chip provided by the prior art.
  • the chip includes a scheduler and multiple processing cores C1 to Cn.
  • the scheduler receives instructions sent from outside the chip, for example from an instruction source, and then transmits each instruction to every processing core according to a preset strategy (for example, in a preset order). Each processing core executes the same instruction but processes different data. For example, if the instruction is to compute a+b, but a or b has different values on two processing cores, then the two processing cores process different data.
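  • as a minimal sketch of the "same instruction, different data" behaviour described above, the following Python models a scheduler that broadcasts one instruction to every core. The names `broadcast` and `add` and the operand values are hypothetical and only illustrate the idea; they are not taken from the patent.

```python
from typing import Callable, List, Tuple

def broadcast(instruction: Callable[[int, int], int],
              per_core_data: List[Tuple[int, int]]) -> List[int]:
    """Apply one shared instruction to the private operands of each core."""
    return [instruction(a, b) for a, b in per_core_data]

def add(a: int, b: int) -> int:
    # The single instruction "a + b" that every core executes.
    return a + b

# Two cores, same instruction, different values of a and b (assumed values).
results = broadcast(add, [(1, 2), (10, 20)])
```

    Every core necessarily runs `add`; only the operands differ, which is the lack of flexibility the text attributes to this prior-art structure.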
  • each processing core can have a relatively simple structure, such as a single instruction, multiple data (SIMD) structure or a single instruction, multiple threads (SIMT) structure.
  • the scheduler can only passively receive instructions from the outside and allocate them to each processing core. Regardless of whether the structure is SIMD or SIMT, every processing core can only execute the same instructions, so the chip has a single function and lacks flexibility.
  • Fig. 2 is a schematic structural diagram of a chip provided by another prior art.
  • the chip includes a plurality of processing cores C1 to Cn and a storage unit (Memory).
  • each core can independently read instructions from Memory (such as DDR) and perform operations.
  • each core has a complete control circuit, register set and other circuits. This structure is more common in multi-core CPUs or ASICs.
  • Each processing core has high autonomy and can run instructions independently. However, due to the high autonomy of the processing core, it is difficult for multiple processing cores to cooperate with each other to efficiently complete a complete task.
  • Fig. 3 is a schematic structural diagram of a data processing device according to the first embodiment of the present invention.
  • the data processing device includes at least two processing cores, a synchronous scheduler (Synchronizer and Scheduler, S_S), and a direct memory access controller (Direct Memory Access Controller, DMAC).
  • S_S is connected to at least two processing cores, which can be all processing cores in the data processing device, such as processing core C1 to processing core Cn; the DMAC is connected to the at least two processing cores, to S_S and to the external storage unit Memory.
  • S_S is used to generate and send a configuration signal in response to the program update signal of each processing core connected to the S_S.
  • DMAC is used to read the program corresponding to each processing core from the external Memory and send it to the corresponding processing core based on the configuration signal.
  • S_S is also used to send, in response to the program update signal, a synchronization operation signal to each processing core connected to S_S.
  • the synchronization operation signal is sent by S_S to each processing core after S_S has received a predetermined number of the program update signals, and it is used to instruct each processing core to start executing their respective programs at the same time.
  • S_S is also used to first send a configuration signal in response to the number of program update signals being equal to the predetermined number, and then send the synchronization operation signal after receiving the configuration completion signal sent by the direct memory access controller.
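  • the order of signals described above (program update, configuration, configuration completion, synchronization) can be sketched as follows. This is an illustrative model only: the function `run_handshake`, the trace strings and the assumption of one update signal per core per round are hypothetical and do not come from the patent.

```python
from typing import Dict, List

def run_handshake(cores: List[str], memory: Dict[str, str]) -> List[str]:
    """Return the ordered signal trace for one program update round."""
    trace: List[str] = []
    pram: Dict[str, str] = {}      # per-core program storage, filled by the DMAC
    received = 0
    predetermined = len(cores)     # assumption: one update signal per core

    # 1. Each core finishes its program and notifies S_S.
    for core in cores:
        received += 1
        trace.append(f"{core} -> S_S: program_update")

    # 2. Only when the predetermined number is reached does S_S act.
    if received == predetermined:
        trace.append("S_S -> DMAC: config")
        for core in cores:         # 3. DMAC copies programs from Memory
            pram[core] = memory[core]
            trace.append(f"DMAC -> {core}: program")
        trace.append("DMAC -> S_S: config_done")
        for core in cores:         # 4. All cores are released together
            trace.append(f"S_S -> {core}: sync")
    return trace

trace = run_handshake(["C1", "C2"], {"C1": "layer1_seg", "C2": "layer2_seg"})
```

    In this sketch the synchronization signal is never emitted before the configuration completion signal, which mirrors the ordering the text requires.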
  • the program includes a plurality of program segments.
  • the processing core is configured to send the program update signal to the S_S after each program segment is executed.
  • the number of program segments executed by each processing core is the same.
  • S_S includes a first counter, and the first counter is used to record the number of program update signals received.
  • S_S is also used to prepare to send the synchronization operation signal in response to the number of program update signals recorded by the first counter being equal to the predetermined number.
  • the number of first counters may be, for example, one or more.
  • the predetermined number may be, for example, the total number of program update signals that S_S receives from all cores, that is, the product of the number of processing cores connected to S_S and the number of program segments.
  • the preset number recorded by the first counter can be set to one value or to multiple values.
  • for example, the data processing device includes 2 processing cores, and each processing core executes 4 program segments.
  • When the preset number of the first counter is set to a single value, that value is 8: when the first counter records that 8 program update signals have been received, S_S sends a configuration signal to the DMAC. When S_S receives the configuration completion signal returned by the DMAC, S_S sends a synchronization signal to each core. At this time, the first counter is cleared and counting restarts.
  • When the preset number of the first counter is set to multiple values, the preset numbers are 8, 16, 24, and so on. These preset numbers are cumulative numbers of program update signals received by the synchronization scheduler.
  • When the first counter records that the cumulative number of program update signals received is 8, S_S sends a configuration signal to the DMAC. When the DMAC returns a configuration completion signal, S_S sends a synchronization signal to each core; the counter is not cleared at this time and continues counting. When the cumulative number of program update signals recorded by the counter reaches 16, S_S sends a configuration signal to the DMAC, and when it receives the configuration completion signal returned by the DMAC, S_S sends a synchronization signal to each core. The first counter continues to accumulate, and when it reaches the next preset number, S_S repeats the above steps.
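  • the two counter configurations described above can be compared with a small sketch. The function names `reset_mode` and `cumulative_mode` are hypothetical; each returns the running count of received program update signals at which S_S would send a configuration signal, using the 2-core, 4-segment example from the text.

```python
from typing import List

def reset_mode(signals: int, preset: int) -> List[int]:
    """Single preset value: the counter is cleared each time it is reached."""
    fired: List[int] = []
    counter = 0
    for n in range(1, signals + 1):
        counter += 1
        if counter == preset:
            fired.append(n)    # S_S configures the DMAC, then sends sync
            counter = 0        # counter cleared, counting restarts
    return fired

def cumulative_mode(signals: int, presets: List[int]) -> List[int]:
    """Multiple preset values: the counter keeps accumulating."""
    fired: List[int] = []
    targets = set(presets)
    counter = 0
    for _ in range(signals):
        counter += 1
        if counter in targets:
            fired.append(counter)  # counter is not cleared here
    return fired

cores, segments = 2, 4
preset = cores * segments          # product from the text: 2 * 4 = 8
single = reset_mode(24, preset)
multi = cumulative_mode(24, [8, 16, 24])
```

    Both configurations trigger at the same points in the signal stream; they differ only in whether the counter is cleared or compared against a growing list of thresholds.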
  • the predetermined number may be the number of program segments.
  • each processing core connected to S_S corresponds to a first counter, which records the number of program update signals sent to S_S by the corresponding processing core.
  • After each processing core has sent the predetermined number of program update signals, that is, when each first counter has recorded the preset number of program update signals, S_S sends a configuration signal to the DMAC. After receiving the configuration completion signal sent by the DMAC, S_S sends a synchronization signal to each core.
  • S_S being used to generate and send a configuration signal in response to the program update signal of each processing core includes: S_S is used to send the configuration signal after receiving the program update signal sent by each processing core connected to S_S. For example, if 5 processing cores are connected to S_S, S_S sends the configuration signal after receiving the program update signals sent by all 5 processing cores.
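  • the 5-core example above, where S_S waits until every connected core has reported, can be sketched as a barrier over a set of pending cores. The function `config_points` and the arrival order of the signals are assumptions made for illustration.

```python
from typing import Iterable, List

def config_points(events: Iterable[str], cores: List[str]) -> List[int]:
    """Return the event indices at which S_S would send a configuration signal.

    Each event is the name of the core whose program update signal arrives.
    """
    pending = set(cores)           # cores that have not yet reported
    fired: List[int] = []
    for index, core in enumerate(events):
        pending.discard(core)
        if not pending:            # every connected core has reported
            fired.append(index)
            pending = set(cores)   # start waiting for the next round
    return fired

cores = ["C1", "C2", "C3", "C4", "C5"]
# Assumed arrival order of program update signals over two rounds.
events = ["C1", "C3", "C2", "C5", "C4", "C2", "C1", "C4", "C3", "C5"]
points = config_points(events, cores)
```

    The arrival order within a round does not matter; the configuration signal is sent only when the last of the five cores reports.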
  • the direct memory access controller is used to read the program segment from the external Memory and send the program segment to the processing core corresponding to the program segment.
  • the external Memory refers, for example, to a storage unit on the Host.
  • the direct memory access controller is further configured to send a configuration completion signal to S_S after sending the program segment to the processing core corresponding to the program segment.
  • S_S is further configured to send a synchronization operation signal to the processing core connected to S_S in response to the program update signal, including:
  • S_S is also used to send a synchronous operation signal to each processing core connected to S_S in response to the program update signal and the configuration completion signal.
  • S_S is used to send the configuration signal to the DMAC in response to the number of program update signals recorded by the counter being equal to the predetermined number.
  • the DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core and then sends the configuration completion signal to S_S; after receiving the configuration completion signal, S_S sends a synchronous operation signal to each processing core connected to S_S.
  • the "program indicated by the configuration signal" may be the same as or different from the program that has just been executed by the processing core.
  • the program executed by the processing core includes an operation instruction and a program update instruction; the processing core is used to execute the program update instruction after completing the operation instruction, and generate and send a program update signal based on the program update instruction.
  • each program segment includes arithmetic instructions and program update instructions.
  • the processing core is provided with a storage module PRAM, which is used to receive and store the program sent by the DMAC; the program executed by the processing core is read from its own PRAM.
  • each processing core reads the instructions of the program from its own PRAM rather than from the external Memory, so the design of complex cache circuits can be avoided. Compared with the prior art, data does not need to be read from the Memory, which reduces delay and greatly improves the execution efficiency of instructions.
  • the storage space of the PRAM is greater than or equal to 16KB.
  • the programs stored in any two processing cores are the same or different.
  • the at least two processing cores in the data processing device include a first processing core and a second processing core; the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core may be the input of the program executed by the second processing core.
  • the calculation result of the program executed by the first processing core is used as the input of the second processing core, so that the chip provided in the embodiment can be used for the calculation of the neural network.
  • each processing core can run their stored programs at the same time, which enables the orderly exchange of data between each processing core, and cooperates with each other to efficiently complete a complete task.
  • the data processing device provided by the embodiment of the present invention allocates programs to each processing core through S_S and DMAC.
  • the programs executed by each processing core and the data exchanges between them are set before the programs run.
  • the micro-control unit (MCU) at the top level of the chip, or the host of the system, only needs to configure the counter of S_S to implement the established strategy. The MCU or the host can change the counter configuration of S_S and the programs stored in the Memory to change the program executed by each core and the distribution and scheduling of programs. This makes it convenient to modify the allocation and scheduling of tasks for each processing core in the chip and makes efficient use of the computing power of the processing cores.
  • a chip which includes one or more data processing devices provided in the above aspects.
  • the chip when the chip includes multiple data processing devices, the chip may include multiple S_Ss, and each S_S is connected to multiple processing cores and a direct access controller.
  • a card board which includes one or more chips provided in the foregoing embodiments.
  • an electronic device including one or more of the card boards provided in the foregoing embodiments.
  • Fig. 4 is a schematic structural diagram of a data processing device provided by the present invention.
  • the device includes a first processing core C1, a second processing core C2, S_S, and DMAC.
  • Each processing core is provided with PRAM.
  • Fig. 5 is a schematic diagram of the structure of the neural network provided by the present invention.
  • This embodiment takes a two-layer neural network as an example.
  • the neural network has a two-layer structure.
  • the control program of each layer of the neural network is 128KB, and the calculation amount of each layer is the same.
  • the task allocation strategy is to assign the two layers to the two processing cores for pipelined calculation, that is, C1 and C2 are each responsible for calculating one layer of the network, and each runs the program of the corresponding layer. For example, C1 calculates the first layer, Layer1, and C2 calculates the second layer, Layer2.
  • the input data is sent to C1; C1 performs the first-layer processing on the input data and sends the result to C2; C2 takes the first-layer result as input and performs the second-layer processing to obtain the final output. That is, the data flows through Layer1 and Layer2 in turn to realize the operation of the entire neural network.
  • Fig. 6 is a schematic diagram of the operation of applying the chip provided by the present invention to a neural network.
  • Input1-1 represents the input of the entire neural network, which is used as the input of the first layer, Layer1, at the beginning of time period t1. Input2-1 represents the calculation result of Layer1 in time period t1, which is used as the input of the second layer, Layer2, in time period t2. Output1 represents the calculation result of Layer2 after time period t2, which is also the output of the neural network.
  • when C1 processes Layer1 in the pipeline, its input is the input data of the neural network and its output is used as the input of C2; the output of C2 is the final output result.
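  • the two-stage pipeline across C1 and C2 can be sketched as follows. The layer functions used here (doubling and adding one) are placeholders chosen only to make the data flow visible, and `run_pipeline` is a hypothetical name; the patent does not specify the layer computations.

```python
from typing import Callable, List, Optional

def run_pipeline(inputs: List[int],
                 layer1: Callable[[int], int],
                 layer2: Callable[[int], int]) -> List[int]:
    """Model C1 (layer1) feeding C2 (layer2), one time period per iteration."""
    outputs: List[int] = []
    handoff: Optional[int] = None            # C1 result waiting for C2
    steps: List[Optional[int]] = [*inputs, None]  # extra period drains the pipe
    for x in steps:
        if handoff is not None:
            outputs.append(layer2(handoff))  # C2 works on the previous C1 result
        handoff = layer1(x) if x is not None else None  # C1 works on new input
    return outputs

# Placeholder layer computations (assumptions, not from the patent).
outputs = run_pipeline([1, 2, 3], lambda v: v * 2, lambda v: v + 1)
```

    In each period after the first, C2 consumes the result that C1 produced in the previous period while C1 already processes the next input, which is the overlap that Fig. 6 describes with time periods t1 and t2.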
  • the control program of each layer is 128KB. Since the PRAM of each processing core is only 32KB, the control program of each layer must be scheduled according to a certain strategy to update the program of each core. For example, the program of each core can be divided evenly into four program segments of 32KB, which are transmitted to the corresponding core one at a time.
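  • the segmentation arithmetic above (128KB program, 32KB PRAM, four segments) can be checked with a short sketch. `split_program` is a hypothetical helper, and the program is modelled as a zero-filled byte string purely for illustration.

```python
from typing import List

KB = 1024

def split_program(program: bytes, pram_size: int) -> List[bytes]:
    """Cut a program image into consecutive segments that each fit in PRAM."""
    return [program[i:i + pram_size] for i in range(0, len(program), pram_size)]

program = bytes(128 * KB)                   # placeholder 128KB layer program
segments = split_program(program, 32 * KB)  # PRAM holds 32KB at a time
```

    With these sizes the split yields exactly four equal segments, so each core needs four program updates per layer program.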
  • C1 receives the first program segment of the first program and its input at t1, and executes that program segment.
  • C1 then sends the program update signal PU_s1 to S_S. Because the output of C1 is the input of C2, it can be set that, while C1 runs the program segments of the first program, S_S sends a configuration signal to the DMAC each time it receives the program update signal sent by C1 alone. After C1 has run the last program segment of the first program, S_S sends a configuration signal to the DMAC, and the DMAC reads the first segment of the second program to be run by C1 and the first segment of the first program to be run by C2 from the external Memory and sends them to C1 and C2 respectively.
  • the DMAC then sends a configuration completion signal to S_S.
  • S_S sends a synchronous operation signal Sync to C1 and C2, instructing C1 and C2 to start running the received program segments at the same time.
  • the program update instruction PUpdate generates the program update signal PU_s and sends it to S_S.
  • PU_s indicates that the program in the PRAM needs to be updated. After S_S receives the signal, it judges whether the PU_s sent by all processing cores connected to S_S have been received. If not, S_S stays in a waiting state until all processing cores connected to S_S have sent PU_s. If S_S has received the PU_s sent by all the processing cores connected to S_S, it sends a configuration signal to the DMAC. The DMAC fetches the program segment indicated by the configuration signal from the external Memory and sends it to the processing core corresponding to the program segment, so that the PRAM is updated.
  • After S_S receives the program update signals sent by all processing cores connected to S_S for the fourth time, all program segments of each core's layer have been executed, and S_S first sends a configuration signal to the DMAC.
  • the DMAC reads the first program segment of the program indicated by the configuration signal from the external Memory according to the configuration signal and sends it to the corresponding processing core.
  • the DMAC sends a configuration completion signal.
  • When S_S receives the configuration completion signal sent by the DMAC, it generates and sends a synchronization operation signal Sync, indicating that the programs of the cores need to be executed at the same time. After receiving the synchronization operation signal, the cores can start to exchange data, that is, C1 sends its calculation result to C2.
  • the program indicated by the configuration signal may be the same as or different from the program that has just been executed.
  • C1 to receive and input the first program segment of the first program at time t1
  • C2 to also receive the first program segment of the first program at time t1
  • both C1 and C2 are executed
  • the input of the first program segment of the first program run by C2 can be set as a preset value, so that both C1 and C2 execute the first program at the same time.
  • Fig. 7 is a schematic diagram of scheduling processing cores by a synchronous scheduler provided by the present invention.
  • C1 and C2 receive the Sync signal sent by S_S and run the program stored in the PRAM from the beginning. After C1 executes the operation instructions in the first block, it executes the last instruction in that block, the update instruction PUpdate, which indicates that execution of the block is complete. PUpdate generates the program update signal PU_s1 and sends it to S_S; C1 then starts to wait.
  • S_S receives PU_s1 and finds that it has not yet received the update signals of all processing cores connected to S_S (PU_s2 has not arrived), so it continues to wait.
  • S_S receives PU_s2; the first counter now shows that the program update signals of all processing cores connected to S_S have been received, and S_S configures the DMAC to start the program update of C1 and C2.
  • The DMAC reads the new program segments of C1 and C2 from the Memory and sends them to the PRAM of C1 and C2, respectively, until the new program segments have been loaded.
  • When the first counter of S_S records that 8 program update signals have been received, the two processing cores have each completed their four program segments, that is, the entire program of each of the two processing cores has been executed.
  • The first counter is reset and starts recording update signals from zero again. S_S first configures the DMAC to start the program update of C1 and C2, that is, to load the first program segment of the next program. When the DMAC has sent the program segments, it sends the configuration completion signal to S_S; S_S then generates a synchronous operation signal Sync and sends it to each core, instructing the cores to start working at the same time. At this point the cores can transmit data to each other.
  • The steps of resetting the first counter, configuring the DMAC, and sending the synchronous operation signal may be performed in any order.
  • the above is an example of a 2-layer neural network.
  • the calculation result of C1 is taken as the input data of C2 as an example.
  • The data processing device provided by the embodiment of the present invention can be used with any neural network, and there may also be no data link between C1 and C2.
  • Fig. 8 is a schematic flow chart of the data processing method provided by the present invention.
  • the data processing method includes:
  • Step S101: the processing cores execute programs;
  • Step S102: the synchronous scheduler generates and sends a configuration signal in response to the program update signal of each processing core connected to the synchronous scheduler;
  • Step S103: based on the configuration signal, the direct storage access controller reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.
  • The synchronous scheduler also sends, in response to the program update signal, a synchronous operation signal to each processing core connected to it; the synchronous operation signal instructs the processing cores connected to the synchronous scheduler to start executing their respective programs at the same time.
  • S_S sending a synchronous operation signal to each processing core connected to S_S in response to the program update signal includes: when S_S has received a predetermined number of the program update signals, it sends the synchronous operation signal to each processing core connected to S_S.
  • the program executed by the processing core includes a plurality of program segments.
  • After executing each program segment, the processing core sends the program update signal to the synchronous scheduler.
  • the number of program segments executed by each processing core is the same.
  • the first counter of S_S records the number of program update signals received.
  • the number of the first counter may be, for example, one or more.
  • The predetermined number may be, for example, the total number of program update signals that S_S receives from all cores, that is, the product of the number of processing cores connected to S_S and the number of program segments.
  • the preset number recorded by the first counter can be set to one or more.
  • the data processing device includes two processing cores, and the program segments executed by each processing core are 4 segments.
  • When a single preset number is set for the first counter, that preset number is 8: when the first counter records that 8 program update signals have been received, S_S sends a configuration signal to the DMAC. After S_S receives the configuration completion signal returned by the DMAC, it sends a synchronization signal to each core. The first counter is then cleared and counting restarts.
  • When multiple preset numbers are set for the first counter, they are 8, 16, 24, and so on; these preset numbers are cumulative counts of the program update signals received by the synchronous scheduler.
  • When the first counter records that the cumulative number of program update signals received is 8, S_S sends a configuration signal to the DMAC. When the DMAC returns a configuration completion signal, S_S sends a synchronization signal to each core; the counter is not cleared at this point and continues counting. When the cumulative number recorded by the counter reaches 16, S_S again sends a configuration signal to the DMAC and, on receiving the configuration completion signal returned by the DMAC, sends a synchronization signal to each core. The first counter continues to accumulate, and each time it reaches the next preset number, S_S repeats the above steps.
  • the predetermined number may be the number of program segments.
  • Each processing core connected to S_S corresponds to one first counter, which records the number of program update signals that the corresponding processing core has sent to S_S.
  • After every processing core has sent the predetermined number of program update signals, that is, when every first counter has recorded the preset number of program update signals, S_S sends a configuration signal to the DMAC. After receiving the configuration completion signal sent by the DMAC, S_S sends a synchronization signal to each core.
  • S_S generating and sending a configuration signal in response to the program update signal of each processing core includes: S_S sends the configuration signal after receiving the program update signals sent by all processing cores connected to S_S. For example, if five processing cores are connected to the synchronous scheduler, the synchronous scheduler sends the configuration signal after receiving the program update signals sent by all five processing cores.
  • the DMAC also reads the program segment corresponding to each processing core from the external Memory, and sends the program segment to the corresponding processing core.
  • After sending the program segment to the corresponding processing core, the DMAC sends the configuration completion signal to S_S.
  • S_S sending a synchronous operation signal to each processing core connected to S_S in response to the program update signal includes: S_S sends the synchronous operation signal to each processing core connected to S_S in response to both the program update signal and the configuration completion signal.
  • In response to the number of program update signals recorded by the counter being equal to the predetermined number, S_S first sends the configuration signal to the DMAC.
  • The DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core and then sends the configuration completion signal to S_S. After receiving the configuration completion signal, S_S sends a synchronous operation signal to each processing core connected to S_S.
  • the program includes an operation instruction and a program update instruction; after the operation instruction is completed, the processing core executes the program update instruction, and based on the completion of the program update instruction, generates and sends a program update signal.
  • Completion of the operation instruction refers to the moment when the operation instruction is completed or any time after it is completed.
  • The at least two processing cores include a first processing core and a second processing core; the programs executed by the first processing core and the second processing core are different, and the calculation result of the program executed by the first processing core is the input to the program executed by the second processing core.
  • a computer storage medium is provided, and a computer program is stored on the computer storage medium, and when the program is executed by a processor, the data processing method provided in the above embodiment is implemented.
  • an electronic device including a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the program, the data processing method provided by the foregoing embodiment is implemented.
  • a computer program product which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data processing method provided in the above-mentioned embodiments.
  • The direct storage access controller is instructed by the synchronous scheduler to read the program from the external storage unit and send it to the processing core corresponding to the program.
  • The data processing device does not require the processing cores to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip.
  • The programs executed by the processing cores can be the same or different, and because the synchronous scheduler responds to the program updates of multiple processing cores, tasks can be allocated flexibly, the computing power of each processing core is fully utilized, and the computing power of the chip is further enhanced.
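
The two counting schemes described in the list above (two cores with four program segments each, so a preset number of 8, either with the counter cleared after each trigger or with cumulative presets 8, 16, 24, ...) can be illustrated with a minimal sketch. The function name `trigger_points` and its parameters are hypothetical and chosen only for illustration; the patent does not specify an implementation.

```python
def trigger_points(num_cores, segments_per_program, total_signals, reset_counter=True):
    """Return the 1-based positions in the stream of program update signals
    at which a hypothetical S_S would configure the DMAC for the next
    program and then issue Sync."""
    # Preset number = number of cores x number of segments (2 x 4 = 8 in the text).
    threshold = num_cores * segments_per_program
    counter = 0
    next_preset = threshold  # only used in cumulative mode
    triggers = []
    for signal_index in range(1, total_signals + 1):
        counter += 1
        if reset_counter:
            # Single preset: trigger at 8, clear the counter, count again.
            if counter == threshold:
                triggers.append(signal_index)
                counter = 0
        else:
            # Multiple presets: the counter keeps accumulating (8, 16, 24, ...).
            if counter == next_preset:
                triggers.append(signal_index)
                next_preset += threshold
    return triggers


single_preset = trigger_points(2, 4, 24, reset_counter=True)
cumulative_presets = trigger_points(2, 4, 24, reset_counter=False)
```

Under these assumptions both modes trigger at the same points in the signal stream; they differ only in whether the counter is cleared or keeps accumulating.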

Abstract

A data processing apparatus and a data processing method. The data processing apparatus comprises: at least two processing cores; a synchronous scheduler, configured to respond to a program update signal of each processing core connected to the synchronous scheduler, and generate and send a configuration signal; and a direct storage access controller, configured to read a program corresponding to each processing core from an external storage unit based on the configuration signal and send the program to a corresponding processing core. On the one hand, the data processing apparatus can read the external storage unit without using a processing core, avoiding data reading delay of a plurality of processing cores and improving the computing power of a chip, and on the other hand, the program executed by each processing core of the data processing apparatus may be the same or different, the synchronous scheduler responds to the update of programs of the plurality of processing cores, task assignment can be flexibly carried out, the computing power of each processing core is fully provided, and the computing power of the chip is further improved.

Description

Data processing device and data processing method
Background
With the development of science and technology, human society is rapidly entering the era of intelligence. A key feature of this era is that the types of data people obtain keep multiplying, the volume of data keeps growing, and the demands on data processing speed keep rising.
The chip is the cornerstone of data processing and fundamentally determines people's ability to process data. In terms of application fields, chips follow two main routes. One is the general-purpose chip route, for example the central processing unit (CPU); such chips provide great flexibility, but their effective computing power is relatively low when processing algorithms in specific fields. The other is the dedicated chip route, for example the tensor processing unit (TPU); such chips can deliver high effective computing power in certain specific fields, but in flexible, changeable, more general-purpose fields their processing capability is relatively poor, or they cannot handle the task at all.
Because the data of the intelligent era is both highly varied and enormous in volume, chips are required to be extremely flexible, so that they can handle rapidly evolving algorithms from different fields, and to have extremely strong processing capability, so that they can quickly process very large and rapidly growing volumes of data.
Summary of the invention
(1) Purpose of the invention
The purpose of the present invention is to provide a data processing device and a data processing method. The data processing device is provided with a synchronous scheduler and a direct storage access controller; the synchronous scheduler instructs the direct storage access controller to read the program corresponding to each processing core from an external storage unit and send it to the corresponding processing core. On the one hand, the data processing device provided by the embodiments of the present invention does not require the processing cores to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, in the data processing device provided by the embodiments of the present invention, the programs executed by the processing cores may be the same or different, and because the synchronous scheduler responds to the program updates of multiple processing cores, tasks can be allocated flexibly, the computing power of each processing core is fully utilized, and the computing power of the device is further improved.
(2) Technical solution
To solve the above problems, a first aspect of the present invention provides a data processing device. The device includes: at least two processing cores; a synchronous scheduler, configured to generate and send a configuration signal in response to the program update signals of the processing cores connected to the synchronous scheduler; and a direct storage access controller, configured to read, based on the configuration signal, the program corresponding to each processing core from an external storage unit and send it to the corresponding processing core.
The data processing device provided by the embodiments of the present invention is provided with a synchronous scheduler and a direct storage access controller; the synchronous scheduler instructs the direct storage access controller to read a program from the external storage unit and send it to the processing core corresponding to that program. On the one hand, the data processing device provided by the embodiments of the present invention does not require the processing cores to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, in the data processing device provided by the embodiments of the present invention, the programs executed by the processing cores may be the same or different, and because the synchronous scheduler responds to the program updates of multiple processing cores, tasks can be allocated flexibly, the computing power of each processing core is fully utilized, and the computing power of the chip is further improved.
Further, the synchronous scheduler is also configured to send, in response to the program update signals, a synchronous operation signal to each processing core connected to the synchronous scheduler. The synchronous operation signal is sent to the processing cores after the synchronous scheduler has received a predetermined number of the program update signals, and it instructs the processing cores to start executing their respective programs at the same time.
Further, the program includes a plurality of program segments.
Further, the processing core is configured to send the program update signal to the synchronous scheduler after executing each program segment.
Further, the direct storage access controller is configured to read the program segment corresponding to each processing core from the external storage unit and send each program segment to the corresponding processing core; the direct storage access controller is also configured to send a configuration completion signal to the synchronous scheduler after sending the program segments to the corresponding processing cores.
Further, the synchronous scheduler being also configured to send a synchronous operation signal to each processing core connected to the synchronous scheduler in response to the program update signals includes: the synchronous scheduler is also configured to send the synchronous operation signal to each processing core connected to the synchronous scheduler in response to both the program update signals and the configuration completion signal.
Further, the program includes operation instructions and a program update instruction; the processing core is configured to execute the program update instruction once the operation instructions are completed, and to send the program update signal once the program update instruction is completed.
Further, the synchronous scheduler includes a counter; the counter is used to record the number of program update signals received; the synchronous scheduler is also configured to prepare to send the synchronous operation signal in response to the number of program update signals being equal to the predetermined number.
Further, the synchronous scheduler being also configured to prepare to send the synchronous operation signal in response to the number of program update signals being equal to the predetermined number includes: the synchronous scheduler is also configured to send a configuration signal in response to the number of program update signals being equal to the predetermined number, and to send the synchronous operation signal after receiving the configuration completion signal sent by the direct storage access controller.
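
The handshake order described above (count the program update signals, send the configuration signal, wait for the configuration completion signal, then send the synchronous operation signal) can be sketched as follows. This is a software model under stated assumptions, not the hardware design: the class names `Dmac` and `SyncScheduler`, the method names, and the segment labels are all hypothetical.

```python
class Dmac:
    """Stand-in for the direct storage access controller (hypothetical interface)."""

    def __init__(self, next_segment, log):
        self.next_segment = next_segment  # core id -> segment to load next
        self.log = log

    def configure(self, core_ids):
        # Copy each core's next segment from "external memory" into its PRAM.
        for core_id in core_ids:
            self.log.append(f"DMAC: {self.next_segment[core_id]} -> {core_id}")
        return True  # models the configuration completion signal


class SyncScheduler:
    """Counts program update signals and runs the configuration/Sync handshake."""

    def __init__(self, core_ids, predetermined, dmac, log):
        self.core_ids = list(core_ids)
        self.predetermined = predetermined
        self.dmac = dmac
        self.log = log
        self.count = 0

    def on_program_update(self, core_id):
        self.count += 1
        self.log.append(f"S_S: PU from {core_id} ({self.count}/{self.predetermined})")
        if self.count != self.predetermined:
            return  # keep waiting for the remaining signals
        self.count = 0
        self.log.append("S_S: configuration signal -> DMAC")
        # Sync is sent only after the DMAC reports configuration completion.
        if self.dmac.configure(self.core_ids):
            self.log.append("S_S: Sync -> " + ", ".join(self.core_ids))


def run_demo():
    log = []
    dmac = Dmac({"C1": "segment_A", "C2": "segment_B"}, log)
    scheduler = SyncScheduler(["C1", "C2"], predetermined=2, dmac=dmac, log=log)
    scheduler.on_program_update("C1")
    scheduler.on_program_update("C2")
    return log
```

Calling `run_demo()` returns the ordered event log, which makes it easy to check that Sync is never issued before the DMAC has delivered the new segments.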
Further, the synchronous scheduler receiving a predetermined number of the program update signals includes: the synchronous scheduler receives the predetermined number of program update signals in total from all the processing cores connected to the synchronous scheduler; or the synchronous scheduler receives the predetermined number of program update signals from each of the processing cores connected to the synchronous scheduler.
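
The difference between the two readings of "predetermined number" (a total across all connected cores versus a count reached by each core individually) can be illustrated with a small check. The function `reached_predetermined` and its arguments are hypothetical names introduced only for this sketch.

```python
from collections import Counter


def reached_predetermined(signals, core_ids, predetermined, per_core):
    """signals is the list of core ids that have sent a program update signal so far."""
    counts = Counter(signals)
    if per_core:
        # Every connected core must have sent exactly the predetermined number.
        return all(counts[core_id] == predetermined for core_id in core_ids)
    # Only the total across all connected cores is compared.
    return sum(counts[core_id] for core_id in core_ids) == predetermined


cores = ["C1", "C2"]
balanced = ["C1", "C2"] * 4        # four signals from each core
skewed = ["C1"] * 5 + ["C2"] * 3   # eight signals in total, unevenly distributed
```

With the balanced stream both conditions hold (total 8, or 4 per core); with the skewed stream the total condition holds while the per-core condition does not, which is the case where the two variants behave differently.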
Further, the synchronous scheduler being configured to generate and send a configuration signal in response to the program update signals of the processing cores includes: the synchronous scheduler is configured to send the configuration signal after receiving the program update signal sent by every processing core connected to the synchronous scheduler.
Further, the at least two processing cores include a first processing core and a second processing core; the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core is the input of the program executed by the second processing core.
According to a second aspect of the present invention, a chip is provided, including one or more data processing devices provided in the first aspect.
According to a third aspect of the present invention, a card board is provided, including one or more chips provided in the second aspect.
According to a fourth aspect of the present invention, an electronic device is provided, including one or more card boards provided in the third aspect.
According to a fifth aspect of the present invention, a data processing method is provided. The method includes: processing cores execute programs; a synchronous scheduler generates and sends a configuration signal in response to the program update signals of the processing cores connected to the synchronous scheduler; and a direct storage access controller reads, based on the configuration signal, the program corresponding to each processing core from an external storage unit and sends it to the corresponding processing core.
According to a sixth aspect of the present invention, a computer storage medium is provided. A computer program is stored on the computer storage medium, and when the program is executed by a processor, the data processing method of the fifth aspect is implemented.
According to a seventh aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the data processing method of the fifth aspect is implemented.
According to an eighth aspect of the present invention, a computer program product is provided, which includes computer instructions; when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
(3) Beneficial effects
The above technical solution of the present invention has the following beneficial technical effects:
The data processing device provided by the embodiments of the present invention is provided with a synchronous scheduler and a direct storage access controller; the synchronous scheduler instructs the direct storage access controller to read a program from the external storage unit and send it to the processing core corresponding to that program. On the one hand, the data processing device provided by the embodiments of the present invention does not require the processing cores to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, in the data processing device provided by the embodiments of the present invention, the programs executed by the processing cores may be the same or different, and because the synchronous scheduler responds to the program updates of multiple processing cores, tasks can be allocated flexibly, the computing power of each processing core is fully utilized, and the computing power of the chip is further improved.
Description of the drawings
Fig. 1 is a schematic structural diagram of a chip provided by one prior art;
Fig. 2 is a schematic structural diagram of a chip provided by another prior art;
Fig. 3 is a schematic structural diagram of a data processing device provided according to the present invention;
Fig. 4 is a schematic structural diagram of another data processing device provided according to the present invention;
Fig. 5 is a schematic structural diagram of a neural network provided according to the present invention;
Fig. 6 is a schematic diagram of the operation of the data processing device provided according to the present invention when applied to a neural network;
Fig. 7 is a schematic diagram of the scheduling of processing cores by the synchronous scheduler provided according to the present invention;
Fig. 8 is a schematic flowchart of the data processing method provided according to the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings. It should be understood that these descriptions are only exemplary and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the present invention.
Chips with multi-core or many-core architectures are often used in neural network computing. The processing cores in such chips generally have some ability to process data independently and come with a relatively large internal storage space, which is generally used to store their own programs, data, and weights.
How to let the many cores deliver their computing power efficiently is the key to the performance of the whole chip. The computing power each core delivers depends on many factors, such as task scheduling and allocation, the chip architecture, the core structure, and the core circuits. Task scheduling and allocation is a particularly critical factor: if tasks are scheduled and allocated reasonably, the effective computing power of each core can be fully utilized; otherwise, the effective computing power of each core is low.
Fig. 1 is a schematic structural diagram of a chip provided by one prior art.
As shown in Fig. 1, the chip includes a scheduler and multiple processing cores C1 to Cn. The scheduler receives instructions sent from outside the chip, for example from an instruction source outside the chip, and then transmits the instructions to the processing cores according to a preset strategy (for example, in a preset order). The processing cores execute the same instruction but process different data. For example, if the instruction is to compute a+b but a or b has different values on two processing cores, the two processing cores process different data.
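
The a+b example above, in which every core executes the same instruction on its own operands, can be illustrated in a few lines. The helper `dispatch_same_instruction` and the operand values are hypothetical and serve only to show the idea of one instruction applied to different data.

```python
import operator


def dispatch_same_instruction(instruction, operands_per_core):
    """Apply one instruction to the (a, b) operands held by each core."""
    return {
        core_id: instruction(a, b)
        for core_id, (a, b) in operands_per_core.items()
    }


# Both cores execute a + b, but each holds different values of a and b.
results = dispatch_same_instruction(operator.add, {"C1": (1, 2), "C2": (10, 20)})
```

In this sketch the instruction is fixed for all cores, which mirrors the limitation the text goes on to describe: the cores cannot run different programs.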
For the chip architecture shown in Fig. 1, each processing core can have a relatively simple structure, such as a single instruction multiple data (SIMD) structure or a single instruction multiple threads (SIMT) structure.
This approach usually has the following drawback:
The scheduler can only passively receive instructions from outside and distribute them to the processing cores. Whether the structure is SIMD or SIMT, the processing cores can only execute the same instruction, so the chip has a single function and lacks flexibility.
Fig. 2 is a schematic structural diagram of a chip provided by another prior art.
As shown in Fig. 2, the chip includes multiple processing cores C1 to Cn and a storage unit Memory. In the chip shown in Fig. 2, each core can independently read instructions from the Memory (for example, DDR) and perform operations. Each core usually has complete control circuits, a register file, and other circuits. This structure is common in multi-core CPUs and ASICs.
This approach usually has the following drawbacks:
(1) Each processing core is highly autonomous and can run instructions independently, but because of this autonomy it is difficult for multiple processing cores to cooperate efficiently to complete one complete task.
(2) Circuit control in the chip is relatively complex, and each core is almost a complete CPU. Making the processing cores cooperate efficiently to complete one complete task makes the circuit difficult to design and results in high power consumption and a large area.
(3) Multiple processing cores may access the instruction storage area frequently, which reduces storage access efficiency and in turn limits the computing power the chip can deliver.
为解决上述问题,提出本发明的技术方案。In order to solve the above-mentioned problems, the technical solution of the present invention is proposed.
The chip provided by an embodiment of the present application is described in detail below. In the description of the present invention, the terms "first", "second", "third", and "fourth" are used only for descriptive purposes and should not be understood as indicating or implying relative importance. In addition, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Fig. 3 is a schematic structural diagram of a data processing device according to the first embodiment of the present invention.
As shown in Figure 3, the data processing device includes at least two processing cores, a synchronous scheduler (Synchronizer and Scheduler, S_S), and a direct memory access controller (DMAC).
The S_S is connected to the at least two processing cores, which may be all of the processing cores in the data processing device, for example processing cores C1 to Cn. The DMAC is connected to the at least two processing cores, the S_S, and an external storage unit (Memory).
The S_S is configured to generate and send a configuration signal in response to the program update signals of the processing cores connected to it.
The DMAC is configured to read, based on the configuration signal, the program corresponding to each processing core from the external Memory and send it to the corresponding processing core.
The data processing device provided by this embodiment of the present invention includes a synchronous scheduler and a direct memory access controller. The synchronous scheduler instructs the direct memory access controller to read a program from the external storage unit and send it to the processing core corresponding to that program. On the one hand, the processing cores do not need to fetch from the external storage unit themselves, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, the programs executed by the processing cores may be the same or different, and the synchronous scheduler responds to the program updates of multiple processing cores, so tasks can be allocated flexibly, the computing power of each processing core can be fully used, and the computing power of the chip is further improved.
In one embodiment, the S_S is further configured to send, in response to the program update signals, a synchronous operation signal to the processing cores connected to it. The S_S sends the synchronous operation signal to the processing cores after it has received a predetermined number of program update signals, and the signal instructs the processing cores to start executing their respective programs at the same time.
Further, when the number of program update signals equals the predetermined number, the S_S first sends the configuration signal and then, after receiving the configuration-complete signal from the DMAC, sends the synchronous operation signal.
In one embodiment, the program includes a plurality of program segments.
The processing core is configured to send the program update signal to the S_S after executing each program segment.
Preferably, every processing core executes the same number of program segments.
Optionally, the S_S includes a first counter, which records the number of program update signals received.
The S_S is further configured to prepare to send the synchronous operation signal when the number of program update signals recorded by the first counter equals the predetermined number.
In the embodiments of the present invention, there may be one first counter or multiple first counters.
Optionally, when there is one first counter, the predetermined number may be, for example, the total number of program update signals that the S_S receives from all cores, that is, the product of the number of processing cores connected to the S_S and the number of program segments.
Optionally, the first counter may be set with one preset number or with multiple preset numbers.
For example, the data processing device includes 2 processing cores, and each processing core executes 4 program segments.
When the first counter is set with a single preset number, that number is 8. When the first counter has recorded 8 program update signals, the S_S sends a configuration signal to the DMAC. After the S_S receives the configuration-complete signal returned by the DMAC, it sends a synchronization signal to each core; the first counter is then cleared and counting restarts.
When the first counter is set with multiple preset numbers, those numbers are 8, 16, 24, and so on, each being a cumulative number of program update signals received by the S_S. When the first counter has recorded a cumulative total of 8 program update signals, the S_S sends a configuration signal to the DMAC, and when the DMAC returns the configuration-complete signal, the S_S sends a synchronization signal to each core. The counter is not cleared at this point and keeps counting. When the cumulative total reaches 16, the S_S again sends a configuration signal to the DMAC and, on receiving the configuration-complete signal returned by the DMAC, sends a synchronization signal to each core. The first counter continues to accumulate, and each time the next preset number is reached, the S_S repeats the above steps.
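The two single-counter configurations above can be sketched in software as follows. This is only an illustrative model, not the patented circuit: the class name, the `cumulative` flag, and the event tuples are assumptions introduced here, and the wait for the DMAC's configuration-complete signal is collapsed into a comment.

```python
class SingleCounterScheduler:
    """Toy model of an S_S with one first counter (all names are hypothetical)."""

    def __init__(self, num_cores, num_segments, cumulative=False):
        self.step = num_cores * num_segments   # predetermined number, e.g. 2 * 4 = 8
        self.cumulative = cumulative           # False: clear at 8; True: 8, 16, 24, ...
        self.count = 0
        self.next_threshold = self.step
        self.events = []

    def on_program_update(self):
        """Record one program update signal and act when a preset number is hit."""
        self.count += 1
        if self.count == self.next_threshold:
            self.events.append(("config", self.count))
            # Real hardware would wait here for the DMAC's configuration-complete signal.
            self.events.append(("sync", self.count))
            if self.cumulative:
                self.next_threshold += self.step   # move on to the next preset number
            else:
                self.count = 0                     # clear the counter and restart


reset_mode = SingleCounterScheduler(num_cores=2, num_segments=4)
cumulative_mode = SingleCounterScheduler(num_cores=2, num_segments=4, cumulative=True)
for _ in range(16):
    reset_mode.on_program_update()
    cumulative_mode.on_program_update()
```

In this sketch both modes trigger at the same moments; they differ only in whether the counter value is cleared or keeps accumulating.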
Optionally, when there are multiple first counters, the predetermined number may be the number of program segments.
In this case, for example, each processing core connected to the S_S has its own first counter, which records the number of program update signals received from that processing core. After every processing core has sent the predetermined number of program update signals, that is, when every first counter has recorded the preset number of program update signals, the S_S sends a configuration signal to the DMAC. After receiving the configuration-complete signal from the DMAC, the S_S sends a synchronization signal to each core.
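A minimal sketch of the per-core counter variant, under stated assumptions: the class and method names are hypothetical, and clearing every counter once the condition is met is an assumption of this sketch, since the text does not say what happens to the counters afterwards.

```python
class PerCoreCounterScheduler:
    """Toy model of an S_S with one first counter per connected core (hypothetical)."""

    def __init__(self, core_ids, num_segments):
        self.num_segments = num_segments            # predetermined number per core
        self.counters = {core: 0 for core in core_ids}

    def on_program_update(self, core_id):
        """Count a signal; return True when every counter has reached the preset number."""
        self.counters[core_id] += 1
        if all(n >= self.num_segments for n in self.counters.values()):
            # Assumption: counters are cleared so the next program can be counted.
            for core in self.counters:
                self.counters[core] = 0
            return True    # S_S would now configure the DMAC, then send the sync signal
        return False


per_core = PerCoreCounterScheduler(["C1", "C2"], num_segments=4)
results = [per_core.on_program_update("C1") for _ in range(4)]
results += [per_core.on_program_update("C2") for _ in range(4)]
```

Only the last of the eight calls returns `True`, because C2's counter is the last one to reach 4.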
The above ways of setting the counter are only examples. Any software or hardware structure that allows the S_S, under given conditions, to configure the DMAC and to synchronize the processing cores can be used in the embodiments of the present invention, and no further description is given here.
In a preferred embodiment, the S_S being configured to generate and send a configuration signal in response to the program update signals of the processing cores includes: the S_S sends the configuration signal after receiving the program update signal from every processing core connected to it. For example, if 5 processing cores are connected to the S_S, the S_S sends the configuration signal only after it has received the program update signals sent by all 5 processing cores.
In a preferred embodiment, the DMAC is configured to read a program segment from the external Memory and send it to the processing core corresponding to that program segment. The external Memory is, for example, a storage unit on the Host.
The DMAC is further configured to send a configuration-complete signal to the S_S after sending the program segment to the corresponding processing core.
In one embodiment, the S_S being further configured to send a synchronous operation signal to the processing cores connected to it in response to the program update signals includes:
the S_S is further configured to send the synchronous operation signal to each processing core connected to it in response to the program update signals and the configuration-complete signal.
Specifically, when the number of program update signals recorded by the counter equals the predetermined number, the S_S first sends the configuration signal to the DMAC. The DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core and then sends a configuration-complete signal to the S_S. After receiving the configuration-complete signal, the S_S sends the synchronous operation signal to each processing core connected to it.
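The ordering of this handshake can be made concrete with a small trace function. It is a sketch only: the function name, the dictionary-based Memory and PRAM, and the trace strings are assumptions for illustration, not signals defined by the source.

```python
def reload_and_sync(memory, prams, config):
    """Model one S_S/DMAC handshake.

    memory: dict mapping a segment key to its bytes (stands in for external Memory)
    prams:  dict mapping a core id to the bytes currently held in its PRAM
    config: dict mapping a core id to the segment key it should receive
    """
    trace = ["S_S->DMAC: config"]
    for core, key in config.items():
        prams[core] = memory[key]             # DMAC copies the segment into the core
        trace.append(f"DMAC->{core}: {key}")
    trace.append("DMAC->S_S: config_done")    # sent only after all copies finish
    trace.append("S_S->cores: Sync")          # sync strictly follows config_done
    return trace


memory = {"layer1/seg1": b"L1S1", "layer2/seg1": b"L2S1"}
prams = {"C1": b"", "C2": b""}
trace = reload_and_sync(memory, prams, {"C1": "layer1/seg1", "C2": "layer2/seg1"})
```

The point of the ordering is that no core can be told to start before its PRAM holds the new segment.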
Here, the "program indicated by the configuration signal" may be the same as or different from the program that the processing core has just finished executing.
In one embodiment, the program executed by the processing core includes operation instructions and a program update instruction. After completing the operation instructions, the processing core executes the program update instruction and, based on it, generates and sends the program update signal.
When the program executed by the processing core includes multiple program segments, each program segment includes operation instructions and a program update instruction.
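As a rough illustration of a program segment that ends with a program update instruction, the sketch below interprets a list of instructions and emits a signal when it reaches the update instruction. The opcode names `ADD` and `PUPDATE`, the tuple encoding, and the callback are all hypothetical; the source does not define an instruction format.

```python
def run_segment(core_id, segment, send_to_scheduler):
    """Execute a segment given as a list of (opcode, operand) pairs."""
    acc = 0
    for opcode, operand in segment:
        if opcode == "ADD":            # stand-in for any operation instruction
            acc += operand
        elif opcode == "PUPDATE":      # program update instruction, last in the segment
            send_to_scheduler("PU_s", core_id)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return acc


received = []
segment = [("ADD", 2), ("ADD", 3), ("PUPDATE", None)]
result = run_segment("C1", segment, lambda signal, core: received.append((signal, core)))
```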
In one embodiment, each processing core is provided with a storage module, PRAM, which receives and stores the program sent by the DMAC; the program executed by the processing core is read from its own PRAM. In this embodiment, each processing core reads the instructions of its program from its own PRAM rather than from the external Memory, so a complex cache circuit does not have to be designed. Compared with the prior art, not reading from the Memory reduces latency and greatly improves instruction execution efficiency.
Preferably, the storage space of the PRAM is greater than or equal to 16 KB.
Optionally, the programs stored in any two processing cores are the same or different.
In a preferred embodiment, the at least two processing cores in the data processing device include a first processing core and a second processing core. The first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core can be the input of the program executed by the second processing core. Using the result of the first processing core as the input of the second processing core allows the chip provided by the embodiment to be used for neural network computation. The S_S can also make the processing cores run their stored programs at the same time, so that the cores exchange data in an orderly manner and cooperate efficiently to complete one whole task.
The data processing device provided by this embodiment of the present invention allocates programs to the processing cores through the S_S and the DMAC. The programs executed by each processing core and the data exchanges between cores are set before the programs run. The top-level microcontroller unit (MCU) of the chip or the Host of the system only needs to configure the counter of the S_S to implement the established strategy. The MCU or the Host can also change the configuration of the S_S counter and the programs stored in the Memory, thereby changing the programs executed by each core and how programs are allocated and scheduled. This makes it easy to modify the allocation and scheduling of processing core tasks in the chip and allows the computing power of the processing cores to be used efficiently.
According to another aspect of the present invention, a chip is provided, which includes one or more of the data processing devices provided in the above aspects.
For example, when the chip includes multiple data processing devices, the chip may include multiple S_Ss, each connected to multiple processing cores and one DMAC.
According to another embodiment of the present invention, a board card is provided, which includes one or more of the chips provided in the foregoing embodiments.
According to another embodiment of the present invention, an electronic device is provided, which includes one or more of the board cards provided in the foregoing embodiments.
Fig. 4 is a schematic structural diagram of a data processing device provided by the present invention.
As shown in Figure 4, the device includes a first processing core C1, a second processing core C2, an S_S, and a DMAC. Each processing core is provided with a PRAM.
Fig. 5 is a schematic structural diagram of a neural network provided by the present invention.
This embodiment takes a 2-layer neural network as an example. As shown in Figure 5, the neural network has two layers, the control program of each layer is 128 KB, and each layer has the same amount of computation. The whole network can therefore be assigned to two processing cores for pipelined computation under an even task allocation strategy: C1 and C2 are each responsible for one layer and each runs the program of its layer. For example, C1 computes the first layer, Layer1, and C2 computes the second layer, Layer2. The input data is sent to C1, which performs the first-layer processing and sends the result to C2. C2 takes the first-layer result as its input, performs the second-layer processing, and outputs the final result. In other words, the data flows through Layer1 and Layer2 in turn to carry out the computation of the whole neural network and produce the output.
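The two-stage pipeline can be sketched as follows. This is an illustrative model only: `run_pipeline` and the stand-in layer functions are assumptions, each loop iteration stands for one time period between synchronization points, and the overlap of C1 working on the next input while C2 works on the previous result is the usual reading of "pipelined" rather than something the text spells out.

```python
def run_pipeline(inputs, layer1, layer2):
    """Feed inputs through C1 (layer1) and C2 (layer2), one hand-off per period."""
    outputs = []
    c1_result = None                    # what C1 hands to C2 at the next sync point
    for x in list(inputs) + [None]:     # one extra period drains the pipeline
        c2_input = c1_result            # C2 starts from C1's previous result
        c1_result = layer1(x) if x is not None else None
        if c2_input is not None:
            outputs.append(layer2(c2_input))
    return outputs


# Stand-in layers; a real layer would be the 128 KB control program of that layer.
pipeline_outputs = run_pipeline([1, 2, 3], lambda x: x + 1, lambda x: x * 2)
```

The sketch uses `None` as an "empty" marker, so it assumes that neither the inputs nor the first-layer results are `None`.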
Fig. 6 is a schematic diagram of the operation of the chip provided by the present invention when applied to a neural network.
As shown in Figure 6, Input1-1 is the input of the whole neural network and is also the input of the first layer, Layer1, at the start of time period t1. Input2-1 is the result computed by Layer1 during time period t1 and is also the input of the second layer, Layer2, at the start of time period t2. Output1 is the result computed by Layer2 after time period t2 and is also the output of the neural network. When C1 processes the Layer1 stage of the pipeline, its input is the input data of the neural network and its output serves as the input of C2; the output of C2 is the final output.
The control program of each layer is 128 KB, but the PRAM of each processing core is only 32 KB, so the control program of each layer must be scheduled according to some strategy that updates the program in each core. For example, the program of each core can be divided evenly into four program segments that are transferred to the corresponding core 32 KB at a time.
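The arithmetic here is 128 KB / 32 KB = 4 segments. A minimal sketch of that split, where `split_into_segments` is a hypothetical helper and a zero-filled byte string stands in for the real control program:

```python
KB = 1024


def split_into_segments(program, pram_size):
    """Cut a program image into consecutive chunks that each fit in one PRAM."""
    return [program[i:i + pram_size] for i in range(0, len(program), pram_size)]


layer_program = bytes(128 * KB)                 # placeholder 128 KB control program
segments = split_into_segments(layer_program, 32 * KB)
segment_sizes = [len(s) for s in segments]      # four segments of 32 KB each
```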
The calculation process of the processing cores in the chip when running the neural network is as follows:
Initially, at time t1, C1 receives the first program segment of the first program and executes it. After executing this segment, C1 sends the program update signal PU_s1 to the S_S. Because the output of C1 is the input of C2, the S_S can be set so that, while C1 runs the segments of the first program, it sends a configuration signal to the DMAC each time it receives the program update signal from C1 alone. After C1 has run the last segment of the first program, the S_S sends a configuration signal to the DMAC, and the DMAC reads from the external Memory the first segment of the second program to be run by C1 and the first segment of the first program to be run by C2 and sends them to C1 and C2. When the transfer is complete, the DMAC sends a configuration-complete signal to the S_S, and the S_S sends the synchronous operation signal Sync to C1 and C2, instructing them to start running the received program segments at the same time.
When C1 or C2 has finished the operation instructions of a program segment, it executes the program update instruction PUpdate, which generates the program update signal PU_s and sends it to the S_S. PU_s indicates that the program in the PRAM needs to be updated. On receiving the signal, the S_S checks whether it has received PU_s from all the processing cores connected to it. If not, it waits until PU_s has arrived from every connected processing core. Once it has, the S_S sends a configuration signal to the DMAC, and the DMAC fetches the program segment indicated by the configuration signal from the external Memory and sends it to the corresponding processing core, thereby updating the PRAM of each core.
After the S_S has received program update signals from all connected processing cores for the fourth time, all program segments of each layer on each core have been executed. The S_S first sends a configuration signal to the DMAC. Based on this signal, the DMAC reads from the external Memory the first program segment of the program indicated by the configuration signal and sends it to the corresponding processing core. When the transfer is complete, the DMAC sends a configuration-complete signal.
After the S_S receives the configuration-complete signal from the DMAC, it generates and sends the synchronous operation signal Sync, which instructs the cores to start executing their programs at the same time. After receiving the synchronous operation signal, the cores can start exchanging data, that is, C1 sends its calculation result to C2.
It should be noted that the program indicated by the configuration signal may be the same as or different from the program that has just been executed.
It should also be noted that, initially, C1 can be set to receive the first program segment of the first program at time t1 and C2 can also be set to receive the first program segment of the first program at time t1, with both C1 and C2 executing the first program segment. The input of the first program segment of the first program run by C2 can be set to a preset value, so that C1 and C2 execute the first program at the same time.
Fig. 7 is a schematic diagram of how the synchronous scheduler provided by the present invention schedules the processing cores.
As shown in Figure 7, C1 and C2 receive the Sync signal sent by the S_S and simultaneously start running the programs stored in their PRAMs from the beginning. After C1 has executed the operation instructions in its first program segment, it executes the last instruction of the segment, the update instruction PUpdate, which indicates that this segment has finished. PUpdate generates the program update signal PU_s1 and sends it to the S_S, and C1 then starts waiting.
The S_S receives PU_s1 and finds that it has not yet received update signals from all the processing cores connected to it, that is, PU_s2 for this round has not arrived, so it continues to wait.
After C2 has executed the operation instructions in its first program segment, it executes the last instruction of the segment, the update instruction PUpdate, which indicates that the segment has finished. PUpdate generates the program update signal PU_s2 and sends it to the S_S, and C2 then starts waiting.
When the S_S receives PU_s2, the first counter shows that program update signals have been received from all the processing cores connected to the S_S, so the S_S configures the DMAC and starts the program update for C1 and C2.
The DMAC reads the new program segments for C1 and C2 from the Memory and sends them to the PRAMs of C1 and C2 respectively, until the new program segments have been fully loaded.
When the first counter of the S_S has recorded 8 program update signals, both processing cores have completed their four program segments, that is, each of the 2 processing cores has executed its whole program. The S_S resets the first counter and starts recording the number of updates from zero. It first configures the DMAC and starts the program update for C1 and C2, that is, it loads the first program segment of the next program. When the DMAC has finished sending the program segments, it sends a configuration-complete signal to the S_S. The S_S then generates the synchronous operation signal Sync and sends it to each core, instructing the cores to start working at the same time; at this point the cores can transfer data to each other.
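The Figure 7 sequence can be traced with a small simulation, assuming 2 cores and 4 segments per program as in the example. The function name and the log strings are hypothetical, cores are processed in a fixed order for simplicity (in hardware they may finish in any order), and the DMAC transfer is reduced to a log entry.

```python
def simulate_fig7(cores=("C1", "C2"), segments_per_program=4):
    """Return an S_S event log for one whole program on every core."""
    whole_program = len(cores) * segments_per_program   # 2 * 4 = 8 update signals
    first_counter = 0
    log = []
    for segment in range(1, segments_per_program + 1):
        waiting_for = set(cores)
        for core in cores:
            waiting_for.remove(core)          # PU_s from this core has arrived
            first_counter += 1
            if waiting_for:                   # some core has not reported yet
                log.append(f"segment {segment}: PU_s from {core}, keep waiting")
        if first_counter == whole_program:
            first_counter = 0                 # reset and count the next program
            log.append("configure DMAC: first segment of next program")
            log.append("config done, send Sync")
        else:
            log.append(f"segment {segment}: configure DMAC, load next segment")
    return log


fig7_log = simulate_fig7()
```

With the default arguments the log holds nine entries: one "keep waiting" entry per segment, three mid-program reloads, and the final configuration followed by Sync.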
It should be noted that resetting the first counter, configuring the DMAC, and sending the synchronous operation signal may be performed in any order.
The description above uses a 2-layer neural network as an example, with the calculation result of C1 serving as the input data of C2. The data processing device provided by the embodiments of the present invention can of course be used for any neural network, and C1 and C2 may also have no data dependency on each other.
Fig. 8 is a schematic flowchart of the data processing method provided by the present invention.
As shown in Figure 8, the data processing method includes:
Step S101: a processing core executes a program;
Step S102: the synchronous scheduler generates and sends a configuration signal in response to the program update signals of the processing cores connected to it;
Step S103: based on the configuration signal, the direct memory access controller reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.
In one embodiment, the synchronous scheduler also sends, in response to the program update signals, a synchronous operation signal to each processing core connected to it. The synchronous operation signal instructs the processing cores connected to the synchronous scheduler to start executing their respective programs at the same time.
Specifically, the S_S sending a synchronous operation signal to each connected processing core in response to the program update signals includes: after the S_S has received a predetermined number of program update signals, it sends the synchronous operation signal to each processing core connected to it.
In one embodiment, the program executed by the processing core includes a plurality of program segments.
After executing each program segment, the processing core sends the program update signal to the synchronous scheduler.
Preferably, every processing core executes the same number of program segments.
Preferably, the first counter of the S_S records the number of program update signals received.
The S_S also prepares to send the synchronous operation signal when the number of program update signals recorded by the first counter equals the predetermined number.
In the embodiments of the present invention, there may be one first counter or multiple first counters.
Optionally, when there is one first counter, the predetermined number may be, for example, the total number of program update signals that the S_S receives from all cores, that is, the product of the number of processing cores connected to the S_S and the number of program segments.
Optionally, the first counter may be set with one preset number or with multiple preset numbers. For example, the data processing device includes two processing cores, and each processing core executes 4 program segments.
When the first counter is set with a single preset number, that number is 8. When the first counter has recorded 8 program update signals, the S_S sends a configuration signal to the DMAC. After the S_S receives the configuration-complete signal returned by the DMAC, it sends a synchronization signal to each core; the first counter is then cleared and counting restarts.
When the first counter is set with multiple preset numbers, those numbers are 8, 16, 24, and so on, each being a cumulative number of program update signals received by the S_S. When the first counter has recorded a cumulative total of 8 program update signals, the S_S sends a configuration signal to the DMAC, and when the DMAC returns the configuration-complete signal, the S_S sends a synchronization signal to each core. The counter is not cleared at this point and keeps counting. When the cumulative total reaches 16, the S_S again sends a configuration signal to the DMAC and, on receiving the configuration-complete signal returned by the DMAC, sends a synchronization signal to each core. The first counter continues to accumulate, and each time the next preset number is reached, the S_S repeats the above steps.
Optionally, when there are multiple first counters, the predetermined number may be the number of program segments.
In this case, for example, each processing core connected to S_S has its own first counter, which records the number of program update signals received from that processing core. Once S_S has received the predetermined number of program update signals from every processing core, that is, once every first counter has received the preset number of program update signals, S_S sends a configuration signal to the DMAC. After receiving the configuration completion signal sent by the DMAC, S_S sends a synchronization signal to each core.
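A minimal sketch of the per-core variant follows, assuming two cores and a predetermined number of 4 (the number of program segments in the earlier example). `PerCoreCounters` and its method names are hypothetical; only the rule "every counter has received the predetermined number" comes from the text.

```python
class PerCoreCounters:
    """One first counter per processing core connected to S_S."""

    def __init__(self, core_ids, predetermined):
        self.predetermined = predetermined
        self.counts = {core_id: 0 for core_id in core_ids}

    def on_program_update(self, core_id):
        """Record a signal from one core; return True once every counter
        has received the predetermined number of program update signals."""
        self.counts[core_id] += 1
        return all(count == self.predetermined for count in self.counts.values())


counters = PerCoreCounters(core_ids=[0, 1], predetermined=4)

# Hypothetical arrival order of program update signals from core 0 and core 1.
arrivals = [0, 1, 0, 0, 1, 0, 1, 1]
ready_flags = [counters.on_program_update(core_id) for core_id in arrivals]
```

Only the last signal makes the condition true, because core 1 reaches its fourth signal later than core 0; at that point S_S would send the configuration signal to the DMAC.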
Optionally, S_S generating and sending the configuration signal in response to the program update signals of the processing cores includes: S_S sends the configuration signal after it has received a program update signal from every processing core connected to S_S. For example, if five processing cores are connected to the synchronization scheduler, the synchronization scheduler sends the configuration signal only after it has received program update signals from all five processing cores.
Optionally, the DMAC also reads the program segment corresponding to each processing core from the external memory and sends the program segment to the corresponding processing core. After sending the program segments to the corresponding processing cores, the DMAC also sends a configuration completion signal to S_S.
In one embodiment, S_S sending a synchronous operation signal to each processing core connected to S_S in response to the program update signal includes: S_S is further configured to send the synchronous operation signal to each processing core connected to S_S in response to both the program update signal and the configuration completion signal.
Specifically, when the number of program update signals recorded by the counter equals the predetermined number, S_S first sends a configuration signal to the DMAC. The DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core and then sends a configuration completion signal to S_S. After receiving the configuration completion signal, S_S sends the synchronous operation signal to each processing core connected to S_S.
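The signal sequence in this embodiment can be traced with a short Python simulation. All class names, the message strings, and the choice to clear the counter after each round are assumptions made for illustration; the predetermined number is taken to be the number of cores, so each core sends one program update signal per round.

```python
class Core:
    """A processing core that holds one program segment at a time."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.program = None
        self.running = False

    def load(self, program):
        self.program = program
        self.running = False

    def start(self):
        self.running = True


class DMAC:
    """Copies program segments from external memory to the cores."""

    def __init__(self, external_memory, cores):
        self.external_memory = external_memory  # core id -> list of program segments
        self.cores = cores

    def configure(self, segment_index, log):
        for core in self.cores:
            core.load(self.external_memory[core.core_id][segment_index])
            log.append(f"DMAC -> core{core.core_id}: segment {segment_index}")
        log.append("DMAC -> S_S: configuration completion signal")
        return True


class SyncScheduler:
    """S_S: counts program update signals and drives the DMAC and the cores."""

    def __init__(self, dmac, cores, predetermined):
        self.dmac = dmac
        self.cores = cores
        self.predetermined = predetermined
        self.count = 0
        self.next_segment = 0
        self.log = []

    def on_program_update(self, core_id):
        self.count += 1
        self.log.append(f"core{core_id} -> S_S: program update signal")
        if self.count != self.predetermined:
            return
        self.count = 0  # assumption: the counter is cleared after each round
        self.log.append("S_S -> DMAC: configuration signal")
        if self.dmac.configure(self.next_segment, self.log):
            self.next_segment += 1
            for core in self.cores:
                core.start()
            self.log.append("S_S -> cores: synchronous operation signal")


cores = [Core(0), Core(1)]
memory = {0: ["a0", "a1"], 1: ["b0", "b1"]}  # hypothetical program segments
scheduler = SyncScheduler(DMAC(memory, cores), cores, predetermined=len(cores))
scheduler.on_program_update(0)
scheduler.on_program_update(1)
```

After the second program update signal, `scheduler.log` records the configuration signal, the per-core segment transfers, the configuration completion signal, and finally the synchronous operation signal, in that order.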
In one embodiment, the program includes operation instructions and a program update instruction. After the operation instructions are completed, the processing core executes the program update instruction and, once the program update instruction is completed, generates and sends a program update signal. Here, "after the operation instructions are completed" means either at the moment the operation instructions complete or at a later time.
In one embodiment, the at least two processing cores include a first processing core and a second processing core. The first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core is the input of the program executed by the second processing core.
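One way to picture two cores running different programs, with the first core's result feeding the second, is a two-stage pipeline. The sketch below is only an illustration: the two program functions are hypothetical, and the assumption that the second core consumes the first core's result from the previous synchronized round is not stated in the patent.

```python
def first_core_program(x):
    """Hypothetical program of the first processing core."""
    return x * 2


def second_core_program(y):
    """Hypothetical program of the second processing core."""
    return y + 1


def run_pipeline(inputs):
    """Run both cores round by round; core 2 uses core 1's previous result."""
    handoff = None  # result of the first core from the previous round
    outputs = []
    for x in list(inputs) + [None]:  # one extra round drains the pipeline
        # Both cores start together when the synchronous operation signal arrives.
        out2 = second_core_program(handoff) if handoff is not None else None
        out1 = first_core_program(x) if x is not None else None
        if out2 is not None:
            outputs.append(out2)
        handoff = out1
    return outputs


pipeline_result = run_pipeline([1, 2, 3])
```

Because both cores start on the same signal, the second core is busy with the previous result while the first core processes the next input, which is why neither core has to wait for the other within a round.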
According to another embodiment of the present invention, a computer storage medium is provided. A computer program is stored on the computer storage medium, and when the program is executed by a processor, the data processing method provided by the above embodiments is implemented.
According to another embodiment of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the data processing method provided by the above embodiments is implemented.
According to yet another aspect of the present invention, a computer program product is provided that includes computer instructions. When the computer instructions are executed by a computing device, the computing device can perform the data processing method provided by the above embodiments.
In the data processing method provided by the embodiments of the present invention, the synchronization scheduler instructs the direct memory access controller to read programs from the external storage unit and send each program to its corresponding processing core. On the one hand, the processing cores do not need to fetch from the external storage unit themselves, which avoids the delay incurred when multiple processing cores read data and thereby improves the computing power of the chip. On the other hand, the programs executed by the processing cores may be the same or different, and the synchronization scheduler responds to program updates from multiple processing cores, so tasks can be allocated flexibly, the computing power of each processing core is fully used, and the computing power of the chip is further improved.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present invention shall fall within the protection scope of the present invention. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims, or within equivalents of such scope and boundaries.

Claims (10)

  1. A data processing device, characterized by comprising:
    at least two processing cores;
    a synchronization scheduler, configured to generate and send a configuration signal in response to program update signals of the processing cores connected to the synchronization scheduler;
    a direct memory access controller, configured to read, based on the configuration signal, the program corresponding to each of the processing cores from an external storage unit and send it to the corresponding processing core.
  2. The chip according to claim 1, characterized in that
    the synchronization scheduler is further configured to send, in response to the program update signals, a synchronous operation signal to each of the processing cores connected to the synchronization scheduler, wherein the synchronous operation signal is sent to each of the processing cores after the synchronization scheduler has received a predetermined number of the program update signals, and the synchronous operation signal instructs the processing cores to start executing their respective programs at the same time.
  3. The data processing device according to claim 1 or 2, characterized in that
    the program includes a plurality of program segments.
  4. The data processing device according to claim 3, characterized in that
    the processing core is configured to send the program update signal to the synchronization scheduler after executing each of the program segments.
  5. The data processing device according to claim 3 or 4, characterized in that
    the direct access controller is configured to read the program segment corresponding to each of the processing cores from the external storage unit and send it to the corresponding processing core;
    the direct access controller is further configured to send a configuration completion signal to the synchronization scheduler after sending the program segments to the corresponding processing cores.
  6. The data processing device according to claim 5, characterized in that the synchronization scheduler being further configured to send a synchronous operation signal to each of the processing cores connected to the synchronization scheduler in response to the program update signal includes:
    the synchronization scheduler is further configured to send the synchronous operation signal to each of the processing cores connected to the synchronization scheduler in response to the program update signal and the configuration completion signal.
  7. The data processing device according to any one of claims 2-6, characterized in that the synchronization scheduler includes a counter; the counter is used to record the number of program update signals received;
    the synchronization scheduler is further configured to prepare to send the synchronous operation signal in response to the number of the program update signals being equal to the predetermined number.
  8. The data processing device according to any one of claims 2-7, characterized in that
    each of the processing cores executes the same number of program segments;
    the predetermined number is the number of the program segments, or the predetermined number is the product of the number of processing cores connected to the synchronization scheduler and the number of the program segments.
  9. The data processing device according to any one of claims 1-8, characterized in that the at least two processing cores include a first processing core and a second processing core;
    the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core is the input of the program executed by the second processing core.
  10. A data processing method, characterized by comprising:
    a processing core executing a program;
    a synchronization scheduler generating and sending a configuration signal in response to program update signals of the processing cores connected to the synchronization scheduler;
    a direct memory access controller reading, based on the configuration signal, the program corresponding to each of the processing cores from an external storage unit and sending it to the corresponding processing core.
PCT/CN2020/077804 2020-03-04 2020-03-04 Data processing apparatus and data processing method WO2021174446A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/077804 WO2021174446A1 (en) 2020-03-04 2020-03-04 Data processing apparatus and data processing method
CN202080096325.8A CN115151892A (en) 2020-03-04 2020-03-04 Data processing device and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/077804 WO2021174446A1 (en) 2020-03-04 2020-03-04 Data processing apparatus and data processing method

Publications (1)

Publication Number Publication Date
WO2021174446A1 true WO2021174446A1 (en) 2021-09-10

Family

ID=77613895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077804 WO2021174446A1 (en) 2020-03-04 2020-03-04 Data processing apparatus and data processing method

Country Status (2)

Country Link
CN (1) CN115151892A (en)
WO (1) WO2021174446A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020056030A1 (en) * 2000-11-08 2002-05-09 Kelly Kenneth C. Shared program memory for use in multicore DSP devices
CN101996087A (en) * 2010-12-02 2011-03-30 北京星河亮点通信软件有限责任公司 Dynamical loading system and method for multi-core processor array program
CN104978282A (en) * 2014-04-04 2015-10-14 上海芯豪微电子有限公司 Cache system and method
CN107810477A (en) * 2015-06-26 2018-03-16 微软技术许可有限责任公司 The reuse of the instruction of decoding
CN108027766A (en) * 2015-09-19 2018-05-11 微软技术许可有限责任公司 Prefetched instruction block

Also Published As

Publication number Publication date
CN115151892A (en) 2022-10-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20922560

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20922560

Country of ref document: EP

Kind code of ref document: A1