WO2021174446A1

WO2021174446A1 - Data processing apparatus and data processing method

Info

Publication number: WO2021174446A1
Application number: PCT/CN2020/077804
Authority: WO
Inventors: 王维伟; 罗飞
Original assignee: 北京希姆计算科技有限公司
Priority date: 2020-03-04
Filing date: 2020-03-04
Publication date: 2021-09-10
Also published as: CN115151892A

Abstract

A data processing apparatus and a data processing method. The data processing apparatus comprises: at least two processing cores; a synchronous scheduler, configured to generate and send a configuration signal in response to a program update signal of each processing core connected to the synchronous scheduler; and a direct memory access controller, configured to read a program corresponding to each processing core from an external storage unit based on the configuration signal and send the program to the corresponding processing core. On the one hand, the data processing apparatus can read the external storage unit without using a processing core, which avoids the data reading delay caused by a plurality of processing cores and improves the computing power of a chip. On the other hand, the programs executed by the processing cores of the data processing apparatus may be the same or different, and the synchronous scheduler responds to program updates of the plurality of processing cores, so that tasks can be assigned flexibly, the computing power of each processing core is fully utilized, and the computing power of the chip is further improved.

Description

Data processing device and data processing method

Background art

With the development of science and technology, human society is rapidly entering the era of intelligence. The important feature of the intelligent age is that people have more and more types of data, the amount of data they can obtain is larger and larger, and the requirements for the speed of data processing are getting higher and higher.

The chip is the cornerstone of data processing, and it fundamentally determines people's ability to process data. From the perspective of application fields, there are two main routes for chips. One is the general-purpose chip route, such as the central processing unit (CPU). General-purpose chips provide great flexibility, but their effective computing power when processing algorithms in specific fields is relatively low. The other is the dedicated chip route, such as the Tensor Processing Unit (TPU). Dedicated chips can deliver higher effective computing power in some specific fields, but in more general fields that are flexible and changeable, their processing capability is relatively poor, or they cannot handle the workload at all.

Due to the wide variety and huge amount of data in the intelligent era, the chip is required to have both extremely high flexibility, so that it can process algorithms from different fields that change rapidly, and extremely strong processing capability, so that it can quickly process extremely large and rapidly increasing quantities of data.

Summary of the invention

(1) Purpose of the invention

The object of the present invention is to provide a data processing device and a data processing method. The data processing device is provided with a synchronization scheduler and a direct memory access controller, and the synchronization scheduler instructs the direct memory access controller to read the program corresponding to each processing core from an external storage unit and send it to the corresponding processing core. On the one hand, the data processing device provided by the embodiment of the present invention does not require a processing core to fetch data from the external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, in the data processing device provided by the embodiment of the present invention, the programs executed by the processing cores can be the same or different, and the synchronization scheduler responds to the program updates of multiple processing cores, so that tasks can be assigned flexibly, the computing power of each processing core is brought into full play, and the computing power of the device is further enhanced.

(2) Technical solution

In order to solve the above-mentioned problems, the first aspect of the present invention provides a data processing device, which includes: at least two processing cores; a synchronization scheduler, used to generate and send a configuration signal in response to the program update signal of each processing core connected to the synchronization scheduler; and a direct memory access controller, used to read the program corresponding to each processing core from the external storage unit based on the configuration signal and send it to the corresponding processing core.

The data processing device provided by the embodiment of the present invention is provided with a synchronization scheduler and a direct memory access controller, and the synchronization scheduler instructs the direct memory access controller to read a program from an external storage unit and send it to the corresponding processing core. On the one hand, the data processing device provided by the embodiments of the present invention does not require processing cores to fetch data from an external storage unit, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, in the data processing device provided by the embodiment of the present invention, the programs executed by the processing cores can be the same or different, and the synchronization scheduler responds to the program updates of multiple processing cores, so that tasks can be distributed flexibly, the computing power of each processing core is brought into full play, and the computing power of the chip is further improved.

Further, the synchronization scheduler is further configured to send a synchronization operation signal to each processing core connected to the synchronization scheduler in response to the program update signal. The synchronization operation signal is sent to each processing core after the synchronization scheduler receives a predetermined number of program update signals, and is used to instruct the processing cores to start executing their respective programs at the same time.

Further, the program includes a plurality of program segments.

Further, the processing core is configured to send the program update signal to the synchronization scheduler after each program segment is executed.

Further, the direct access controller is used to read the program segments corresponding to each processing core from the external storage unit and send them to the processing cores corresponding to the program segments; the direct access controller is also used to send a configuration completion signal to the synchronization scheduler after sending the program segments to the corresponding processing cores.

Further, that the synchronization scheduler is further configured to send a synchronization operation signal to each processing core connected to the synchronization scheduler in response to the program update signal includes: the synchronization scheduler is also configured to send the synchronization operation signal to each processing core connected to the synchronization scheduler in response to the program update signal and the configuration completion signal.

Further, the program includes an operation instruction and a program update instruction; the processing core is configured to execute the program update instruction based on the completion of the operation instruction, and complete sending the program update signal based on the program update instruction.

Further, the synchronous scheduler includes a counter; the counter is used to record the number of program update signals received; the synchronous scheduler is also used to prepare to send the synchronous operation signal in response to the number of program update signals being equal to the predetermined number.

Further, that the synchronization scheduler is further configured to prepare to send the synchronization operation signal in response to the number of program update signals being equal to the predetermined number includes: the synchronization scheduler is further configured to send a configuration signal in response to the number of program update signals being equal to the predetermined number, and to send the synchronization operation signal after receiving the configuration completion signal sent by the direct access controller.

Further, that the synchronization scheduler receives a predetermined number of the program update signals includes: the synchronization scheduler receives a predetermined number of the program update signals sent by all the processing cores connected to the synchronization scheduler; or, that the synchronization scheduler receives the predetermined number of the program update signals includes: the synchronization scheduler receives the predetermined number of the program update signals sent by each of the processing cores connected to the synchronization scheduler.

Further, that the synchronous scheduler is configured to generate and send a configuration signal in response to the program update signal of each of the processing cores includes: the synchronous scheduler is configured to send the configuration signal after receiving the program update signal sent by each processing core connected to the synchronous scheduler.

Further, the at least two processing cores include a first processing core and a second processing core; the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core is the input of the program executed by the second processing core.

According to a second aspect of the present invention, there is provided a chip including one or more data processing devices provided in the first aspect.

According to a third aspect of the present invention, a card board is provided, which includes one or more chips provided in the second aspect.

According to a fourth aspect of the present invention, there is provided an electronic device including one or more card boards provided in the third aspect.

According to a fifth aspect of the present invention, there is provided a data processing method, the method comprising: a processing core executes a program; a synchronous scheduler generates and sends a configuration signal in response to the program update signal of each processing core connected to the synchronous scheduler; and a direct memory access controller, based on the configuration signal, reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.

According to a sixth aspect of the present invention, there is provided a computer storage medium having a computer program stored on the computer storage medium, and when the program is executed by a processor, the data processing method of the fifth aspect is implemented.

According to a seventh aspect of the present invention, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the data processing method of the fifth aspect when executing the program.

According to an eighth aspect of the present invention, a computer program product is provided, which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.

(3) Beneficial effects

The above technical solution of the present invention has the following beneficial technical effects:

Description of the drawings

FIG. 1 is a schematic diagram of the structure of a chip provided by the prior art;

Fig. 2 is a schematic structural diagram of a chip provided by another prior art;

Figure 3 is a schematic structural diagram of a data processing device according to the present invention;

Figure 4 is a schematic structural diagram of another data processing device according to the present invention;

Figure 5 is a schematic diagram of the structure of a neural network provided according to the present invention;

Fig. 6 is a schematic diagram of the operation of the data processing device provided by the present invention applied to the neural network;

Fig. 7 is a schematic diagram of scheduling processing cores by a synchronous scheduler according to the present invention;

Fig. 8 is a schematic flowchart of a data processing method provided according to the present invention.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings. It should be understood that these descriptions are only exemplary, and are not intended to limit the scope of the present invention. In addition, in the following description, descriptions of well-known structures and technologies are omitted to avoid unnecessarily obscuring the concept of the present invention.

In neural network computing, multi-core or many-core architecture chips are often used. Generally, the processing cores in a chip with a multi-core or many-core architecture have a certain ability to process data independently, and also have a relatively large internal storage space. The larger storage space is generally used to store its own programs, data, and weights.

How to enable many cores to exert their computing power efficiently is the key to determining the performance of the entire chip. The computing power of each core depends on many factors, such as task scheduling and distribution, chip architecture, core structure, and core circuit. Among them, task scheduling and allocation is a very critical factor. If task scheduling and allocation are reasonable, the effective computing power of each core can be fully utilized, otherwise the effective computing power of each core is low.

FIG. 1 is a schematic diagram of the structure of a chip provided by the prior art.

As shown in Figure 1, the chip includes a scheduler and multiple processing cores C1 to Cn. In the chip shown in Figure 1, the scheduler receives instructions sent from outside the chip, for example from an instruction source outside the chip, and then transmits them to each processing core according to a preset strategy (for example, in a preset order). Each processing core executes the same instruction but processes different data. For example, if the instruction is to compute a+b but a or b has different values on two processing cores, then the two processing cores process different data.
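As an illustration only, the Fig. 1 scheme can be sketched in Python as follows. The function name broadcast and the sample operands are hypothetical and do not appear in the source; the sketch simply shows every core applying the same instruction (here a+b) to its own data.

```python
# Hypothetical sketch of the Fig. 1 scheme: one instruction, different data per core.
def broadcast(instruction, per_core_operands):
    """Apply the same instruction to each core's own (a, b) operands."""
    return [instruction(a, b) for a, b in per_core_operands]


def add(a, b):
    return a + b


# Two cores receive the same "a + b" instruction but hold different a and b.
results = broadcast(add, [(1, 2), (10, 20)])
```

Because the scheduler only forwards one instruction stream, the cores cannot run different programs in this model, which is the inflexibility the description goes on to criticize.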

For the chip architecture shown in Figure 1, each processing core can have a relatively simple structure, such as a single instruction multiple data (SIMD) structure or a single instruction multiple thread (SIMT) structure.

Usually this method has the following disadvantages:

The scheduler can only passively receive instructions from the outside and allocate them to each processing core. Regardless of whether it is a SIMD structure or a SIMT structure, each processing core can only execute the same instructions, resulting in a chip with a single function and a lack of flexibility.

Fig. 2 is a schematic structural diagram of a chip provided by another prior art.

As shown in Figure 2, the chip includes a plurality of processing cores C1 to Cn and a storage unit (Memory). In the chip shown in Figure 2, each core can independently read instructions from the Memory (such as DDR) and perform operations. Usually each core has a complete control circuit, register set, and other circuits. This structure is more common in multi-core CPUs or ASICs.

Usually this method has the following disadvantages:

(1) Each processing core has high autonomy and can run instructions independently. However, due to the high autonomy of the processing core, it is difficult for multiple processing cores to cooperate with each other to efficiently complete a complete task.

(2) The circuit control in the chip is more complicated, and each core is almost a complete CPU. If the processing cores are to cooperate efficiently to complete a complete task, the circuit design is difficult, and the power consumption and area are large.

(3) Multiple processing cores may frequently access the instruction storage area, causing a decrease in storage access efficiency, which in turn affects the performance of the chip's computing power.

In order to solve the above-mentioned problems, the technical solution of the present invention is proposed.

The chip provided by an embodiment of the present application will be described in detail below. In the description of the present invention, it should be noted that the terms "first", "second", "third", and "fourth" are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance. In addition, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

Fig. 3 is a schematic structural diagram of a data processing device according to the first embodiment of the present invention.

As shown in FIG. 3, the data processing device includes at least two processing cores, a synchronous scheduler (Synchronizer and Scheduler, S_S), and a direct memory access controller (Direct Memory Access Controller, DMAC).

S_S is connected to at least two processing cores, and the at least two processing cores can be all of the processing cores in the data processing device, such as processing core C1 to processing core Cn. The DMAC is connected to the at least two processing cores, to S_S, and to the external storage unit (Memory).

S_S is used to generate and send a configuration signal in response to the program update signal of each processing core connected to the S_S.

The DMAC is used to read the program corresponding to each processing core from the external Memory based on the configuration signal and send it to the corresponding processing core.

In one embodiment, S_S is also used to send a synchronization operation signal to the processing cores connected to S_S in response to the program update signal. The synchronization operation signal is sent to each processing core after S_S receives a predetermined number of program update signals, and is used to instruct the processing cores to start executing their respective programs at the same time.

Further, S_S is also used to first send a configuration signal in response to the number of program update signals being equal to the predetermined number, and then send the synchronization operation signal after receiving the configuration completion signal sent by the DMAC.

In one embodiment, the program includes a plurality of program segments.

The processing core is configured to send the program update signal to the S_S after each program segment is executed.

Preferably, the number of program segments executed by each processing core is the same.

Optionally, S_S includes a first counter, and the first counter is used to record the number of program update signals received.

S_S is also used to prepare to send the synchronization operation signal in response to the number of program update signals recorded by the first counter being equal to the predetermined number.

In the embodiment of the present invention, there may be, for example, one or more first counters.

Optionally, when there is one first counter, the predetermined number may be, for example, the total number of program update signals that S_S receives from all cores, that is, the product of the number of processing cores connected to S_S and the number of program segments.

Optionally, the preset number for the first counter can be set to a single value or to multiple values.

For example, the data processing device includes 2 processing cores, and the program segments executed by each processing core are 4 segments.

When the preset number for the first counter is set to a single value, it is set to 8. That is, when the first counter records that 8 program update signals have been received, S_S sends a configuration signal to the DMAC. When S_S receives the configuration completion signal returned by the DMAC, S_S sends a synchronization signal to each core. At this time, the first counter is cleared and counting restarts.

When the preset number for the first counter is set to multiple values, the preset numbers are 8, 16, 24, and so on. These preset numbers are cumulative counts of the program update signals received by S_S. When the first counter records that the cumulative number of program update signals received is 8, S_S sends a configuration signal to the DMAC. When the DMAC returns a configuration completion signal, S_S sends a synchronization signal to each core; the counter is not cleared at this time and continues counting. When the cumulative number recorded by the counter reaches 16, S_S again sends a configuration signal to the DMAC and, on receiving the configuration completion signal returned by the DMAC, sends a synchronization signal to each core. The first counter continues to accumulate, and each time it reaches the next preset number, S_S repeats the above steps.
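The two single-counter modes described above can be compared with a small Python sketch. It is a simplified model under stated assumptions, not the patented circuit: the class name FirstCounter and its methods are hypothetical, and in the clearing mode the counter is reset at the moment the threshold is reached rather than when the synchronization signal is sent.

```python
class FirstCounter:
    """Hypothetical model of a single first counter inside S_S."""

    def __init__(self, threshold, cumulative=False):
        self.threshold = threshold      # e.g. 2 cores * 4 segments = 8
        self.cumulative = cumulative    # False: clear after each trigger
        self.count = 0
        self.next_target = threshold

    def on_program_update(self):
        """Record one program update signal. Return True when S_S should
        send a configuration signal to the DMAC."""
        self.count += 1
        if self.count < self.next_target:
            return False
        if self.cumulative:
            self.next_target += self.threshold   # targets 8, 16, 24, ...
        else:
            self.count = 0                       # cleared, counting restarts
        return True


clearing = FirstCounter(8)
cumulative = FirstCounter(8, cumulative=True)
# Feed 24 program update signals to each counter and note when it triggers.
clearing_hits = [n for n in range(1, 25) if clearing.on_program_update()]
cumulative_hits = [n for n in range(1, 25) if cumulative.on_program_update()]
```

Both modes trigger on the same signals; they differ only in whether the stored count is reset or keeps growing.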

Optionally, when the number of first counters is multiple, the predetermined number may be the number of program segments.

In this case, for example, each processing core connected to S_S corresponds to one first counter, which records the number of program update signals sent to S_S by that processing core. After every processing core has sent the predetermined number of program update signals, that is, when every first counter has recorded the preset number of program update signals, S_S sends a configuration signal to the DMAC. After receiving the configuration completion signal sent by the DMAC, S_S sends a synchronization signal to each core.
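A per-core counter arrangement could be modelled as in the following Python sketch. The class PerCoreCounters is hypothetical, and resetting all counters to zero after a trigger is an assumption made for the illustration; the source does not say how the counters are restarted in this variant.

```python
class PerCoreCounters:
    """Hypothetical model: one first counter per processing core."""

    def __init__(self, core_names, predetermined):
        self.predetermined = predetermined      # e.g. number of program segments
        self.counts = {name: 0 for name in core_names}

    def on_program_update(self, core_name):
        """Count one signal from core_name. Return True when every counter
        has reached the predetermined number."""
        self.counts[core_name] += 1
        if any(c < self.predetermined for c in self.counts.values()):
            return False
        # Assumption: counters restart from zero for the next round.
        self.counts = {name: 0 for name in self.counts}
        return True


counters = PerCoreCounters(["C1", "C2"], predetermined=4)
order = ["C1"] * 4 + ["C2"] * 4     # C1 finishes its four segments first
fired = [counters.on_program_update(name) for name in order]
```

In this run the configuration signal would be sent only on the eighth signal, when the slower core C2 has also reported four completed segments.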

The above ways of setting the counter are only illustrative. Any software or hardware structure that enables S_S to configure the DMAC and to synchronize the operation of the processing cores under given conditions can be used in the embodiments of the present invention, and no further explanation is given here.

In a preferred embodiment, that S_S is used to generate and send a configuration signal in response to the program update signal of each processing core includes: S_S is used to send the configuration signal after receiving the program update signal sent by each processing core connected to S_S. For example, if there are 5 processing cores connected to S_S, S_S sends the configuration signal after receiving the program update signals sent by all 5 processing cores.

In a preferred embodiment, the direct access controller is used to read the program segment from the external Memory and send the program segment to the processing core corresponding to the program segment. The external Memory is, for example, a storage unit on the host.

The direct access controller is further configured to send a configuration completion signal to the S_S after sending the program segment to the processing core corresponding to the program segment.

In an embodiment, S_S is further configured to send a synchronization operation signal to the processing core connected to S_S in response to the program update signal, including:

S_S is also used to send a synchronous operation signal to each processing core connected to S_S in response to the program update signal and the configuration completion signal.

Specifically, S_S sends the configuration signal to the DMAC in response to the number of program update signals recorded by the counter being equal to the predetermined number. The DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core and then sends a configuration completion signal to S_S. After receiving the configuration completion signal, S_S sends a synchronization operation signal to each processing core connected to S_S.
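The order of signals in this handshake can be sketched in Python as below. This is a minimal software model under stated assumptions: the classes Core, DMAC and SyncScheduler, their methods, and the string labels in the log are all hypothetical, and the scheduler here uses the simple rule of waiting for one program update signal from every connected core.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.pram = None          # program segment currently held in PRAM
        self.running = False


class DMAC:
    def __init__(self, memory):
        self.memory = memory      # external Memory: {core name: [segments]}

    def configure(self, cores, segment_index):
        """Copy the indicated segment into each core's PRAM, then report
        completion (stands in for the configuration completion signal)."""
        for core in cores:
            core.pram = self.memory[core.name][segment_index]
        return "config_done"


class SyncScheduler:
    def __init__(self, cores, dmac):
        self.cores = cores
        self.dmac = dmac
        self.pending = set()      # cores whose update signal has arrived
        self.segment_index = 0
        self.log = []

    def program_update(self, core):
        """Handle one program update signal; return True if Sync was sent."""
        core.running = False
        self.pending.add(core.name)
        if len(self.pending) < len(self.cores):
            return False                          # keep waiting
        self.pending.clear()
        self.log.append("config")                 # configuration signal
        self.log.append(self.dmac.configure(self.cores, self.segment_index))
        for c in self.cores:
            c.running = True                      # all cores start together
        self.log.append("sync")                   # synchronization signal
        self.segment_index += 1
        return True


memory = {"C1": ["c1-seg0", "c1-seg1"], "C2": ["c2-seg0", "c2-seg1"]}
cores = [Core("C1"), Core("C2")]
s_s = SyncScheduler(cores, DMAC(memory))
first = s_s.program_update(cores[0])    # only C1 has reported so far
second = s_s.program_update(cores[1])   # C2 reports, handshake completes
```

The log records the configuration signal first, then the completion report, then the synchronization signal, which matches the ordering stated in the paragraph above.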

Here, the "program indicated by the configuration signal" may be the same as or different from the program that has just been executed by the processing core.

In one embodiment, the program executed by the processing core includes an operation instruction and a program update instruction; the processing core is used to execute the program update instruction after completing the operation instruction, and generate and send a program update signal based on the program update instruction.

When the program executed by the processing core includes multiple program segments, each program segment includes operation instructions and a program update instruction.
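To make the segment layout concrete, the following Python sketch models a program segment as a list of operation instructions followed by a program update instruction. The instruction encoding (tuples such as ("add", 3)), the PUPDATE marker and the run_segment function are hypothetical illustrations; the source does not define an instruction format.

```python
PUPDATE = "PUpdate"   # hypothetical marker for the program update instruction


def run_segment(segment, signals, core_name):
    """Execute the operation instructions of a segment, then raise a
    program update signal when the program update instruction is reached."""
    acc = 0
    for instr in segment:
        if instr == PUPDATE:
            signals.append(("PU_s", core_name))   # signal sent to S_S
            break
        op, value = instr
        if op == "add":
            acc += value
        elif op == "mul":
            acc *= value
    return acc


signals = []
segment = [("add", 3), ("mul", 4), PUPDATE]
result = run_segment(segment, signals, "C1")
```

The signal is only appended after every operation instruction before the marker has run, mirroring the requirement that the update instruction executes after the operation instructions complete.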

In one embodiment, the processing core is provided with a storage module, PRAM, which is used to receive and store the program sent by the DMAC, and the program executed by the processing core is read from its own PRAM. In this embodiment, each processing core reads the instructions contained in the program from its own PRAM rather than from the external memory, which avoids the need to design complex cache circuits. Compared with the prior art, instructions do not have to be read from the external memory, which reduces delay and greatly improves the execution efficiency of instructions.

Preferably, the storage space of the PRAM is greater than or equal to 16KB.

Optionally, the programs stored in any two processing cores are the same or different.

In a preferred embodiment, the at least two processing cores in the data processing device include a first processing core and a second processing core; the first processing core and the second processing core execute different programs, and the calculation result of the program executed by the first processing core may be the input of the program executed by the second processing core. Because the calculation result of the first processing core is used as the input of the second processing core, the chip provided in this embodiment can be used for neural network computation. Through S_S, the processing cores can run their stored programs at the same time, which enables them to exchange data in an orderly manner and cooperate to complete a whole task efficiently.

The data processing device provided by the embodiment of the present invention allocates programs to each processing core through S_S and the DMAC. The programs executed by each processing core and the data exchanges between them are set before the programs run. The micro-control unit (MCU) at the top level of the chip, or the host of the system, only needs to configure the counter of S_S to implement the established strategy. The MCU or the host can also change the configuration of the counter of S_S and the programs stored in the Memory to change the program executed by each core and the distribution and scheduling of the programs. This makes it convenient to modify the allocation and scheduling of tasks among the processing cores in the chip, and allows the computing power of the processing cores to be used efficiently.

According to another aspect of the present invention, a chip is provided, which includes one or more data processing devices provided in the above aspects.

For example, when the chip includes multiple data processing devices, the chip may include multiple S_Ss, and each S_S is connected to multiple processing cores and a direct access controller.

According to another embodiment of the present invention, a card board is provided, which includes one or more chips provided in the foregoing embodiments.

According to another embodiment of the present invention, there is provided an electronic device, including one or more of the card boards provided in the foregoing embodiments.

Fig. 4 is a schematic structural diagram of a data processing device provided by the present invention.

As shown in Figure 4, the device includes a first processing core C1, a second processing core C2, S_S, and DMAC. Each processing core is provided with PRAM.

Fig. 5 is a schematic diagram of the structure of the neural network provided by the present invention.

This embodiment takes a two-layer neural network as an example. As shown in Figure 5, the neural network has a two-layer structure. The control program of each layer of the neural network is 128KB, and the calculation amount of each layer is the same. The task allocation strategy is to assign the layers to the two processing cores for pipelined calculation, that is, C1 and C2 are each responsible for the calculation of one layer of the network, and each runs the program of the corresponding layer. For example, C1 calculates the first layer, Layer1, and C2 calculates the second layer, Layer2. The input data is sent to C1; C1 performs the first-layer processing on the input data and sends the result to C2; C2 takes the first-layer result as its input and performs the second-layer processing to obtain the final result. In other words, the data flows through Layer1 and Layer2 in turn to realize the operation of the entire neural network and produce the output.

Fig. 6 is a schematic diagram of the operation of applying the chip provided by the present invention to a neural network.

As shown in Figure 6, Input1-1 represents the input of the entire neural network, which is used as the input of the first layer, Layer1, at the beginning of time period t1. Input2-1 represents the calculation result of Layer1 in time period t1, which is used as the input of the second layer, Layer2, in time period t2. Output1 represents the calculation result of Layer2 after time period t2, which is also the output of the neural network. When C1 performs the pipelined processing of Layer1, its input is the input data of the neural network and its output is used as the input of C2; the output of C2 is the final output result.
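The timing in Fig. 6 can be illustrated with a short Python sketch of a two-stage pipeline. The function run_pipeline and the placeholder layer functions are hypothetical, and feeding several inputs back to back is an assumption added for illustration; the source only walks through a single input, Input1-1.

```python
def run_pipeline(inputs, layer1, layer2):
    """Simulate C1 (Layer1) and C2 (Layer2) working in consecutive time
    periods. Returns (timeline, outputs), where each timeline entry is
    (period, c1_busy, c2_busy)."""
    outputs, timeline = [], []
    in_flight = None                      # Layer1 result waiting for C2
    for t in range(len(inputs) + 1):
        c2_work = in_flight               # produced by C1 in the previous period
        c1_work = inputs[t] if t < len(inputs) else None
        if c2_work is not None:
            outputs.append(layer2(c2_work))
        in_flight = layer1(c1_work) if c1_work is not None else None
        timeline.append((t + 1, c1_work is not None, c2_work is not None))
    return timeline, outputs


# Placeholder layer computations, chosen only so the result is easy to follow.
timeline, outputs = run_pipeline(
    [1, 2],
    layer1=lambda x: x + 1,
    layer2=lambda x: x * 10,
)
```

In period t1 only C1 is busy, in t2 both cores work at once, and in the last period only C2 is busy, which is the overlap that the synchronization signal is meant to keep orderly.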

Assume that the control program of each layer is 128KB. Since the PRAM of each processing core is only 32KB, the control program of each layer needs to be scheduled according to a certain strategy to update the program of each core. For example, the program of each core can be divided evenly into four program segments of 32KB each, which are transmitted to the corresponding core one at a time.
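The segment count follows from the sizes given: 128KB / 32KB = 4. A minimal Python sketch of the split is shown below; split_program is a hypothetical helper, and the all-zero byte string merely stands in for a real layer program.

```python
KB = 1024


def split_program(program, pram_size):
    """Cut a program image into consecutive pieces that each fit in PRAM."""
    return [program[i:i + pram_size] for i in range(0, len(program), pram_size)]


layer_program = bytes(128 * KB)                 # placeholder 128KB program
segments = split_program(layer_program, 32 * KB)
```

Each of the four resulting segments is exactly 32KB, so one segment fills the PRAM and the next can only be loaded after the current one has finished executing.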

The calculation process of the processing core in the chip when running the neural network is as follows:

At the beginning, C1 receives the first program segment of the first program in t1 and executes it. After executing the segment, C1 sends the program update signal PU_S1 to S_S. Because the output of C1 is the input of C2, the device can be set so that, while C1 runs the program segments of the first program, S_S sends a configuration signal to the DMAC each time it receives the program update signal sent by C1 alone. After C1 finishes running the last program segment of the first program, S_S sends a configuration signal to the DMAC, and the DMAC reads the first segment of the second program to be run by C1 and the first segment of the first program to be run by C2 from the external memory and sends them to C1 and C2 respectively. When the transmission is completed, the DMAC sends a configuration completion signal to S_S, and S_S sends a synchronous operation signal Sync to C1 and C2, instructing C1 and C2 to start running the received program segments at the same time.

When C1 or C2 finishes running the operation instructions of a program segment, it executes the program update instruction PUpdate. PUpdate generates the program update signal PU_s and sends it to S_S; PU_s indicates that the program in the PRAM needs to be updated. After S_S receives the signal, it judges whether the PU_s signals sent by all processing cores connected to S_S have been received. If not, S_S stays in a waiting state until all processing cores connected to S_S have sent PU_s. Once S_S has received the PU_s sent by all the processing cores connected to it, it sends a configuration signal to the DMAC. The DMAC fetches the program segment indicated by the configuration signal from the external Memory and sends it to the processing core corresponding to that program segment, so that the PRAM is updated.
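The waiting behaviour of S_S described above resembles a barrier over the connected cores. The following is a minimal software sketch of that idea, not the hardware design: the class SyncScheduler, the method on_program_update, and the attribute names are hypothetical and chosen only for illustration.

```python
class SyncScheduler:
    """Toy model of S_S: waits for PU_s from every connected core."""

    def __init__(self, core_ids):
        self.core_ids = frozenset(core_ids)
        self.pending = set()          # cores whose PU_s has arrived this round
        self.config_signals_sent = 0  # how many times the DMAC was configured

    def on_program_update(self, core_id) -> bool:
        """Record one PU_s; return True if a configuration signal is sent."""
        if core_id not in self.core_ids:
            raise ValueError(f"unknown core: {core_id}")
        self.pending.add(core_id)
        if self.pending != self.core_ids:
            return False              # keep waiting for the remaining cores
        self.pending.clear()          # start a new round
        self.config_signals_sent += 1
        return True


s_s = SyncScheduler(["C1", "C2"])
first = s_s.on_program_update("C1")   # only C1 has reported: S_S waits
second = s_s.on_program_update("C2")  # all cores reported: configure DMAC
print(first, second, s_s.config_signals_sent)
```

In this sketch the configuration signal is reduced to a counter increment; what the signal actually carries is left to the embodiment.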

After the program update signals sent by all processing cores connected to S_S have been received for the fourth time, all program segments of each layer on each core have been executed, and S_S first sends a configuration signal to the DMAC. According to the configuration signal, the DMAC reads the first program segment of the program indicated by the configuration signal from the external memory and sends it to the corresponding processing core. When the transmission is completed, the DMAC sends a configuration completion signal.

When S_S receives the configuration completion signal sent by the DMAC, it generates and sends a synchronization operation signal Sync, indicating that the programs of the cores need to be executed at the same time. After receiving the synchronization operation signal, each core can start to exchange data; that is, C1 sends its calculation result to C2.

It should be noted that the program indicated by the configuration signal may be the same as or different from the program that has just been executed.

It should also be noted that, at the initial stage, it is also possible to have C1 receive the first program segment of its first program at time t1 and C2 also receive the first program segment of its first program at time t1, so that both C1 and C2 execute a first program segment. The input of the first program segment of the first program run by C2 can be set to a preset value, so that C1 and C2 execute their first programs at the same time.

Fig. 7 is a schematic diagram of scheduling processing cores by a synchronous scheduler provided by the present invention.

As shown in Figure 7, C1 and C2 receive the Sync signal sent by S_S and run the program stored in the PRAM from the beginning. After C1 executes the operation instructions in the first block, it executes the last instruction in the first block, namely the update instruction PUpdate, which means that the execution of this block has been completed. PUpdate generates the program update signal PU_s1 and sends it to S_S, and then C1 starts to wait.

S_S receives PU_s1 and finds that it has not yet received the update signals of all processing cores connected to S_S (PU_s2 has not arrived in this round), so it continues to wait.

After C2 has executed the operation instruction in its first block, it will execute the last instruction in the first block, the update instruction PUpdate, which means that the block has been executed. PUpdate generates the program update signal PU_s2 and sends it to S_S, and then C2 starts to wait.

S_S receives PU_s2, and the first counter finds that it has received the program update signal sent by each processing core connected to S_S, and configures the DMAC to start the program update of C1 and C2.

DMAC reads the new program segments of C1 and C2 from Memory, and then sends them to the PRAM of C1 and C2 respectively, until the new program segment is updated.

When the first counter of S_S records that the number of program update signals received is 8, the two processing cores have each completed their four program segments, that is, the entire program of each of the two processing cores has been executed. The first counter of S_S is reset and starts recording the update count from the beginning. S_S first configures the DMAC to start the program update of C1 and C2, that is, to load the first program segment of the next program. When the DMAC has sent the program segments, it sends the configuration completion signal to S_S; S_S then generates a synchronous operation signal Sync and sends it to each core, instructing the cores to start working at the same time. At this time, the cores can transmit data to each other.

It should be noted that resetting the first counter, configuring the DMAC, and sending the synchronous operation signal need not be performed in any particular order.
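The Figure 7 sequence for two cores and four program segments can be traced with a small simulation. This is a sketch under assumptions: the function run_one_program and the event strings are hypothetical, both cores are assumed to report in the order C1, C2 in every round, and the reset, configuration, and Sync steps are emitted in one fixed order even though the text allows other orders.

```python
NUM_CORES = 2             # C1 and C2, as in the example
SEGMENTS_PER_PROGRAM = 4  # four program segments per program


def run_one_program(num_cores, segments):
    """Return the event trace and final counter value for one full program."""
    events = []
    counter = 0  # models the "first counter" of S_S
    for seg in range(1, segments + 1):
        for core in range(1, num_cores + 1):
            counter += 1
            events.append(f"PU_s{core}")
        # At this point every core has reported for the current segment.
        if counter == num_cores * segments:
            counter = 0  # whole program finished: reset the counter
            events.append("configure DMAC: first segment of next program")
            events.append("Sync")
        else:
            events.append(f"configure DMAC: segment {seg + 1}")
    return events, counter


events, counter = run_one_program(NUM_CORES, SEGMENTS_PER_PROGRAM)
for event in events:
    print(event)
```

The trace contains eight program update signals, three intermediate segment updates, and a single Sync once the counter reaches 8.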

The above takes a 2-layer neural network as an example, in which the calculation result of C1 serves as the input data of C2. Of course, the data processing device provided by the embodiment of the present invention can be used with any neural network, and there may also be no data link between C1 and C2.

Fig. 8 is a schematic flow chart of the data processing method provided by the present invention.

As shown in Figure 8, the data processing method includes:

Step S101: the processing cores execute programs;

Step S102: the synchronous scheduler generates and sends a configuration signal in response to the program update signal of each processing core connected to the synchronous scheduler;

Step S103: Based on the configuration signal, the direct storage access controller reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.

In an embodiment, the synchronization scheduler also sends a synchronization operation signal to each processing core connected to it in response to the program update signal; the synchronization operation signal is used to instruct the processing cores connected to the synchronization scheduler to start executing their respective programs at the same time.

Specifically, S_S also sends a synchronization operation signal to each processing core connected to S_S in response to the program update signal, including: when S_S receives a predetermined number of the program update signals, sending to each processing core connected to S_S Synchronous operation signal.

In one embodiment, the program executed by the processing core includes a plurality of program segments.

After executing each program segment, the processing core sends the program update signal to the synchronous scheduler.

Preferably, the first counter of S_S records the number of program update signals received.

In response to the number of program update signals recorded by the first counter being equal to the predetermined number, S_S prepares to send the synchronous operation signal.

Optionally, one or more preset numbers can be set for the first counter. For example, suppose the data processing device includes two processing cores and each processing core executes 4 program segments.

When a single preset number is set for the first counter, the preset number is 8; that is, when the first counter records that 8 program update signals have been received, S_S sends a configuration signal to the DMAC. After S_S receives the configuration completion signal returned by the DMAC, S_S sends a synchronization signal to each core. The first counter is then cleared and counting restarts.

When multiple preset numbers are set for the first counter, the preset numbers are 8, 16, 24, and so on. These preset numbers are cumulative counts of the program update signals received by S_S. When the first counter records that the cumulative number of program update signals received is 8, S_S sends a configuration signal to the DMAC. When the DMAC returns a configuration completion signal, S_S sends a synchronization signal to each core; the counter is not cleared at this time and continues counting. When the cumulative number of program update signals recorded by the counter reaches 16, S_S sends a configuration signal to the DMAC, and on receiving the configuration completion signal returned by the DMAC, S_S sends a synchronization signal to each core. The first counter continues to accumulate, and each time it reaches the next preset number, S_S repeats the above steps.
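The two counter configurations can be compared with a short sketch. The function sync_points and its parameters are hypothetical names used only for illustration; the sketch assumes two cores with four segments each, so the step between preset numbers is 8.

```python
def sync_points(total_signals, threshold, cumulative):
    """Return (signal indices that trigger config + Sync, final counter value).

    cumulative=False models a single preset number with a counter that is
    cleared each time; cumulative=True models preset numbers 8, 16, 24, ...
    with a counter that is never cleared.
    """
    points = []
    counter = 0
    next_target = threshold
    for n in range(1, total_signals + 1):  # n-th program update signal
        counter += 1
        if counter == next_target:
            points.append(n)
            if cumulative:
                next_target += threshold   # move on to the next preset number
            else:
                counter = 0                # clear and restart counting
    return points, counter


reset_points, reset_final = sync_points(24, 8, cumulative=False)
cumul_points, cumul_final = sync_points(24, 8, cumulative=True)
print(reset_points, reset_final)
print(cumul_points, cumul_final)
```

Both configurations trigger at the same signals; they differ only in whether the counter is cleared, which shows in the final counter value.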

Optionally, the step in which S_S generates and sends a configuration signal in response to the program update signal of each processing core includes: S_S sends the configuration signal after receiving the program update signals sent by all processing cores connected to S_S. For example, if there are five processing cores connected to the synchronous scheduler, the synchronous scheduler sends the configuration signal after receiving the program update signals sent by all five processing cores.

Optionally, the DMAC reads the program segment corresponding to each processing core from the external Memory and sends the program segment to the corresponding processing core. After sending the program segment to the corresponding processing core, the DMAC sends the configuration completion signal to S_S.

In one embodiment, the step in which S_S sends a synchronization operation signal to each processing core connected to S_S in response to the program update signal includes: S_S sends the synchronization operation signal to each processing core connected to S_S in response to both the program update signal and the configuration completion signal.

Specifically, in response to the number of program update signals recorded by the counter being equal to the predetermined number, S_S first sends the configuration signal to the DMAC. The DMAC sends the program or program segment indicated by the configuration signal to the corresponding processing core, and then sends the configuration completion signal to S_S. After receiving the configuration completion signal, S_S sends a synchronization operation signal to each processing core connected to S_S.
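The ordering of the configuration signal, the configuration completion signal, and the synchronization operation signal can be sketched in software as follows. The classes Core, DMAC, and SyncScheduler, the dictionary standing in for the external Memory, and the log strings are all hypothetical; the configuration completion signal is modeled simply as a return value.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.pram = None      # program segment currently loaded
        self.running = False


class DMAC:
    def __init__(self, memory, cores, log):
        self.memory = memory  # stands in for the external Memory
        self.cores = cores
        self.log = log

    def configure(self, assignments):
        """Copy each assigned segment into its core's PRAM, then report."""
        for core_name, segment_key in assignments.items():
            self.cores[core_name].pram = self.memory[segment_key]
            self.log.append(f"load {segment_key} -> {core_name}")
        self.log.append("config_done")
        return True           # models the configuration completion signal


class SyncScheduler:
    def __init__(self, dmac, cores, predetermined, log):
        self.dmac = dmac
        self.cores = cores
        self.predetermined = predetermined
        self.log = log
        self.counter = 0

    def on_program_update(self, next_assignments):
        """Count one PU_s; at the predetermined number, configure then Sync."""
        self.counter += 1
        if self.counter != self.predetermined:
            return False
        self.counter = 0
        self.log.append("config")
        if self.dmac.configure(next_assignments):  # wait for completion
            for core in self.cores.values():
                core.running = True
            self.log.append("Sync")
        return True


log = []
cores = {"C1": Core("C1"), "C2": Core("C2")}
memory = {"C1/seg1": b"\x01", "C2/seg1": b"\x02"}  # placeholder contents
s_s = SyncScheduler(DMAC(memory, cores, log), cores, predetermined=2, log=log)
plan = {"C1": "C1/seg1", "C2": "C2/seg1"}
s_s.on_program_update(plan)  # first PU_s: nothing happens yet
s_s.on_program_update(plan)  # second PU_s: config, loads, completion, Sync
print(log)
```

The log makes the ordering explicit: Sync is appended only after the completion entry, so no core starts on a stale PRAM.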

In one embodiment, the program includes an operation instruction and a program update instruction; after the operation instruction is completed, the processing core executes the program update instruction, and based on the completion of the program update instruction, generates and sends a program update signal. Here, "after the operation instruction is completed" covers both the moment at which the operation instruction completes and any time after it has completed.

In one embodiment, the at least two processing cores include a first processing core and a second processing core; the programs executed by the first processing core and the second processing core are different, and the calculation result of the program executed by the first processing core is the input to the program executed by the second processing core.
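The producer and consumer relationship between the two cores can be illustrated with a toy two-stage pipeline. The functions layer1 and layer2 are arbitrary placeholder computations, and run_pipeline is a hypothetical name; none of them reflects the actual neural network programs.

```python
def layer1(x):
    """Placeholder for the program run by the first processing core (C1)."""
    return x * 2


def layer2(x):
    """Placeholder for the program run by the second processing core (C2)."""
    return x + 1


def run_pipeline(inputs):
    """Each loop iteration models one time period between two Sync points."""
    outputs = []
    c1_to_c2 = None  # result handed from C1 to C2 at the Sync point
    for t in range(len(inputs) + 1):
        # C2 consumes what C1 produced in the previous time period.
        if c1_to_c2 is not None:
            outputs.append(layer2(c1_to_c2))
        # C1 works on the next network input, if any remain.
        c1_to_c2 = layer1(inputs[t]) if t < len(inputs) else None
    return outputs


results = run_pipeline([1, 2, 3])
print(results)
```

After the first time period both cores are busy in every period, C1 on a new input and C2 on the previous result of C1.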

According to another embodiment of the present invention, a computer storage medium is provided, and a computer program is stored on the computer storage medium, and when the program is executed by a processor, the data processing method provided in the above embodiment is implemented.

According to another embodiment of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the program, the data processing method provided by the foregoing embodiment is implemented.

According to another aspect of the present invention, a computer program product is provided, which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data processing method provided in the above-mentioned embodiments.

In the data processing method provided by the embodiment of the present invention, the direct storage access controller is instructed by the synchronization scheduler to read the program from the external storage unit and send it to the processing core corresponding to the program. On the one hand, the processing cores do not need to fetch data from the external storage unit themselves, which avoids the delay caused by multiple processing cores reading data and improves the computing power of the chip. On the other hand, the programs executed by the processing cores can be the same or different, and because the synchronization scheduler responds to the program updates of multiple processing cores, tasks can be allocated flexibly, the computing power of each processing core can be fully used, and the computing power of the chip is further enhanced.

It should be understood that the above-mentioned specific embodiments of the present invention are only used to exemplarily illustrate or explain the principle of the present invention, and do not constitute a limitation to the present invention. Therefore, any modifications, equivalent substitutions, improvements, etc. made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. In addition, the appended claims of the present invention are intended to cover all changes and modifications that fall within the scope and boundary of the appended claims, or equivalent forms of such scope and boundary.

Claims

A data processing device, characterized in that it comprises:

At least 2 processing cores;

A synchronous scheduler, configured to generate and send configuration signals in response to program update signals of each of the processing cores connected to the synchronous scheduler;

The direct storage access controller is configured to read the program corresponding to each processing core from the external storage unit based on the configuration signal and send it to the corresponding processing core.
The data processing device according to claim 1, wherein:

The synchronization scheduler is further configured to send, in response to the program update signal, a synchronization operation signal to each of the processing cores connected to the synchronization scheduler, where the synchronization operation signal is sent to each of the processing cores after the synchronization scheduler has received a predetermined number of the program update signals, and the synchronization operation signal is used to instruct each of the processing cores to start executing their respective programs at the same time.
The data processing device according to claim 1 or 2, wherein:

The program includes a plurality of program segments.
The data processing device according to claim 3, wherein:

The processing core is configured to send the program update signal to the synchronization scheduler after each program segment is executed.
The data processing device according to claim 3 or 4, wherein:

The direct storage access controller is configured to read the program segments corresponding to each of the processing cores from the external storage unit, and send them to the corresponding processing cores;

The direct storage access controller is further configured to send a configuration completion signal to the synchronization scheduler after sending the program segment to the corresponding processing core.
The data processing device according to claim 5, wherein the synchronization scheduler being further configured to send a synchronization operation signal to each of the processing cores connected to the synchronization scheduler in response to the program update signal comprises:

The synchronization scheduler is further configured to send the synchronization operation signal to each of the processing cores connected to the synchronization scheduler in response to the program update signal and the configuration completion signal.
The data processing device according to any one of claims 2-6, wherein the synchronization scheduler comprises a counter; the counter is used to record the number of program update signals received;

The synchronization scheduler is further configured to prepare to send the synchronization operation signal in response to the number of the program update signals being equal to the predetermined number.
The data processing device according to any one of claims 2-7, wherein:

Each of the processing cores executes the same number of program segments;

The predetermined number is the number of the program segments, or the predetermined number is the product of the number of processing cores connected to the synchronization scheduler and the number of the program segments.
The data processing device according to any one of claims 1-8, wherein the at least two processing cores comprise a first processing core and a second processing core;

The programs executed by the first processing core and the second processing core are different, and the calculation result of the program executed by the first processing core is an input of the program executed by the second processing core.
A data processing method, characterized in that it comprises:

The processing cores execute programs;

The synchronous scheduler generates and sends a configuration signal in response to the program update signal of each of the processing cores connected to the synchronous scheduler;

Based on the configuration signal, the direct storage access controller reads the program corresponding to each processing core from the external storage unit and sends it to the corresponding processing core.