WO2021218623A1 - Data processing device, chip and data processing method - Google Patents

Data processing device, chip and data processing method

Info

Publication number
WO2021218623A1
WO2021218623A1 (application PCT/CN2021/086850; CN2021086850W)
Authority
WO
WIPO (PCT)
Prior art keywords
program
processing
processing core
core
cores
Prior art date
Application number
PCT/CN2021/086850
Other languages
English (en)
French (fr)
Inventor
罗飞 (Luo Fei)
王维伟 (Wang Weiwei)
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Priority to EP21795267.0A (published as EP4145277A4)
Publication of WO2021218623A1
Priority to US18/049,483 (published as US20230069032A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877 Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/4887 Scheduling strategies for dispatcher involving deadlines, e.g. rate based, periodic
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F15/8046 Systolic arrays
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of processing cores, in particular to a data processing device, a chip and a data processing method.
  • the chip is the cornerstone of data processing, and it fundamentally determines people's ability to process data. From the perspective of application fields, there are two main routes for chips: one is the general-purpose chip route, such as the central processing unit (CPU); such chips provide great flexibility, but their effective computing power when processing algorithms in specific fields is relatively low. The other is the dedicated chip route, such as the Tensor Processing Unit (TPU); such chips can deliver higher effective computing power in certain specific fields, but in flexible and rapidly changing fields their versatility is poor and their processing capability is weak, or they cannot handle those fields at all.
  • due to the wide variety and huge volume of data in the intelligent era, chips are required to have extremely high flexibility, able to handle different fields and rapidly changing algorithms, as well as extremely strong processing capability, able to quickly process an extremely large and rapidly growing quantity of data.
  • multi-core or many-core chips are often used.
  • the processing cores in a multi-core chip have a certain degree of independent processing capability and have a relatively large internal storage space for storing the core's own programs, data, and weights.
  • the computing power of each core depends on many factors, such as task scheduling and distribution, chip architecture, core structure, and core circuit. Among them, task scheduling and allocation is a very critical factor: if task scheduling and allocation are reasonable, the effective computing power of each core can be fully utilized; otherwise, the effective computing power of each core remains low.
  • the chip includes a scheduler and multiple processing cores C 1 to CN .
  • the scheduler receives instructions sent from outside the chip.
  • the instructions sent by the instruction source are then transmitted to each processing core at the same time.
  • Each processing core executes the same instruction but processes different data.
  • for example, the instruction is to calculate the sum of parameter a and parameter b, but parameter a may represent a different value in different processing cores; so although two processing cores both execute a+b, the differing parameters make the results different. That is, each processing core executes the same instruction and processes different data.
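As an illustration of the same-instruction, different-data behavior described above, here is a minimal Python sketch. It is not part of the patent; the instruction name and per-core parameter values are assumptions chosen for illustration.

```python
# Minimal sketch: every "core" executes the same instruction (a+b),
# but each core holds different parameter values, so results differ.
def execute(instruction, params):
    a, b = params["a"], params["b"]
    if instruction == "add":
        return a + b
    raise ValueError("unknown instruction")

# Hypothetical per-core parameters: same instruction, different data.
core_params = [{"a": 1, "b": 10}, {"a": 2, "b": 10}, {"a": 3, "b": 10}]
results = [execute("add", p) for p in core_params]
print(results)  # [11, 12, 13]
```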
  • each processing core can have a relatively simple structure, such as a single instruction multiple data structure (SIMD) or a single instruction multiple thread structure (SIMT) .
  • Fig. 2 is a schematic structural diagram of a chip provided by another prior art.
  • the chip includes a plurality of processing cores C 1 to CN and a memory unit Memory.
  • each core can independently read instructions from Memory (such as DDR SDRAM) and perform operations.
  • each core has a complete control circuit, register set and other circuits. This structure is more common in multi-core CPUs or ASICs.
  • the present invention provides a data processing device, which solves the technical problem that multiple processing cores need to separately access instruction storage areas to execute the same program, resulting in higher power consumption.
  • a first aspect of the present invention provides a data processing device, including: a plurality of processing cores having a preset execution sequence, the plurality of processing cores including a first processing core and at least one other processing core; the first processing core is used to send an instruction, and to receive and execute the program obtained according to the instruction; each other processing core is used to receive and execute the program sent by the previous processing core in the preset execution sequence.
  • each other processing core is used to receive and execute the executed program sent by the previous processing core.
  • each of the other processing cores is used to receive and execute the just completed program sent by the previous processing core.
  • the other processing cores include an intermediate processing core and a tail processing core; the intermediate processing core is used to send the executed program to the next processing core.
  • a storage management unit configured to receive an instruction sent by the first processing core, obtain a program from an external storage unit according to the instruction, and send the obtained program to the first processing core.
  • the first processing core is used to send instructions according to the synchronization signal; each other processing core is used to receive and execute the program sent by the previous processing core according to the synchronization signal.
  • processing cores include intermediate processing cores and tail processing cores; each intermediate processing core is used to send, according to the synchronization signal, a program that has been executed in the cycle of the previous synchronization signal to the next processing core.
  • the intermediate processing core is used for receiving the program sent by the previous processing core, and at the same time sending the program executed in the period of the previous synchronization signal to the latter processing core.
  • each intermediate processing core is also used to store a second update program; each intermediate processing core is used to execute the second update program when receiving the synchronization signal, and, according to the second update program, send the program that was executed in the previous synchronization signal's cycle to the next processing core.
  • each of the multiple processing cores is also used to send a synchronization request signal separately after executing the program it receives.
  • the program obtained according to the instruction is a program segment.
  • a chip including one or more data processing devices provided in the first aspect.
  • an electronic device including one or more cards provided in the third aspect.
  • a data processing method applied to a data processing device, where the data processing device includes a plurality of processing cores having a preset execution sequence, and the plurality of processing cores includes a first processing core and at least one other processing core; the data processing method includes: the first processing core receives and executes the program obtained according to the instruction; each other processing core receives and executes the program sent by the previous processing core in the preset execution sequence.
  • an electronic device including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the data processing method of the fifth aspect when the program is executed.
  • a computer program product which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
  • the first processing core sends an instruction and receives the program obtained according to the instruction, and each other processing core receives and executes the program sent by the previous processing core in the preset execution sequence; there is no need for each processing core to read data from the Memory separately, which reduces power consumption.
  • multiple processing cores can execute the same program.
  • each processing core will execute a complete program from start to finish, which can avoid data exchange between cores, reduce the delay and power consumption caused by data exchange, and improve the efficiency of data processing.
  • Fig. 1 is a schematic structural diagram of a chip provided by the prior art;
  • Fig. 2 is a schematic structural diagram of a chip provided by another prior art;
  • Fig. 3 is a schematic structural diagram of a data processing device according to the present invention;
  • Fig. 4 is a schematic structural diagram of another data processing device according to the present invention;
  • Fig. 5 is a sequence diagram of the processing cores executing programs in the data processing device shown in Fig. 4;
  • Fig. 6 is a schematic flowchart of a data processing method provided by an embodiment of the present invention.
  • all processing cores logically form a logical core chain; that is, C 1 to CN are numbered sequentially according to the order in which the program is passed along, and the preset execution order refers to the order of the processing cores in the logical core chain.
  • the present invention refers to the first core on the logical core chain as the first processing core, and the processing cores other than the first processing core as other processing cores.
  • the last processing core on the logical core chain among other processing cores is called the tail processing core, and the processing cores other than the tail processing core on the logical core chain among the other processing cores are called intermediate processing cores.
  • the multiple processing cores include a first processing core and at least one other processing core.
  • the first processing core is, for example, processing core C 1
  • the other processing cores are, for example, processing core C 2 to processing core CN .
  • the previous processing core refers to the processing core that precedes and is adjacent to a given processing core in the preset execution order. For example, for the processing core C 3 , the previous processing core is C 2 .
  • each processing core has the same circuit structure.
  • each processing core can execute the same program without the need to design a complicated parallel circuit, which saves area in the data processing device; and since each processing core executes the same complete program from beginning to end, no processing core needs to send its calculation result to another, and no processing core needs to wait for another's result, thus reducing the delay and power consumption caused by data exchange and improving the efficiency of data processing.
  • each other processing core is used to receive and execute the executed program sent by the previous processing core.
  • each other processing core is used to receive and execute the latest executed program sent by the previous processing core.
  • the latest executed program refers to the program that was executed most recently before the current moment, that is, the executed program closest to the current moment.
  • each subsequent processing core in the preset execution sequence executes the program just executed by its adjacent predecessor, so that the multiple processing cores in the data processing device execute the same program. This enables the multiple processing cores to execute programs in parallel, which can complete tasks in large batches and improves the computing power of the entire chip.
  • this data processing device is more suitable for data processing and task execution in neural networks that run in batches.
  • the other processing cores include at least one intermediate processing core (for example, processing core C 2 to processing core CN-1 ) and tail processing core CN .
  • each intermediate processing core is used to send the executed program to the subsequent processing core.
  • each intermediate processing core is used to send the program that has just been executed to the next processing core.
  • the intermediate processing core is used to send the program that has just been executed to the next processing core while receiving the program sent by the previous processing core.
  • while receiving the program obtained according to the instruction, the first processing core sends the program that it has just executed to the next processing core (C 2 in this example).
  • all the processing cores except the tail processing core receive the program and send the program at the same time, so that all the processing cores can implement the program update in parallel, which greatly reduces the delay caused by the program update.
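The parallel receive-and-send behavior described above can be sketched as a shift register of program segments along the logical core chain. This is an illustrative software model, not the patent's circuit; the function and variable names are assumptions.

```python
# Sketch (assumed model): one synchronization step shifts program segments
# along the logical core chain like a shift register. All cores except the
# tail receive and send "at the same time" in this model.
def sync_step(cores, new_segment):
    """cores[i] holds the segment core i executed in the previous cycle.
    The head core receives a new segment from the MME; every other core
    receives its predecessor's just-executed segment; the tail's old
    segment simply drops off the end of the chain."""
    return [new_segment] + cores[:-1]

chain = ["P_1", None, None]      # after cycle 1: only C1 has run P_1
chain = sync_step(chain, "P_2")  # cycle 2: C1 gets P_2, C2 gets P_1
print(chain)  # ['P_2', 'P_1', None]
```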
  • each intermediate processing core sends the program that has just been executed to the next processing core, and after receiving the program sent by the previous processing core, it starts to execute the program it receives.
  • the first processing core is used to start executing the new program after sending the program that has just been executed to the next processing core and receiving the new program obtained according to the instruction.
  • the MME only needs to read the program from the external Memory according to the instruction sent by the first processing core, and to send the program to the first processing core alone; there is no need to send the program to the other processing cores separately, so no complex circuit structure needs to be designed in the MME. Even without the MME sending the program to all processing cores, all processing cores can execute the same program, thereby reducing delay and power consumption.
  • the data processing device further includes a synchronization generator (Synchronization Generator, S_G).
  • S_G Synchronization Generator
  • the S_G is used to generate a synchronization signal after receiving a synchronization request signal sent by each of the multiple processing cores, and send the synchronization signal to each processing core.
  • by providing the S_G, the update and execution of programs can be synchronized between the processing cores, reducing the complexity of inter-core synchronization.
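The barrier-like behavior attributed to the S_G (generate a Sync only after every processing core has sent its synchronization request signal) can be sketched as follows. This is a software model under assumptions; the class and method names are invented for illustration.

```python
# Sketch of the synchronization generator's assumed barrier behavior:
# a Sync is broadcast only after every core has sent its request signal.
class SyncGenerator:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.pending = set()   # cores whose request has arrived
        self.sync_count = 0    # number of Syncs broadcast so far

    def request(self, core_id):
        """Called by a core after it finishes executing its program.
        Returns True when this request completes the barrier and a
        Sync is broadcast to all cores."""
        self.pending.add(core_id)
        if len(self.pending) == self.num_cores:
            self.pending.clear()
            self.sync_count += 1
            return True
        return False

sg = SyncGenerator(3)
fired = [sg.request(c) for c in (0, 1, 2)]
print(fired, sg.sync_count)  # [False, False, True] 1
```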
  • the first processing core is used to send instructions according to the synchronization signal.
  • Each other processing core is used to receive and execute the program sent by the previous processing core according to the synchronization signal.
  • Each intermediate processing core is used to send the executed program in the cycle of the previous synchronization signal to the next processing core according to the synchronization signal.
  • the period of the previous synchronization signal refers to the time period between receiving the previous synchronization signal and receiving the current synchronization signal.
  • in FIG. 3, the thin broken line with an arrow from C 1 to the MME represents the instruction transfer process: the MME receives the instruction, parses and executes it, that is, fetches the new program from the external Memory and transmits it to the PRAM of the first processing core C 1 .
  • the dotted line with arrows between adjacent processing cores represents the transfer process of the program.
  • each intermediate processing core is used to first determine, according to the synchronization signal, whether it executed a program in the previous synchronization signal cycle; if it did, that intermediate processing core sends the program executed in the previous synchronization signal's cycle to the next processing core.
  • each intermediate processing core is used to send the executed program in the period of the previous synchronization signal to the next processing core according to the number of received synchronization signals. For example, when the number of synchronization signals received by the intermediate processing core exceeds the preset number of times, the program executed in the cycle of the previous synchronization signal is sent to the subsequent processing core.
  • the program executed by the core in the previous synchronization signal's cycle is sent to the next processing core on the logical core chain (that is, the next processing core according to the preset execution order).
  • each processing core finishes sending the program executed in the previous synchronization signal's cycle, and after receiving the new program to be run in the current synchronization signal's cycle, starts to execute the program just received.
  • Each of the multiple processing cores is also used to separately send synchronization request signals after executing the programs it receives.
  • the first processing core is also used to store the first update program.
  • the first update program is a resident program in the first processing core, which is written during initialization and saved in the first processing core under the control of the Host or the top-level MCU.
  • during the entire task, the first update program itself will not be changed; only when a program change requires the first update program to change is it changed, by reinitializing the first processing core.
  • the program acquired according to the instruction may be, for example, a calculation program stored in the external Memory, and the executed program sent by the first processing core to the next processing core refers to the executed calculation program.
  • the first processing core is used to send instructions, including:
  • the first processing core is used to execute the update program when receiving the synchronization signal, and send instructions according to the update program.
  • the intermediate processing core is also used to store the second update program.
  • the second update program is a resident program in each intermediate processing core; under the control of the Host or the top-level MCU, it is written into and saved in the PRAM of each intermediate processing core during initialization. During the entire task, the second update program itself will not be changed; only when a program change requires the second update program to change is it changed, by reinitializing each intermediate processing core.
  • when each intermediate processing core receives the synchronization signal, it executes the second update program and, according to the second update program, sends the program executed in the previous synchronization signal's cycle to the next processing core in the logical core chain.
  • when each intermediate processing core receives the synchronization signal, it determines, according to the number of synchronization signals received, whether to execute the second update program; by executing the second update program, it sends the program executed in the previous synchronization signal cycle to the next processing core on the logical core chain.
  • when the number of received synchronization signals exceeds a preset number, the second update program is executed, where the preset number is the sequence number of the intermediate processing core in the preset execution order.
  • for example, if the intermediate processing core is located at the fifth position in the preset execution order, that is, it is the fifth processing core in the logical core chain, then the second update program is executed once the fifth processing core has received more than 5 synchronization signals.
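The forwarding condition above reduces to a simple comparison. The helper below is a hypothetical illustration of that rule, not part of the patent:

```python
# Sketch of the forwarding condition (hypothetical helper): the i-th
# processing core in the chain starts forwarding its just-executed segment
# only once the number of received Sync signals exceeds its position.
def should_forward(position_in_chain, syncs_received):
    return syncs_received > position_in_chain

print(should_forward(5, 5))  # False: the fifth core still waits
print(should_forward(5, 6))  # True: more than 5 Syncs received
```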
  • task allocation and scheduling strategies are determined before the data processing device starts running, that is, when compiling.
  • the update program resides in each processing core, while the calculation program is stored in the external Memory, which reduces the complexity of program handling during operation of the data processing device.
  • the program is divided into program segments, and the program segments are updated and executed in sequence; that is, the program obtained according to the instruction is a program segment.
  • the above-mentioned data processing device further includes at least one interconnection structure; the interconnection structure is, for example, a network on chip (NoC), a bus, or a switch.
  • NoC is selected as the interconnect structure.
  • the interconnect structure is used to connect the MME and each processing core; the first processing core exchanges data with the MME through the interconnect structure. For example, the first processing core sends instructions to the MME through the interconnect structure, and the MME sends the program obtained according to the instructions to the first processing core through the interconnect structure; each other processing core receives, through the interconnect structure, the program sent by the previous processing core in the preset execution order.
  • the first processing core sends an instruction and receives the program obtained according to the instruction, and each other processing core receives and executes the program sent by the previous processing core in the preset execution sequence, without each processing core needing to read data from the Memory separately, which reduces power consumption.
  • each processing core executes the same complete program, avoiding data exchange between cores, reducing the delay and power consumption caused by data exchange, and improving the efficiency of data processing.
  • Fig. 4 is a schematic structural diagram of a data processing device according to the present invention.
  • the data processing device includes S_G, MME, NoC, and three processing cores.
  • the preset execution order of the three processing cores is the first processing core C 1 , the intermediate processing core C 2 and the tail processing core C 3 ; that is, the program is sent from C 1 to C 2 , and then from C 2 to C 3 .
  • Each processing core is equipped with a PRAM used to store programs, and the storage capacity of each PRAM is set to 36KB.
  • the first part of the capacity of the PRAM of C 2 and C 3 is used to store the resident second update program, and the second part is used to store the calculation program.
  • the capacity of the second part of the three processing cores is the same.
  • each processing core can be set to store up to 32KB of calculation program segments each time.
  • if the calculation program of the neural network is 64KB and each core can store at most 32KB of the calculation program at a time, then the calculation program of the neural network is updated and executed as two program segments.
  • the two program segments are the first program segment P_1 and the second program segment P_2.
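The segmentation in this example amounts to ceiling-dividing the program size by the per-core storage budget. A small sketch, using the sizes from the example above (the helper name and segment labels are assumptions):

```python
# Sketch: split a program of total_kb into segments of at most segment_kb,
# labeled P_1, P_2, ... as in the example above.
def split_program(total_kb, segment_kb):
    n = -(-total_kb // segment_kb)  # ceiling division
    return [f"P_{i + 1}" for i in range(n)]

segments = split_program(64, 32)  # 64KB program, 32KB per-core budget
print(segments)  # ['P_1', 'P_2']
```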
  • as shown in Fig. 5, first, at time t0, the S_G generates the first synchronization signal Sync and sends it to the three processing cores respectively.
  • C 1 runs the first resident update program, and according to the first update program, C 1 sends instructions to the MME. After receiving the instruction, the MME parses and executes the instruction, reads P_1 from the external Memory and sends it to C 1 .
  • since each processing core has received only the first Sync, C 1 will not send an executed program segment to C 2 , and C 2 will not send a program executed in a previous Sync cycle to C 3 . Therefore, neither C 2 nor C 3 receives from its previous processing core a program segment executed in the last synchronization signal cycle. C 1 starts to execute P_1 after receiving the P_1 sent by the MME, while C 2 and C 3 have no updates or calculations during t1, so when the Sync arrives they immediately send their respective synchronization request signals to the S_G; that is, C 2 sends the second synchronization request signal SQ_2 to the S_G, and C 3 sends the third synchronization request signal SQ_3 to the S_G.
  • since the Sync that C 1 now receives is not the first one, C 1 sends the program segment executed in the previous synchronization cycle to the next core; that is, C 1 sends P_1 to C 2 , and C 2 saves the received program in its PRAM.
  • C 2 receives the P_1 sent by C 1 ; because C 2 executed no program in the previous synchronization signal's cycle, it does not send an executed program segment to C 3 . After C 2 finishes receiving the P_1 sent by C 1 , it starts to execute P_1.
  • C 3 does not have any update or calculation in the second Sync cycle. After receiving the second Sync, it will immediately send SQ_3 to S_G.
  • C 2 finishes receiving P_1 and starts to execute it; when the execution is completed, C 2 sends SQ_2 to the S_G.
  • after C 1 receives the third Sync, it runs the resident first update program and sends an instruction to the MME according to the first update program; after receiving the instruction, the MME parses and executes it, that is, reads the first program segment of the new program from the external Memory and sends it to C 1 .
  • in this example, the first program segment of the new program is still P_1.
  • the present invention takes this as an example but is not limited to this.
  • C 2 receives the P_2 sent by C 1 . Since C 2 executed a program in the previous synchronization cycle, C 2 runs the resident second update program and, according to the second update program, sends the P_1 executed in the previous synchronization signal's cycle to C 3 .
  • C 3 receives P_1 from C 2 and executes P_1.
  • each processing core will run P_1 and P_2 to completely process a neural network task.
  • the first processing core receives, in each time period, the program segment sent by the MME to update its program segment, and then executes this program segment.
  • Processing cores other than the first processing core will receive the program segment sent by the previous core in the logic core chain and executed in the cycle of the last synchronization signal during this time period, and execute the received program segment , So as to realize the sequential transfer of the program segments.
  • a synchronization counter is set to record the number of Sync received, so that the first processing core will know the synchronization time period through the synchronization counter and how to configure the MME To instruct the MME to fetch the corresponding program segment from the Memory.
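The synchronization-counter logic described above can be sketched as follows. This is a hedged illustration, not the patented implementation: the class name, method names, and the modulo scheme for choosing the next segment are all assumptions introduced for clarity.

```python
# Hypothetical sketch of the synchronization counter kept by the first
# processing core's resident update program: the counter records how
# many Syncs have arrived and selects which program segment the MME
# should fetch from external Memory in the current sync period.

class FirstCoreUpdate:
    def __init__(self, num_segments):
        self.num_segments = num_segments  # e.g. 2 for segments P_1 and P_2
        self.sync_count = 0

    def on_sync(self):
        """Run once per Sync; returns the index of the segment the MME
        should fetch for this synchronization period."""
        index = self.sync_count % self.num_segments
        self.sync_count += 1
        return index

upd = FirstCoreUpdate(num_segments=2)
# Over four sync periods the first core requests P_1, P_2, P_1, P_2.
fetches = [upd.on_sync() for _ in range(4)]
```

With two segments the fetch sequence simply alternates, matching the repeated P_1/P_2 cycle in the timing example.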
  • The invention is suitable for executing batched neural network processing tasks.
  • The present invention also provides a chip that includes one or more of the data processing devices provided in the foregoing embodiments.
  • The present invention also provides a card board that includes one or more of the chips provided in the foregoing embodiments.
  • The present invention also provides an electronic device that includes one or more of the card boards provided in the foregoing embodiments.
  • Fig. 6 is a schematic flowchart of a data processing method according to an embodiment.
  • The data processing method is applied to a data processing device that includes a plurality of processing cores with a preset execution order, the plurality of processing cores including a first processing core and at least one other processing core.
  • The data processing method includes:
  • Step S101: the first processing core sends an instruction, then receives and executes the program obtained according to the instruction.
  • Step S102: each other processing core receives and executes the program sent by the previous processing core in the preset execution order.
  • Receiving and executing the program sent by the previous processing core includes: each other processing core receiving and executing a program that the previous processing core has already executed.
  • Further, each other processing core receives and executes the latest executed program sent by the previous processing core.
  • The latest executed program is the program that finished executing just before the current moment, that is, the executed program closest to the current moment.
  • The other processing cores include at least one intermediate processing core and a tail processing core.
  • Each intermediate processing core also sends its executed program to the subsequent processing core.
  • Specifically, each intermediate processing core sends the program it has just finished executing to the next processing core.
  • While receiving the program obtained according to the instruction, the first processing core sends the program it has just finished executing to the next processing core.
  • Each intermediate processing core sends the just-executed program to the next processing core and, after receiving the program sent by the previous processing core, starts executing the program it received.
  • After the first processing core has sent the just-executed program to the next processing core and received the new program obtained according to the instruction, it starts executing the received new program.
  • The tail processing core starts executing a program after receiving the just-executed program from the previous processing core.
  • The first processing core sends an instruction, then receives and executes the program obtained according to the instruction.
  • Specifically, the first processing core sends an instruction to the MME, then receives and executes the program the MME reads from the external Memory according to that instruction.
  • In step S101, after the first processing core sends the instruction and before it receives and executes the program obtained according to the instruction, the method further includes:
  • the MME receiving the instruction sent by the first processing core, obtaining the program from the external Memory according to the instruction, and sending the obtained program to the first processing core.
  • The method further includes: each of the multiple processing cores sending a synchronization request signal after executing its own program.
  • The synchronization signal generator generates a synchronization signal after receiving the synchronization request signals sent by all the processing cores in the data processing device, and sends the synchronization signal to each processing core.
  • The first processing core sending an instruction includes: the first processing core sending the instruction according to the synchronization signal.
  • Each other processing core receiving and executing the program sent by the previous processing core in the preset execution order includes: each other processing core receiving and executing that program according to the synchronization signal.
  • Each intermediate processing core sends, according to the synchronization signal, the program executed during the cycle of the previous synchronization signal to the subsequent processing core.
  • The first processing core sending an instruction and receiving and executing the program obtained according to the instruction includes: the first processing core, while receiving the program obtained according to the instruction, sending the program executed in the previous synchronization signal cycle to the next processing core.
  • Each intermediate processing core sending, according to the synchronization signal, the program executed in the previous synchronization cycle to the next processing core includes: each intermediate processing core sending that program to the next processing core while receiving the program sent by the previous processing core.
  • A computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the data processing method of the fifth aspect is implemented.
  • An electronic device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the program, the data processing method of the fifth aspect is implemented.
  • A computer program product is provided, which includes computer instructions; when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
  • The embodiments of the present invention may be provided as a data processing method, a data processing system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

Abstract

A data processing device, a chip, and a data processing method. The data processing device includes a plurality of processing cores having a preset execution order, the plurality of processing cores including a first processing core and at least one other processing core. The first processing core is used to send an instruction and to receive and execute a program obtained according to the instruction (S101); each other processing core is used to receive and execute the program sent by the previous processing core in the preset execution order (S102). With this data processing device, each processing core does not need to read data from the Memory separately, which reduces power consumption; in addition, each processing core executes the same complete program, which avoids data exchange between cores, reduces the latency and power consumption caused by exchanging data, and improves the efficiency of data processing.

Description

Data processing device, chip, and data processing method — Technical Field
The present invention relates to the technical field of processing cores, and in particular to a data processing device, a chip, and a data processing method.
Background
With the development of science and technology, human society is rapidly entering the intelligent era. An important characteristic of the intelligent era is that people obtain more and more kinds of data in ever larger quantities, while demanding ever higher data-processing speed.
Chips are the cornerstone of data processing; they fundamentally determine the ability to process data. In terms of application fields, chips follow two main routes. One is the general-purpose route, for example the Central Processing Unit (CPU); such chips offer great flexibility but deliver relatively low effective computing power on domain-specific algorithms. The other is the special-purpose route, for example the Tensor Processing Unit (TPU); such chips deliver high effective computing power in certain specific domains but handle flexible, general-purpose workloads poorly or not at all.
Because the intelligent era brings data of many kinds and enormous volume, chips are required to have both great flexibility, to handle algorithms that differ across domains and change rapidly, and great processing power, to quickly process enormous and rapidly growing volumes of data.
Multi-core or many-core chips are often used in neural network computation. Each processing core in such a chip has a certain independent processing capability and a relatively large in-core storage space for storing the core's own programs, data, and weights.
How to let the many cores exert their computing power efficiently is the key that determines the performance of the whole chip. The computing power each core can deliver depends on many factors, such as task scheduling and allocation, chip architecture, core structure, and core circuitry. Among these, task scheduling and allocation is critical: if tasks are scheduled and allocated reasonably, each core can fully deliver its effective computing power; otherwise the effective computing power of each core is low.
Fig. 1 is a schematic structural diagram of a chip provided by the prior art.
As shown in Fig. 1, the chip includes a scheduler and multiple processing cores C1 to CN. The scheduler receives instructions sent from outside the chip, for example from an external instruction source, and then transmits the instructions simultaneously to each processing core; each processing core executes the same instructions but processes different data. For example, if the instruction is to compute the sum of parameters a and b, parameter a may represent different values on different cores; although two cores both execute a+b, the results differ because the parameters differ. That is, the cores execute the same instructions on different data.
With the chip architecture of Fig. 1, each processing core can be relatively simple, for example a Single Instruction Multiple Data (SIMD) structure or a Single Instruction Multiple Threads (SIMT) structure.
This approach usually has the following drawback:
When multiple processing cores need to execute the same program, they can only passively receive instructions from outside through the scheduler, which then sends the instructions to each core in parallel. A complex parallel circuit therefore has to be designed into the chip, resulting in a large area.
Fig. 2 is a schematic structural diagram of another chip provided by the prior art.
As shown in Fig. 2, the chip includes multiple processing cores C1 to CN and a storage unit (Memory). Each core independently reads instructions from the Memory (for example DDR SDRAM) and performs computation; each core usually has complete control circuits, register files, and other circuitry. This structure is common in multi-core CPUs and ASICs.
This approach usually has the following drawback:
Multiple processing cores may access the instruction storage area frequently, degrading storage-access efficiency and in turn limiting the computing power the chip can deliver.
Summary of the Invention
The present invention provides a data processing device that solves the technical problem of high power consumption caused by multiple processing cores each having to access the instruction storage area to execute the same program.
A first aspect of the present invention provides a data processing device, including: a plurality of processing cores having a preset execution order, the plurality of processing cores including a first processing core and at least one other processing core; the first processing core is used to send an instruction and to receive and execute a program obtained according to the instruction; each other processing core is used to receive and execute the program sent by the previous processing core in the preset execution order.
In the data processing device provided by this embodiment, the first processing core sends an instruction and receives the program obtained according to it, and every other processing core receives and executes the program sent by its predecessor in the preset execution order; there is no need for each processing core to read data from the Memory separately, which reduces power consumption.
Optionally, each other processing core is used to receive and execute a program that the previous processing core has already executed.
Further optionally, each other processing core is used to receive and execute the program the previous processing core has just finished executing.
Optionally, the other processing cores include intermediate processing cores and a tail processing core; an intermediate processing core is used to send an executed program to the subsequent processing core.
Optionally, the device further includes a memory management unit used to receive the instruction sent by the first processing core, obtain a program from an external storage unit according to the instruction, and send the obtained program to the first processing core.
Optionally, the device further includes a synchronization signal generator used to generate a synchronization signal after receiving a synchronization request signal from each of the plurality of processing cores, and to send the synchronization signal to each processing core.
Optionally, the first processing core is used to send the instruction according to the synchronization signal; each other processing core is used to receive and execute, according to the synchronization signal, the program sent by the previous processing core.
Optionally, the other processing cores include intermediate processing cores and a tail processing core; each intermediate processing core is used to send, according to the synchronization signal, the program executed during the cycle of the previous synchronization signal to the subsequent processing core.
Optionally, an intermediate processing core is used to send the program executed during the previous synchronization signal cycle to the subsequent processing core while receiving the program sent by the previous processing core.
Optionally, the first processing core is further used to store a first update program; the first processing core sending the instruction includes: the first processing core executing the first update program upon receiving the synchronization signal and sending the instruction according to the update program.
Further optionally, each intermediate processing core is further used to store a second update program; upon receiving the synchronization signal, each intermediate processing core executes the second update program and, according to it, sends the program executed in the previous synchronization signal cycle to the subsequent processing core.
Optionally, each of the plurality of processing cores is further used to send a synchronization request signal after finishing execution of the program it received.
Optionally, the program obtained according to the instruction is a program segment.
According to a second aspect of the present invention, a chip is provided, including one or more data processing devices of the first aspect.
According to a third aspect of the present invention, a card board is provided, including one or more chips of the second aspect.
According to a fourth aspect of the present invention, an electronic device is provided, including one or more card boards of the third aspect.
According to a fifth aspect of the present invention, a data processing method is provided, applied to a data processing device that includes a plurality of processing cores having a preset execution order, the plurality of processing cores including a first processing core and at least one other processing core. The data processing method includes: the first processing core receives and executes a program obtained according to an instruction; each other processing core receives and executes the program sent by the previous processing core in the preset execution order.
According to a sixth aspect of the present invention, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the data processing method of the fifth aspect is implemented.
According to a seventh aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the program, the data processing method of the fifth aspect is implemented.
According to an eighth aspect of the present invention, a computer program product is provided, including computer instructions; when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
In the data processing device provided by the embodiments of the present invention, the first processing core sends an instruction and receives the program obtained according to it, and every other processing core receives and executes the program sent by its predecessor in the preset execution order. Each processing core does not need to read data from the Memory separately, which reduces power consumption; moreover, multiple processing cores can execute the same program without designing complex parallel circuits. In addition, each processing core executes the complete program from beginning to end, which avoids data exchange between cores, reduces the latency and power consumption caused by exchanging data, and improves the efficiency of data processing.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of a chip provided by the prior art;
Fig. 2 is a schematic structural diagram of another chip provided by the prior art;
Fig. 3 is a schematic structural diagram of a data processing device provided according to the present invention;
Fig. 4 is a schematic structural diagram of another data processing device provided according to the present invention;
Fig. 5 is a timing diagram of program execution by the processing cores in the data processing device shown in Fig. 4;
Fig. 6 is a schematic flowchart of a data processing method provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are only exemplary and are not intended to limit the scope of the present invention. In the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the present invention.
Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used only for descriptive purposes and shall not be understood as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with one another as long as they do not conflict.
Moreover, existing components not related to the improvements of the present invention are introduced only briefly or not at all, while the components improved relative to the prior art are described in detail.
Fig. 3 is a schematic structural diagram of a data processing device provided according to the present invention.
As shown in Fig. 3, the data processing device includes:
a plurality of processing cores having a preset execution order, for example processing cores C1 to CN.
It should be noted that, in the present invention, all the processing cores logically form a logical core chain while the chip is running; that is, C1 to CN are numbered sequentially according to the logic by which programs are sent, and the preset execution order is the order of the processing cores in the logical core chain. For convenience of description, the present invention calls the first core in the logical core chain the first processing core, and the cores other than the first processing core the other processing cores. Among the other processing cores, the last one in the logical core chain is called the tail processing core, and those other than the tail processing core are called intermediate processing cores. There is no master-slave relationship among the processing cores of the logical core chain, only a logical ordering.
Each of the plurality of processing cores is provided with a storage unit PRAM used to store the program it receives. The PRAM may, for example, be a Static Random Access Memory (SRAM) dedicated to storing programs, with a certain size (usually >= 16 KB).
The plurality of processing cores includes a first processing core and at least one other processing core; the first processing core is, for example, processing core C1, and the other processing cores are, for example, processing cores C2 to CN.
The first processing core is used to send an instruction and to receive and execute the program obtained according to the instruction.
Each other processing core is used to receive and execute the program sent by the previous processing core in the preset execution order.
The previous processing core is the processing core that precedes and is adjacent to the present core in the preset execution order. For example, for processing core C3, the previous processing core is C2.
It can be understood that, in this embodiment, a processing core receiving a program means storing the program in its own PRAM, and a processing core executes the program stored in its own PRAM. Therefore, the data processing device provided by the embodiments of the present invention does not require processing cores to read instructions from the external Memory, avoids cores occupying the Memory, and reduces power consumption. Moreover, because every core reads programs from its own PRAM, complex cache circuits need not be designed; and since each core executes the program stored in its own PRAM, program execution is fast with low latency, which can greatly improve instruction execution efficiency.
Optionally, in the data processing device, the processing cores have identical circuit structures.
It should be noted that in some prior art, completing a relatively large computing task requires the cooperation of multiple processing cores: for example, one core executes the first part of the task and sends its result to another core, which takes that result as input and executes the second part to obtain the final result. In the data processing device provided by the embodiments of the present invention, all processing cores can execute the same program without designing complex parallel circuits, saving device area; each core executes the same complete program from beginning to end without sending intermediate results to another core or waiting for them, which reduces the latency and power consumption of data exchange and improves data processing efficiency.
In a preferred embodiment, each other processing core is used to receive and execute a program that the previous processing core has already executed.
Further, each other processing core is used to receive and execute the latest executed program sent by the previous processing core. The latest executed program is the program that finished executing just before the current moment, that is, the executed program closest to the current moment.
In this embodiment, each later core in the preset execution order executes the program its adjacent predecessor has just finished. All cores in the device thus execute the same program in parallel, completing tasks in large batches and improving the computing power of the whole chip; the device is particularly suitable for data processing and task execution in batched neural network workloads.
Preferably, the other processing cores include at least one intermediate processing core (for example, processing cores C2 to CN-1) and a tail processing core CN.
Each intermediate processing core is used to send an executed program to the subsequent processing core.
Further, each intermediate processing core is used to send the program it has just finished executing to the subsequent processing core.
In this embodiment, since CN is last in the preset execution order, it does not need to send the just-executed program onward.
In one embodiment, an intermediate processing core is used to send the just-executed program to the subsequent processing core while receiving the program sent by the previous processing core.
While receiving the program obtained according to the instruction, the first processing core sends the program it has just finished executing to the subsequent processing core (C2 in this example).
In this embodiment, all cores except the tail core receive and send programs simultaneously, so that program updates proceed in parallel across all cores, greatly reducing the latency caused by program updates.
In one embodiment, each intermediate processing core sends the just-executed program to the subsequent processing core and, after receiving the program sent by the previous processing core, starts executing the program it received.
The first processing core is used to start executing the new program after it has sent the just-executed program to the subsequent processing core and has received the new program obtained according to the instruction.
The tail processing core is used to start executing the program after receiving the just-executed program from the previous processing core.
It can be understood that, among the plurality of processing cores with a preset execution order, every core except the tail core sends a program to its next core in that order; that is, a program is passed in sequence from the first processing core to the tail processing core.
In one embodiment, the data processing device further includes a Memory Management Engine (MME). The MME is used to receive the instruction sent by the first processing core, obtain a program from the external storage unit according to the instruction, and send the obtained program to the first processing core. The MME can also store, parse, and execute instructions, and can transfer data between the chip's internal RAM and the Memory. The MME is, for example, a Direct Memory Access Controller (DMAC).
In this embodiment, the MME only needs to read the program from the external Memory according to the instruction sent by the first processing core and send it to the first processing core alone; it does not need to send the program to each of the other cores. The MME therefore needs no complex circuit design, and all cores can execute the same program without the MME distributing it to every core, which reduces both latency and power consumption.
In one embodiment, the data processing device further includes a Synchronization Generator (S_G).
The S_G is used to generate a synchronization signal after receiving a synchronization request signal from each of the plurality of processing cores, and to send the synchronization signal to each processing core.
In this embodiment, the S_G synchronizes the updating and running of programs among the processing cores, reducing the complexity of inter-core synchronization.
In one embodiment, the first processing core is used to send the instruction according to the synchronization signal, and each other processing core is used to receive and execute, according to the synchronization signal, the program sent by the previous processing core.
Each intermediate processing core is used to send, according to the synchronization signal, the program executed during the cycle of the previous synchronization signal to the subsequent processing core. The cycle of the previous synchronization signal is the time period between receipt of the previous synchronization signal and receipt of the current one.
Specifically, the arrival of a synchronization signal marks the beginning of a synchronization cycle. When the signal arrives, the first processing core of the logical core chain sends a program-update instruction to the MME. In Fig. 3, the thin dashed arrow from C1 to the MME represents the transfer of the instruction; the MME receives, parses, and executes the instruction, fetching the new program from the external Memory and sending it into the first processing core's PRAM. The dashed arrows between adjacent processing cores represent the transfer of programs.
Optionally, each intermediate processing core first determines, according to the synchronization signal, whether it executed a program during the previous synchronization cycle; if it did, it sends the program executed in that cycle to the subsequent processing core.
Optionally, each intermediate processing core sends the program executed in the previous synchronization cycle to the subsequent processing core according to the number of synchronization signals it has received; for example, when the number of synchronization signals received exceeds a preset number, it sends the program executed in the previous synchronization cycle to the subsequent processing core.
For every processing core except the tail core of the logical core chain, after a synchronization signal arrives, the core sends the program it executed during the previous synchronization cycle to the next core in the logical core chain (that is, the subsequent core in the preset execution order).
Preferably, an intermediate processing core sends the program executed in the previous synchronization cycle to the subsequent core while receiving the program sent by the previous core; the first processing core sends the program executed in the previous synchronization cycle to the subsequent core while receiving the program obtained according to the instruction.
In this embodiment, all cores except the tail core receive and send programs simultaneously, so program updates proceed in parallel across all cores, greatly reducing update latency.
Preferably, each processing core starts executing the newly received program once it has finished sending the program executed in the previous synchronization cycle and has finished receiving the new program to run in the current cycle.
Each of the plurality of processing cores is further used to send a synchronization request signal after finishing execution of the program it received.
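The S_G behaviour described above is essentially a barrier: it collects one synchronization request per core and only then broadcasts a Sync. The following sketch illustrates that collect-then-broadcast rule; the class and method names are assumptions made for illustration, not terms from the patent.

```python
# Hedged sketch of the synchronization signal generator (S_G): a Sync
# is generated only after every processing core in the device has sent
# its synchronization request signal.

class SyncGenerator:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.pending = set()      # cores whose requests have arrived
        self.syncs_issued = 0     # how many Syncs have been broadcast

    def request(self, core_id):
        """Core `core_id` reports it finished its program. Returns True
        exactly when this request completes the full set and a Sync is
        broadcast to all cores."""
        self.pending.add(core_id)
        if len(self.pending) == self.num_cores:
            self.pending.clear()
            self.syncs_issued += 1
            return True
        return False

sg = SyncGenerator(num_cores=3)
# SQ_1, SQ_2 alone do not trigger a Sync; SQ_3 completes the set.
results = [sg.request(c) for c in ("C1", "C2", "C3")]
```

This mirrors the timing diagram, where the S_G generates each Sync only after collecting SQ_1, SQ_2, and SQ_3.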
In one embodiment, the first processing core is further used to store a first update program. The first update program is a resident program of the first processing core; under the control of the Host or a top-level MCU, it is written into and kept in the first processing core's PRAM at initialization. Throughout a task, the first update program itself is not changed; only when a program change requires the first update program to change is it modified by re-initializing the first processing core. The program obtained according to the instruction may, for example, be a compute program stored in the external Memory, and the executed program that the first processing core sends to the next core is the executed compute program.
The first processing core sending the instruction includes:
the first processing core executing the update program upon receiving the synchronization signal and sending the instruction according to the update program.
Each intermediate processing core is further used to store a second update program. The second update program is a resident program; under the control of the Host or a top-level MCU it is imported, written, and kept in each intermediate processing core's PRAM at initialization. Throughout a task, the second update program itself is not changed; only when a program change requires it to change is it modified by re-initializing the intermediate processing cores.
Upon receiving a synchronization signal, each intermediate processing core executes the second update program and, according to it, sends the program executed during the previous synchronization cycle to the subsequent core in the logical core chain.
Preferably, upon receiving a synchronization signal, each intermediate processing core determines whether to execute the second update program according to the number of synchronization signals received, and by executing it sends the program executed in the previous synchronization cycle to the subsequent core in the logical core chain.
Specifically, an intermediate processing core executes the second update program when the number of synchronization signals it has received exceeds a preset number, where the preset number is the core's position in the preset execution order. For example, if an intermediate processing core is fifth in the preset execution order, i.e., the fifth core in the logical core chain, then it executes the second update program after receiving more than 5 synchronization signals.
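The threshold rule above reduces to a single comparison, sketched here under stated assumptions: the function name is invented, and the rule that the preset number equals the core's position in the chain is taken directly from the example in the text.

```python
# Sketch of the forwarding-threshold rule: the intermediate core at
# position k in the preset execution order runs its resident second
# update program (and so begins forwarding executed segments) only
# once the number of Syncs it has received exceeds k.

def runs_second_update(position_in_chain, syncs_received):
    return syncs_received > position_in_chain

# The fifth core in the chain stays passive through its first five
# Syncs and starts forwarding on the sixth.
checks = [runs_second_update(5, n) for n in (4, 5, 6)]
```

Intuitively, a segment needs k sync cycles to travel down the chain to core k, so the core has nothing to forward before then.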
It should be noted that in this data processing device the task allocation and scheduling strategy is fixed before the device starts running, i.e., at compile time; the update programs reside in the processing cores and the compute programs are stored in the external Memory, which reduces program complexity at run time.
In one embodiment, because PRAM capacity is limited, when a core's PRAM cannot store the entire program, the program is divided into program segments that are updated and executed one after another; that is, the program obtained according to the instruction is a program segment.
Preferably, the data processing device further includes at least one interconnect structure, for example a Network on Chip (NoC), a bus, or a switch. In this embodiment, a NoC is used.
The interconnect structure connects the MME and the processing cores. The first processing core exchanges data with the MME through the interconnect: for example, the first processing core sends the instruction to the MME through the interconnect, and the MME sends the program obtained according to the instruction to the first processing core through the interconnect. Each other processing core receives, through the interconnect, the program sent by the previous core in the preset execution order.
In the data processing device provided by the embodiments of the present invention, the first processing core sends an instruction and receives the program obtained according to it, and every other processing core receives and executes the program sent by its predecessor in the preset execution order; each core need not read data from the Memory separately, which reduces power consumption. In addition, each core executes the same complete program, avoiding data exchange between cores, reducing the latency and power consumption of exchanging data, and improving data processing efficiency.
Fig. 4 is a schematic structural diagram of a data processing device provided according to the present invention.
As shown in Fig. 4, the data processing device includes the S_G, the MME, a NoC, and three processing cores whose preset execution order is first processing core C1, intermediate processing core C2, and tail processing core C3; that is, programs are sent from C1 to C2, and from C2 to C3.
Each processing core is provided with a PRAM for storing programs; the storage capacity of each PRAM is set to 36 KB.
A first part of C1's PRAM capacity stores the resident first update program, and a second part stores the compute program.
Preferably, a first part of the PRAM capacity of C2 and C3 stores the resident second update program, and a second part stores the compute program. The second-part capacity is the same for all three cores.
Because C1, C2, and C3 execute the same compute program, each core can be set to store at most a 32 KB compute program segment at a time.
If the neural network's compute program is 64 KB and each core can store at most 32 KB of compute program at a time, the compute program is updated and executed in two program segments: the first program segment P_1 and the second program segment P_2.
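The segmentation arithmetic above can be checked with a few lines. This is an illustrative helper, not part of the patent; the segment naming follows the P_1/P_2 convention used in the example.

```python
# Illustrative check of the segment split: a 64 KB compute program with
# a 32 KB per-core segment budget is divided into ceil(64/32) = 2
# segments, P_1 and P_2.

def split_into_segments(program_size_kb, segment_budget_kb):
    count = -(-program_size_kb // segment_budget_kb)  # ceiling division
    return [f"P_{i + 1}" for i in range(count)]

segments = split_into_segments(64, 32)
```

A 70 KB program under the same budget would instead need three segments, the last one only partially filled.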
Fig. 5 is a timing diagram of the data processing device of Fig. 4 running a program.
As shown in Fig. 5, at time t0 the S_G generates the first synchronization signal Sync and sends it to the three processing cores.
First, C1 runs the resident first update program and, according to it, sends an instruction to the MME. The MME receives, parses, and executes the instruction, reading P_1 from the external Memory and sending it to C1.
Because this is the first Sync each core has received, C1 does not send an executed program segment to C2, and C2 does not send a program executed in a previous Sync cycle to C3. Neither C2 nor C3 therefore receives a program segment executed by its predecessor in the previous synchronization cycle. C1 starts executing P_1 once it has received it from the MME. C2 and C3 perform no updates or computation during t1, so when the Sync arrives they immediately send their synchronization request signals to the S_G: C2 sends the second synchronization request signal SQ_2, and C3 sends the third synchronization request signal SQ_3.
During t0-t1, C1 finishes receiving P_1 and executes it to completion, then sends the first synchronization request signal SQ_1 to the S_G.
At time t1, having collected SQ_1, SQ_2, and SQ_3, the S_G generates the second Sync.
After receiving the second Sync, C1 runs the resident first update program and sends an instruction to the MME accordingly; the MME receives, parses, and executes the instruction and sends the second program segment P_2 to C1.
At the same time, because the Sync C1 received is not the first, C1 sends the program segment executed in the previous synchronization cycle to the next core; that is, C1 sends P_1 to C2, and C2 saves the received program in its PRAM.
C2 receives the P_1 sent by C1. Because C2 executed no program during the previous synchronization cycle, it sends no executed program segment to C3. Once C2 has received P_1 from C1, it starts executing P_1.
C3 performs no update or computation in the second Sync cycle; upon receiving the second Sync it immediately sends SQ_3 to the S_G.
During t1-t2, C1 finishes receiving P_2 and finishes sending P_1, then executes P_2 to completion and sends SQ_1 to the S_G.
During t1-t2, C2 finishes receiving P_1, executes it to completion, and sends SQ_2 to the S_G.
At time t2, having collected SQ_1, SQ_2, and SQ_3, the S_G generates the third Sync and sends it to the three cores.
After receiving the third Sync, C1 runs the resident first update program and sends an instruction to the MME accordingly; the MME receives, parses, and executes the instruction, reading the first segment of the new program from the external Memory and sending it to C1. In this embodiment the first segment of the new program is again P_1; the present invention takes this as an example but is not limited to it.
Since C1 will start executing the neural network program again from the beginning, i.e., re-execute P_1, and since the Sync it received is not the first, C1 sends the program segment executed in the previous synchronization cycle; that is, C1 sends P_2 to C2.
C2 receives the P_2 sent by C1. Because C2 executed a program during the previous synchronization cycle, C2 runs the resident second update program and, according to it, sends the P_1 executed in the previous synchronization cycle to C3.
C3 receives the P_1 sent by C2 and executes it.
During t3-t4, C1 finishes receiving P_1 and finishes sending P_2, executes P_1 to completion, and sends SQ_1 to the S_G.
During t3-t4, C2 finishes receiving P_2 and finishes sending P_1, executes P_2 to completion, and sends SQ_2 to the S_G.
During t3-t4, C3 finishes receiving P_1, executes it to completion, and sends SQ_3 to the S_G.
At time t4, having collected SQ_1, SQ_2, and SQ_3, the S_G generates the next synchronization signal Sync.
Repeating in this way, every processing core runs P_1 and P_2, completely processing one neural network task.
In every time period, the first processing core receives the program segment sent by the MME to update its program segment and then executes it. In every time period, each core other than the first receives the program segment that its predecessor in the logical core chain executed during the previous synchronization cycle and executes it, so that program segments are passed along the chain in order.
Preferably, a synchronization counter is set in the first processing core's resident first update program to record the number of Syncs received; through it the first processing core knows, for the current synchronization period, how to configure the MME so that the MME fetches the corresponding program segment from the Memory. The present invention is suitable for executing batched neural network processing tasks.
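The cycle-by-cycle hand-off traced in Fig. 5 can be reproduced with a small simulation. This is an assumed, simplified model of the segment flow (it ignores the sync-request handshake and PRAM details and treats each sync cycle as one step); all names are illustrative.

```python
# Minimal simulation of the segment hand-off in the timing example: in
# every sync cycle the head core fetches the next segment from the
# external "Memory", while each later core receives the segment its
# predecessor executed during the previous cycle.

def simulate(segments, num_cores, cycles):
    # held[i] is the segment core i executes in the current cycle
    # (None means the core is idle and just sends its sync request).
    held = [None] * num_cores
    history = []
    for t in range(cycles):
        new_held = held[:]
        # Each non-head core receives what its predecessor ran last cycle.
        for i in range(num_cores - 1, 0, -1):
            new_held[i] = held[i - 1]
        # The head core receives the next segment fetched by the MME.
        new_held[0] = segments[t % len(segments)]
        held = new_held
        history.append(tuple(held))
    return history

trace = simulate(["P_1", "P_2"], num_cores=3, cycles=4)
```

The trace matches the description: C2 first executes P_1 in the second sync cycle, C3 first executes P_1 in the third, and from then on all three cores run segments in parallel.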
In one embodiment, the present invention further provides a chip including one or more of the data processing devices provided by the above embodiments.
In one embodiment, the present invention further provides a card board including one or more of the chips provided by the above embodiments.
In one embodiment, the present invention further provides an electronic device including one or more of the card boards provided by the above embodiments.
Fig. 6 is a schematic flowchart of a data processing method provided according to an embodiment.
As shown in Fig. 6, the data processing method is applied to a data processing device that includes a plurality of processing cores having a preset execution order, the plurality of processing cores including a first processing core and at least one other processing core.
The data processing method includes:
Step S101: the first processing core sends an instruction, then receives and executes the program obtained according to the instruction.
Step S102: each other processing core receives and executes the program sent by the previous processing core in the preset execution order.
In a preferred embodiment, each other processing core receiving and executing the program sent by the previous processing core in the preset execution order includes: each other processing core receiving and executing a program the previous processing core has already executed.
Specifically, this includes each other processing core receiving and executing the latest executed program sent by the previous processing core. The latest executed program is the program that finished executing just before the current moment, that is, the executed program closest to the current moment.
In one embodiment, the other processing cores include at least one intermediate processing core and a tail processing core. In step S102, each intermediate processing core also sends its executed program to the subsequent processing core.
Specifically, each intermediate processing core sends the program it has just finished executing to the subsequent processing core.
Preferably, while receiving the program obtained according to the instruction, the first processing core sends the just-executed program to the subsequent processing core.
More specifically, each intermediate processing core sends the just-executed program to the subsequent processing core and, after receiving the program sent by the previous processing core, starts executing the program it received.
After the first processing core has sent the just-executed program to the subsequent processing core and has received the new program obtained according to the instruction, it starts executing the received new program.
The tail processing core starts executing a program after receiving the just-executed program from the previous processing core.
In a specific embodiment, the first processing core sending an instruction and receiving and executing the program obtained according to the instruction includes: the first processing core sending an instruction to the MME, then receiving and executing the program the MME reads from the external Memory according to the instruction.
In step S101, after the first processing core sends the instruction and before it receives and executes the program obtained according to the instruction, the method further includes:
the MME receiving the instruction sent by the first processing core, obtaining the program from the external Memory according to the instruction, and sending the obtained program to the first processing core.
In one embodiment, the method further includes: each of the plurality of processing cores sending a synchronization request signal after finishing execution of its own program.
The synchronization signal generator generates a synchronization signal after receiving the synchronization request signals sent by all the processing cores in the data processing device, and sends the synchronization signal to each processing core.
The first processing core sending the instruction includes: the first processing core sending the instruction according to the synchronization signal. Each other processing core receiving and executing the program sent by the previous processing core in the preset execution order includes: each other processing core receiving and executing that program according to the synchronization signal.
Further, each intermediate processing core sends, according to the synchronization signal, the program executed during the cycle of the previous synchronization signal to the subsequent processing core.
Preferably, the first processing core sending an instruction and receiving and executing the program obtained according to the instruction includes: the first processing core, while receiving the program obtained according to the instruction, sending the program executed in the previous synchronization signal cycle to the subsequent processing core.
Each intermediate processing core sending, according to the synchronization signal, the program executed in the previous synchronization cycle to the subsequent processing core includes: each intermediate processing core sending that program while receiving the program sent by the previous processing core.
According to a sixth aspect of the present invention, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the data processing method of the fifth aspect is implemented.
According to a seventh aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the program, the data processing method of the fifth aspect is implemented.
According to an eighth aspect of the present invention, a computer program product is provided, including computer instructions; when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the fifth aspect.
It should be understood that the above specific embodiments of the present invention are only used to exemplarily illustrate or explain the principles of the present invention and do not limit it. Any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall be included in its protection scope. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims, or equivalents of such scope and boundaries.
Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the present invention.
Obviously, the above embodiments are merely examples given for clarity of description and are not a limitation on the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here; obvious changes or variations derived therefrom remain within the protection scope of the present invention.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a data processing method, a data processing system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

Claims (15)

  1. A data processing device, characterized by comprising:
    a plurality of processing cores having a preset execution order, the plurality of processing cores comprising a first processing core and at least one other processing core;
    the first processing core being used to send an instruction and to receive and execute a program obtained according to the instruction;
    each of the other processing cores being used to receive and execute the program sent by the previous processing core in the preset execution order.
  2. The data processing device according to claim 1, characterized in that
    the other processing cores comprise intermediate processing cores and a tail processing core;
    each intermediate processing core is used to send an executed program to the subsequent processing core.
  3. The data processing device according to claim 1 or 2, characterized by further comprising:
    a memory management unit, used to receive the instruction sent by the first processing core, obtain a program from an external storage unit according to the instruction, and send the obtained program to the first processing core.
  4. The data processing device according to any one of claims 1-3, characterized by further comprising:
    a synchronization signal generator, used to generate a synchronization signal after receiving a synchronization request signal sent by each of the plurality of processing cores, and to send the synchronization signal to each processing core.
  5. The data processing device according to claim 4, characterized in that
    the first processing core is used to send the instruction according to the synchronization signal;
    each of the other processing cores is used to receive and execute, according to the synchronization signal, the program sent by the previous processing core.
  6. The data processing device according to claim 4 or 5, characterized in that
    the other processing cores comprise intermediate processing cores and a tail processing core;
    each intermediate processing core is used to send, according to the synchronization signal, the program executed during the cycle of the previous synchronization signal to the subsequent processing core.
  7. The data processing device according to claim 6, characterized in that
    the intermediate processing core is used to send the program executed during the cycle of the previous synchronization signal to the subsequent processing core while receiving the program sent by the previous processing core.
  8. The data processing device according to any one of claims 4-7, characterized in that the first processing core is further used to store a first update program;
    the first processing core being used to send the instruction comprises:
    the first processing core being used to execute the first update program upon receiving the synchronization signal and to send the instruction according to the first update program.
  9. The data processing device according to claim 6 or 7, characterized in that
    each intermediate processing core is further used to store a second update program;
    the intermediate processing core is used to execute the second update program according to the received synchronization signal and, according to the second update program, send the program executed during the cycle of the previous synchronization signal to the subsequent processing core.
  10. The data processing device according to claim 9, characterized in that
    the intermediate processing core is used to execute the second update program when the number of received synchronization signals exceeds a preset number.
  11. The data processing device according to any one of claims 5-10, characterized in that
    each of the plurality of processing cores is further used to send the synchronization request signal after finishing execution of the program it received.
  12. The data processing device according to any one of claims 1-11, characterized in that
    the program obtained according to the instruction is a program segment.
  13. A chip, characterized by comprising one or more data processing devices according to any one of claims 1-12.
  14. A card board, characterized by comprising one or more chips according to claim 13.
  15. A data processing method, characterized by being applied to a data processing device, the data processing device comprising a plurality of processing cores having a preset execution order, the plurality of processing cores comprising a first processing core and at least one other processing core, the data processing method comprising:
    the first processing core sending an instruction, receiving and executing a program obtained according to the instruction;
    each of the other processing cores receiving and executing the program sent by the previous processing core in the preset execution order.
PCT/CN2021/086850 2020-04-29 2021-04-13 一种数据处理装置、芯片和数据处理方法 WO2021218623A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21795267.0A EP4145277A4 (en) 2020-04-29 2021-04-13 DATA PROCESSING DEVICE, CHIP AND DATA PROCESSING METHOD
US18/049,483 US20230069032A1 (en) 2020-04-29 2022-10-25 Data processing apparatus, chip, and data processing method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010355914.5 2020-04-29
CN202010355914.5A CN113568665B (zh) 2020-04-29 2020-04-29 一种数据处理装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/049,483 Continuation US20230069032A1 (en) 2020-04-29 2022-10-25 Data processing apparatus, chip, and data processing method

Publications (1)

Publication Number Publication Date
WO2021218623A1 true WO2021218623A1 (zh) 2021-11-04

Family

ID=78158540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086850 WO2021218623A1 (zh) 2020-04-29 2021-04-13 一种数据处理装置、芯片和数据处理方法

Country Status (4)

Country Link
US (1) US20230069032A1 (zh)
EP (1) EP4145277A4 (zh)
CN (1) CN113568665B (zh)
WO (1) WO2021218623A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901207A (zh) * 2010-07-23 2010-12-01 中国科学院计算技术研究所 异构共享存储多处理机系统的操作系统及其工作方法
CN103377085A (zh) * 2012-04-12 2013-10-30 无锡江南计算技术研究所 指令管理方法及装置、指令管理系统、运算核心
CN106547721A (zh) * 2015-09-16 2017-03-29 晨星半导体股份有限公司 例行工作的分配方法及应用其的多核心计算机
US20190173841A1 (en) * 2017-12-06 2019-06-06 Nicira, Inc. Load balancing ipsec tunnel processing with extended berkeley packet filer (ebpf)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4200915A (en) * 1978-04-05 1980-04-29 Allen-Bradley Company Program loader for programmable controller
US8103866B2 (en) * 2004-06-18 2012-01-24 Nethra Imaging Inc. System for reconfiguring a processor array
US20080235493A1 (en) * 2007-03-23 2008-09-25 Qualcomm Incorporated Instruction communication techniques for multi-processor system
US9183614B2 (en) * 2011-09-03 2015-11-10 Mireplica Technology, Llc Processor, system, and method for efficient, high-throughput processing of two-dimensional, interrelated data sets
SG10201604445RA (en) * 2011-12-01 2016-07-28 Univ Singapore Polymorphic heterogeneous multi-core architecture
CN103092788B (zh) * 2012-12-24 2016-01-27 华为技术有限公司 多核处理器及数据访问方法
US9819544B2 (en) * 2014-10-31 2017-11-14 Distech Controls Inc. Method for configuring devices in a daisy chain communication configuration
GB2568776B (en) * 2017-08-11 2020-10-28 Google Llc Neural network accelerator with parameters resident on chip
JP2019204387A (ja) * 2018-05-25 2019-11-28 日立オートモティブシステムズ株式会社 プログラム実行制御方法およびプログラム変換装置
CN110222007B (zh) * 2019-06-20 2023-11-24 山东省计算中心(国家超级计算济南中心) 一种基于申威众核处理器的加速运行方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901207A (zh) * 2010-07-23 2010-12-01 中国科学院计算技术研究所 异构共享存储多处理机系统的操作系统及其工作方法
CN103377085A (zh) * 2012-04-12 2013-10-30 无锡江南计算技术研究所 指令管理方法及装置、指令管理系统、运算核心
CN106547721A (zh) * 2015-09-16 2017-03-29 晨星半导体股份有限公司 例行工作的分配方法及应用其的多核心计算机
US20190173841A1 (en) * 2017-12-06 2019-06-06 Nicira, Inc. Load balancing ipsec tunnel processing with extended berkeley packet filer (ebpf)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4145277A4 *

Also Published As

Publication number Publication date
US20230069032A1 (en) 2023-03-02
CN113568665B (zh) 2023-11-17
EP4145277A4 (en) 2023-08-16
CN113568665A (zh) 2021-10-29
EP4145277A1 (en) 2023-03-08

Similar Documents

Publication Publication Date Title
US8209690B2 (en) System and method for thread handling in multithreaded parallel computing of nested threads
US9715475B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
US6363453B1 (en) Parallel processor with redundancy of processor pairs
US10089259B2 (en) Precise, efficient, and transparent transfer of execution between an auto-generated in-line accelerator and processor(s)
US20080098180A1 (en) Processor acquisition of ownership of access coordinator for shared resource
JPH0630094B2 (ja) マルチプロセツサ・システム
US20110265093A1 (en) Computer System and Program Product
CN112199173B (zh) 双核cpu实时操作系统数据处理方法
JP2003263331A (ja) マルチプロセッサシステム
EP3662376B1 (en) Reconfigurable cache architecture and methods for cache coherency
WO2021218623A1 (zh) 一种数据处理装置、芯片和数据处理方法
CN109992539B (zh) 双主机协同工作装置
KR100978082B1 (ko) 공유 메모리형 멀티 프로세서에 있어서의 비동기 원격 절차 호출 방법 및 비동기 원격 절차 호출 프로그램을 기록한 컴퓨터로 판독 가능한 기록 매체
WO2021218492A1 (zh) 任务分配方法、装置、电子设备及计算机可读存储介质
CN110647357B (zh) 同步多线程处理器
WO2021174446A1 (zh) 一种数据处理装置及数据处理方法
JP2008276322A (ja) 情報処理装置、情報処理システムおよび情報処理方法
WO2007088581A1 (ja) 共有メモリ型マルチプロセッサにおける手続き呼び出し方法、手続き呼び出しプログラムおよび記録媒体
Zhang et al. PMD
CN115658601A (zh) 多核处理器系统及其控制方法
JP2954671B2 (ja) メモリ管理用制御装置
JP2021117577A (ja) 情報処理装置、情報処理方法およびプログラム
JPH09218859A (ja) マルチプロセッサ制御システム
CN117687744A (zh) 一种硬件事务内存中对事务进行动态调度的方法
JP2008276321A (ja) 情報処理システムおよび情報処理方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21795267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021795267

Country of ref document: EP

Effective date: 20221129