WO2021036421A1 - Synchronization signal generation circuit, chip, and synchronization method and device for a multi-core architecture - Google Patents

Synchronization signal generation circuit, chip, and synchronization method and device for a multi-core architecture

Info

Publication number
WO2021036421A1
WO2021036421A1 PCT/CN2020/096390 CN2020096390W WO2021036421A1 WO 2021036421 A1 WO2021036421 A1 WO 2021036421A1 CN 2020096390 W CN2020096390 W CN 2020096390W WO 2021036421 A1 WO2021036421 A1 WO 2021036421A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
synchronization signal
node
processing
signal generating
Prior art date
Application number
PCT/CN2020/096390
Other languages
English (en)
French (fr)
Inventor
王维伟
罗飞
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司 filed Critical 北京希姆计算科技有限公司
Priority to EP20856466.6A priority Critical patent/EP3989038A4/en
Publication of WO2021036421A1 publication Critical patent/WO2021036421A1/zh
Priority to US17/587,770 priority patent/US12072730B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04Generating or distributing clock signals or signals derived directly therefrom
    • G06F1/12Synchronisation of different clock signals provided by a plurality of clock generators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3243Power saving in microcontroller unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17325Synchronisation; Hardware support therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04JMULTIPLEX COMMUNICATION
    • H04J3/00Time-division multiplex systems
    • H04J3/02Details
    • H04J3/06Synchronising arrangements
    • H04J3/0635Clock or time synchronisation in a network
    • H04J3/0638Clock or time synchronisation among nodes; Internode synchronisation
    • H04J3/0644External master-clock
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a synchronous signal generating circuit, a chip, and related methods and devices based on a multi-core architecture.
  • the chip is the cornerstone of data processing, and it fundamentally determines the ability to process data.
  • the parallel computing model between the chips is also a current research hotspot.
  • the BSP computing model (Bulk Synchronous Parallel computing model) is a globally synchronous parallel computing model. It can be used in system-level applications, such as parallel computing in a computer cluster composed of multiple servers, and in chip-level applications, such as parallel computing on multi-core chips.
  • the BSP computing model can effectively avoid deadlocks, hide the specific interconnection network topology, and simplify the communication protocol. It can also use barrier synchronization.
  • implementing global synchronization in hardware at a controllable coarse-grained level provides an effective way to execute tightly coupled synchronous parallel algorithms.
  • each node has to send messages to the controller through the bus or on-chip network, which has a large delay.
  • the controller needs to process the message sent by each node. Therefore, the signal transmission efficiency between the controller and the nodes is relatively low, the delay is large, and the error of the overall system increases. Reducing this error often increases the complexity of the software that controls each node, which further lowers the efficiency of the overall system.
  • the present invention is completed in view of the above-mentioned state of the art, and its purpose is to provide a multi-core architecture-based synchronous parallel device and a control method thereof that can reduce overall delay, improve signal transmission efficiency, and reduce transmission burden.
  • the first aspect of the present disclosure provides a synchronization signal generation circuit, characterized in that: the synchronization signal generation circuit is used to generate synchronization signals for M node groups, each node group includes at least one node, and M is an integer greater than or equal to 1; the synchronization signal generation circuit includes a synchronization signal generation unit and M group preparation signal generation units; the M group preparation signal generation units correspond one-to-one to the M node groups; a first group preparation signal generation unit among the M group preparation signal generation units is connected to the K nodes in a first node group to be synchronized; the first group preparation signal generation unit is used to generate a first to-be-started signal for the first node group, where K is an integer greater than or equal to 1; the output ends of the M group preparation signal generation units are connected to the synchronization signal generation unit; the synchronization signal generation unit generates a first synchronization signal according to the first to-be-started signal, where the first synchronization signal is used to instruct
  • the M group preparation signal generation units correspond one-to-one to the M node groups, the first group preparation signal generation unit corresponds to the K nodes in the first node group to be synchronized and generates the first to-be-started signal for that node group, and the synchronization signal generation unit generates the first synchronization signal according to the first to-be-started signal.
  • the first group preparation signal generation unit being configured to generate the first to-be-started signal for the first node group to be synchronized includes: the first group preparation signal generation unit is configured to generate the first to-be-started signal according to the preparation signals of all K nodes in the first node group to be synchronized. In this case, the first to-be-started signal indicates whether all K nodes in the first node group are in the idle state, which ensures that all K nodes in the first node group can perform synchronous parallel operations.
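The behavior described above amounts to a logical AND over the K per-node preparation signals. The following is a minimal Python sketch of that idea, not the circuit itself. It assumes each preparation signal is modeled as a boolean where `True` means the node is idle and ready; the function name `group_to_be_started` is hypothetical and does not appear in the disclosure.

```python
from typing import Sequence


def group_to_be_started(ready_signals: Sequence[bool]) -> bool:
    """Return True only when every node in the group has raised its preparation signal.

    Assumption: True models a node that is idle and ready; the actual
    electrical polarity is not modeled here.
    """
    if len(ready_signals) < 1:
        # The disclosure requires K >= 1 nodes per group.
        raise ValueError("a node group contains at least one node (K >= 1)")
    return all(ready_signals)
```

For example, `group_to_be_started([True, False])` stays `False` until the second node also reports ready.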
  • the synchronization signal generation unit includes: M shielding units (also called masking units), M to-be-synchronized group indication units, and M group synchronization signal generation units
  • the M to-be-synchronized group indication units are respectively connected to the M shielding units; the input end of each of the M shielding units is connected to the output ends of the M group preparation signal generation units; the output ends of the M shielding units are respectively connected to the corresponding group synchronization signal generation units among the M group synchronization signal generation units; the first shielding unit of the M shielding units outputs the quasi-synchronization signal of the first group according to the indication of the first to-be-synchronized group indication unit connected to it; the first group synchronization signal generation unit among the M group synchronization signal generation units generates the synchronization signal of the first group according to the quasi-synchronization signal of the first group.
  • the first shielding unit shields the to-be-started signals of the other groups and outputs the quasi-synchronization signal of the first group according to the indication of the first to-be-synchronized group indication unit connected to it, and the first group synchronization signal generation unit generates the synchronization signal of the first group according to the quasi-synchronization signal of the first group, so that the first shielding unit determines whether the synchronization signal of the first group is generated.
  • the to-be-synchronized group indication unit includes a register; the register includes at least M register bits, and the M register bits correspond one-to-one to the M node groups; among the M register bits, the register bit corresponding to the first node group to be synchronized is configured as a first value, and the register bits corresponding to the node groups other than the first node group to be synchronized are configured as a second value.
  • the register bit configured as the first value can be selected as required.
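The interaction between the register bits and a shielding unit can be illustrated with a small bit-mask model. This is a sketch under stated assumptions, not the disclosed circuit: the "first value" is assumed to be 1 and the "second value" 0, a group's to-be-started signal is a boolean, and the quasi-synchronization signal is assumed to be the AND of the selected (unshielded) signals. The name `quasi_sync` is hypothetical.

```python
from typing import Sequence


def quasi_sync(group_ready: Sequence[bool], register_bits: int) -> bool:
    """Model one shielding unit driven by an M-bit register.

    group_ready[j] is the to-be-started signal of node group j.
    Bit j of register_bits set to 1 (assumed "first value") selects group j;
    a 0 bit (assumed "second value") shields that group's signal.
    """
    m = len(group_ready)
    selected = [group_ready[j] for j in range(m) if (register_bits >> j) & 1]
    # Assumption: if no group is selected, no quasi-synchronization signal is output.
    return bool(selected) and all(selected)
```

With `register_bits = 0b001`, only group 0 is considered, so the output follows group 0's to-be-started signal regardless of what the other groups report.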
  • the first synchronization signal is used to instruct the K nodes in the first node group to start synchronization, including:
  • the first synchronization signal is used to instruct the K nodes in the first node group to start calculation at the same time, or to start data transmission at the same time.
  • the K nodes in the first node group start calculating or transmitting data at the same time.
  • a second aspect of the present disclosure provides a chip including the synchronization signal generating circuit of the first aspect described above, and N processing nodes, the N processing nodes are divided into M processing node groups, wherein the N is an integer greater than 1, and M is less than or equal to N.
  • the N processing nodes are divided into M processing node groups. Therefore, the synchronization signal generation circuit can correspond one-to-one to the M processing node groups and can receive signals from and send signals to the M processing node groups.
  • in the chip of the second aspect of the present disclosure, optionally, it further includes N first communication hardware lines, which are used to transmit the preparation signals sent from the N processing nodes to the corresponding group preparation signal generation units.
  • the group preparation signal generation unit is independently connected to each processing node in the corresponding processing node group, which ensures that the preparation signal generated by each processing node is sent accurately to the corresponding group preparation signal generation unit and increases the transmission rate of the preparation signal.
  • in the chip of the second aspect of the present disclosure, optionally, it further includes N second communication hardware lines, which are used to transmit the synchronization signals from the synchronization signal generation unit to the corresponding processing nodes.
  • the synchronization signal generation unit is independently connected to each processing node in the corresponding processing node group, which ensures that the synchronization signal generated by the synchronization signal generation unit is sent accurately to each processing node in the corresponding processing node group and increases the transmission rate of the synchronization signal.
  • the chip involved in the second aspect of the present disclosure optionally, it further includes a control unit configured to change the setting of the register.
  • the mask unit can be controlled by the control unit controlling the value of the register bit of the register.
  • the chip involved in the second aspect of the present disclosure may optionally further include a control unit configured to control the execution and distribution of tasks of each processing node in the processing node group.
  • the tasks of each processing node in the processing node group can be allocated by the control unit.
  • the N processing nodes include a RISC-V core.
  • the chip can be flexibly programmed based on general RISC-V basic instructions and extended instructions.
  • the third aspect of the present disclosure provides a synchronous parallel control method based on a multi-core architecture, which is based on the synchronous parallel device described in any one of the first aspects above, including: when a first node in the node group enters the idle state, the first node sends a preparation signal to the group preparation signal generation unit of the group to which the first node belongs; in response to all nodes in the node group having sent preparation signals, the group preparation signal generation unit generates a to-be-started signal; the synchronization signal generation unit generates a synchronization signal according to the to-be-started signal; all nodes in the node group start synchronization in response to the received synchronization signal.
  • in this case, when the first node enters the idle state, it sends a preparation signal to the group preparation signal generation unit of the group to which the first node belongs; after the group preparation signal generation unit receives the preparation signals of all the nodes in the corresponding node group, it generates the to-be-started signal, and the synchronization signal generation unit then generates the corresponding synchronization signal. All the nodes in the node group start synchronization in response to the received synchronization signal. Therefore, the synchronization signal generation unit can process multiple to-be-started signals at the same time, and the group preparation signal generation unit and the synchronization signal generation unit can transmit signals independently.
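The four steps of the method (a node reports idle, the group collects all preparation signals, the to-be-started signal is raised, the synchronization signal releases the group) can be traced with a small software model. This is only an illustrative sketch: the class `SyncModel`, its method names, and the choice to clear the preparation signals once the synchronization signal fires are assumptions, and node names within a group are assumed to be unique.

```python
from typing import Dict, List, Set


class SyncModel:
    """Software model of per-group preparation and synchronization signals."""

    def __init__(self, groups: Dict[str, List[str]]) -> None:
        self.groups = {g: list(nodes) for g, nodes in groups.items()}
        self.ready: Dict[str, Set[str]] = {g: set() for g in groups}
        self.sync_count: Dict[str, int] = {g: 0 for g in groups}

    def node_idle(self, group: str, node: str) -> bool:
        """Record a preparation signal; return True if a synchronization signal fires."""
        if node not in self.groups[group]:
            raise ValueError(f"{node} is not in group {group}")
        # Step 1: the idle node sends its preparation signal to its group's unit.
        self.ready[group].add(node)
        # Step 2: the to-be-started signal is raised once all nodes are ready.
        to_be_started = len(self.ready[group]) == len(self.groups[group])
        if to_be_started:
            # Step 3: the synchronization signal is generated for this group.
            self.sync_count[group] += 1
            # Step 4 (assumption): nodes leave the idle state, so their
            # preparation signals are cleared for the next cycle.
            self.ready[group].clear()
        return to_be_started
```

Because each group keeps its own state, one group can be released while another is still waiting for its remaining nodes.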
  • the fourth aspect of the present disclosure provides a synchronous parallel device based on a multi-core architecture, characterized by comprising: a processing module having N processing nodes, the N processing nodes being divided into M processing node groups, where N is an integer greater than 1 and M is less than or equal to N; and a synchronization signal generation module, which includes a synchronization signal generation unit and M group preparation signal generation units; the M group preparation signal generation units correspond one-to-one to the M node groups; a first group preparation signal generation unit among the M group preparation signal generation units is connected to the K nodes in a first node group to be synchronized; the first group preparation signal generation unit is used to generate a first to-be-started signal for the first node group to be synchronized, where K is an integer greater than or equal to 1; the output ends of the M group preparation signal generation units are connected to the synchronization signal generation unit; the synchronization signal generation unit generates a first
  • the M group preparation signal generation units correspond one-to-one to the M node groups, the first group preparation signal generation unit corresponds to the K nodes in the first node group to be synchronized and generates the first to-be-started signal for that node group, and the synchronization signal generation unit generates the first synchronization signal according to the first to-be-started signal.
  • the processing node is at least one of a processing circuit, a processing chip, and a server. Therefore, it can be selected according to the actual application.
  • a fifth aspect of the present disclosure provides a computing device, which includes a processor and a memory, and the processor executes computer instructions stored in the memory, so that the computing device executes the synchronous parallel control method described above.
  • the sixth aspect of the present disclosure provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, implements the steps of the synchronous parallel control method described above.
  • the seventh aspect of the present disclosure provides a computer program product, which includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the synchronous parallel control method described above.
  • a synchronous parallel device, method and system based on a multi-core architecture that can reduce overall delay, improve signal transmission efficiency, and reduce bus burden.
  • FIG. 1 is a schematic diagram showing a scene for processing data of a synchronous parallel device involved in an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram showing a functional block diagram of a synchronous parallel device according to an embodiment of the present disclosure.
  • FIG. 3 is a functional block diagram showing the synchronization signal generation module of the synchronization parallel device according to the embodiment of the present disclosure.
  • FIG. 4 is a functional block diagram showing a group preparation signal generating unit of the synchronous parallel device according to the embodiment of the present disclosure.
  • FIG. 5 is a functional block diagram showing the synchronization signal generating unit of the synchronous parallel device according to the embodiment of the present disclosure.
  • FIG. 6 is a functional block diagram showing a synchronization signal generating unit of the synchronization parallel device according to the embodiment of the present disclosure.
  • FIG. 7 is a functional block diagram showing the registers of the synchronous parallel device according to the embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram showing partial signals of a synchronous parallel device according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart showing a synchronous parallel method involved in an embodiment of the present disclosure.
  • 1...processing module, 10...processing group, 101...processor, 2...synchronization signal generation module, 21...group preparation signal generation unit, 22...synchronization signal generation unit, 221...filter, 222...register, 223...synchronization signal generator, 3...control module.
  • FIG. 1 is a schematic diagram showing an application scenario for processing data of a synchronous parallel device S involved in an embodiment of the present disclosure.
  • the synchronous parallel device S may include a processing module 1, a synchronization signal generating module 2 and a control module (sometimes also referred to as a “control unit”) 3.
  • the control module 3 acquires computing tasks that need to be processed (for example, massive image processing, big data processing, etc.). Then, the control module 3 can allocate computing tasks to each processing node (for example: server, processor 101 or processing chip, etc.), and configure the synchronization signal generation module 2.
  • after receiving the computing tasks from the control module, the processing nodes on the chip process their respective tasks according to the synchronization signal of the synchronization signal generation module, and generate their respective preparation signals when processing is completed.
  • the synchronization signal generation module 2 can receive the preparation signal sent by each processing node. After receiving the preparation signal, according to the configuration of the synchronization signal generation module 2, the synchronization signal generation module 2 generates a synchronization signal and sends it to the processing node, so that The processing node starts to process the next computing task after receiving the synchronization signal.
  • the processing module 1 may include at least one processing group (node group) 10; the processing group 10 may have multiple processors (nodes) 101 (sometimes also called "cores"); each processor 101 may have an idle state and a non-idle state, and generates a preparation signal (ready signal) when it enters the idle state (see FIG. 2 described later).
  • the synchronization signal generation module 2 may include a group preparation signal generation unit 21 and a synchronization signal generation unit 22, wherein the group preparation signal generation unit 21 and each processor 101 in the processing group 10 can be independently connected to each other in hardware, and can be based on The preparation signal generated by each processor 101 in the processing group 10 generates a group preparation signal (to-be-started signal).
  • the synchronization signal generating unit 22 and each processor 101 in the processing group 10 can be connected independently of each other; the synchronization signal generating unit 22 can receive the group preparation signal generated by the group preparation signal generating unit 21 and generate a group synchronization signal (synchronization signal) according to the group preparation signal, wherein when each processor 101 in the processing group 10 receives the group synchronization signal, each processor 101 in the processing group 10 enters a non-idle state.
  • the non-idle state includes a calculation state and a transmission state. For BSP synchronization, a processor in the non-idle state can calculate first and then transmit, or transmit first and then calculate. Both calculation and transmission are completed within one synchronization cycle.
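As a software analogy only, a BSP synchronization cycle (calculate, transmit, then wait for the whole group) can be sketched with Python's standard `threading.Barrier`. The disclosure performs this synchronization in dedicated hardware; the barrier below merely stands in for the group synchronization signal. The function name `run_bsp`, the ring-shaped transmission pattern, and the "add one" calculation are all assumptions chosen for illustration.

```python
import threading
from typing import List


def run_bsp(values: List[int], supersteps: int) -> List[int]:
    """Run a toy BSP program: each node adds 1, then passes its result to the next node."""
    n = len(values)
    if n == 0:
        return []
    current = list(values)
    inbox = [0] * n
    barrier = threading.Barrier(n)  # stands in for the group synchronization signal

    def worker(i: int) -> None:
        for _ in range(supersteps):
            local = current[i] + 1          # calculation state
            inbox[(i + 1) % n] = local      # transmission state (to the next node)
            barrier.wait()                  # all nodes idle: cycle ends
            current[i] = inbox[i]           # pick up the value received this cycle
            barrier.wait()                  # all inboxes consumed before the next cycle

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return current
```

The second `barrier.wait()` keeps a fast node from overwriting a neighbor's inbox before that neighbor has read it, which is the same ordering guarantee the synchronization cycle provides.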
  • the idle state refers to the state when the processing node in the processing group in the processing module has completed the calculation of the received calculation task and transmitted the processing result to the next processing core, and no task can be executed.
  • a low-level signal may be issued when the processor 101 is in an idle state, and a high-level signal may be issued when the processor 101 is not idle (see FIG. 8 described later).
  • on entering the idle state, each processor 101 in the processing group 10 independently sends a preparation signal to the group preparation signal generating unit 21 that corresponds to the processing group 10 in the synchronization signal generation module 2.
  • when the group preparation signal generating unit 21 has received the preparation signals of all processors 101 in the corresponding processing group 10, it generates the group preparation signal; the synchronization signal generating unit 22 then receives the group preparation signal, generates the corresponding group synchronization signal, and sends it to all the processors 101 in the processing group 10, so that the processors 101 enter a non-idle state.
  • the synchronization signal generating unit 22 can process multiple group preparation signals at the same time, and the group preparation signal generating unit 21 and the synchronization signal generating unit 22 can transmit signals independently within the synchronous parallel device S.
  • the synchronous parallel device S may also include a control module 3 for distributing computing tasks.
  • the control module 3 can be communicatively connected with the processing module 1 (described later) and the synchronization signal generating module 2 (described later); the control module 3 is also communicatively connected with the register 222 (described later), and the state of the register 222 is controlled by the control module 3. Therefore, the control module 3 is used to allocate computing tasks to the processing module 1 and to control the state of the register 222.
  • the control module can determine how many processing groups need to participate in the processing based on the estimated amount of computation in the obtained computing task, then divide the computing task into multiple portions according to the computing power of each processor in each processing group, and send the portions to the corresponding processors. As a result, the time each participating processor spends in the non-idle state is approximately the same, so that the computing resources of the processing module are used effectively and waste of computing resources is reduced.
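One simple way to divide a task "according to the computing power of each processor" is a proportional split. The sketch below is an illustration under assumptions, not the allocation policy of the disclosure: computing power is modeled as a positive integer weight per processor, the task as a count of equal-sized work items, and leftover items go to the most capable processors first. The name `split_by_capacity` is hypothetical.

```python
from typing import List, Sequence


def split_by_capacity(total_items: int, capacities: Sequence[int]) -> List[int]:
    """Split total_items work items in proportion to integer capacity weights."""
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if not capacities or any(c <= 0 for c in capacities):
        raise ValueError("capacities must be a non-empty list of positive integers")
    total_cap = sum(capacities)
    # Integer floor of each processor's proportional share.
    shares = [total_items * c // total_cap for c in capacities]
    # Hand the remaining items to the most capable processors first.
    leftover = total_items - sum(shares)
    order = sorted(range(len(capacities)), key=lambda i: capacities[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares
```

If each processor's throughput is proportional to its weight, the resulting shares take roughly the same time to finish, which is the goal stated above.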
  • control module 3 can read and write the register 222 through a programming language, so as to change the validity of the flag bit (sometimes referred to as "register bit") corresponding to the processing group in the register 222.
  • the register 222 can be controlled by the control module 3 (described later).
  • the above-mentioned programming language may be C language, assembly language, Verilog language or other hardware languages.
  • each processing node has to send messages to the controller through the fabric, such as a bus or a network on chip, which has a large delay.
  • when the number of processing nodes is large, the load on the bus or the on-chip network increases significantly, and the controller needs to process the message sent by each node; therefore, the signal transmission efficiency between the controller and the nodes is low, the delay is large, and the error of the overall system increases.
  • in traditional BSP operation, reducing this error requires increasing the complexity of the software that controls each node, which further reduces the efficiency of the overall system.
  • the processor 101 in the idle state sends the ready signal to the controller through the bus or the on-chip network.
  • this increases the load on the bus or the on-chip network and brings a large delay, and the controller needs to process the ready signals of each processor 101 separately, which further increases the delay.
  • the synchronization signal generation module 2 is added, so that the synchronization signal is generated in the synchronization signal generation module 2, and the synchronization signal generation module 2 is directly connected to each processor 101 through a dedicated and independent line (for example, a communication hardware line), without passing through or occupying a bus or a network on chip for data transmission.
  • the synchronization signal generation module 2 is provided with a group preparation signal generation unit 21 and a synchronization signal generation unit 22, so that the synchronization signal generation module 2 can simultaneously process the preparation signals of the respective processors 101 in all the processing groups 10 and accurately send a group synchronization signal (described later) to the processing group that requires synchronous calculation.
  • the processing node may be at least one of a processing circuit, a processor, a processing chip, and a server. In this way, different processing methods can be selected according to the actual application.
  • the processor 101 may be a central processing unit (CPU), a digital signal processor (DSP), a tensor processing unit (TPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), or a dedicated custom chip, etc.
  • the processing chip may be a processor 101 integrated with multiple cores.
  • the processing chip may be a chip based on a RISC-V multi-core architecture.
  • each processor 101 in the chip is an extended RISC-V core and can support the RISC-V general instruction set.
  • the chip can be flexibly programmed based on general RISC-V basic instructions and extended instructions.
  • the server may also be a local server with computing capability, a cloud server, or a server group distributed in different physical locations and connected via a network.
  • the synchronous parallel device S may include: a processing module having N processing nodes, the N processing nodes being divided into M processing node groups, where N is an integer greater than 1 and M is less than or equal to N; and a synchronization signal generation module that includes a synchronization signal generation unit and M group preparation signal generation units; the M group preparation signal generation units correspond one-to-one to the M node groups; the first group preparation signal generation unit of the M group preparation signal generation units is connected to the K nodes in the first node group to be synchronized; the first group preparation signal generation unit is used to generate the first to-be-started signal for the first node group to be synchronized, and K is an integer greater than or equal to 1; the output ends of the M group preparation signal generation units are connected to the synchronization signal generation unit; the synchronization signal generation unit generates a first synchronization signal according to the first to-be-started signal, and the first synchronization signal is used to instruct the K nodes in the first node group to start synchronization.
  • the synchronization signal generation module 2 may be implemented in the form of a synchronization signal generation circuit.
  • the synchronization signal generation circuit may include: a synchronization signal generation unit and M group preparation signal generation units; the M group preparation signal generation units correspond one-to-one to the M node groups; a first group preparation signal generation unit of the M group preparation signal generation units is connected to the K nodes in a first node group to be synchronized and is used to generate a first to-be-started signal for the first node group to be synchronized, where K is an integer greater than or equal to 1.
  • the output ends of the M group preparation signal generating units are connected to the synchronization signal generating unit; the synchronization signal generating unit generates a first synchronization signal according to the first to-be-started signal, and the first synchronization signal is used to instruct the K nodes in the first node group to start synchronization.
  • the first group preparation signal generating unit being configured to generate the first to-be-started signal for the first node group to be synchronized includes: the first group preparation signal generating unit generates the first to-be-started signal according to the preparation signals of all K nodes of the first node group to be synchronized.
  • the synchronization signal generating unit may include: M shielding units, M to-be-synchronized group indicating units, and M group synchronization signal generating units; the M to-be-synchronized group indicating units are respectively connected to the M shielding units; the input end of each of the M shielding units is connected to the output ends of the M group preparation signal generating units; the output ends of the M shielding units are respectively connected to the corresponding group synchronization signal generating units of the M group synchronization signal generating units; a first shielding unit of the M shielding units outputs a first group quasi-synchronization signal according to the indication of the first to-be-synchronized group indicating unit connected to it; a first group synchronization signal generating unit of the M group synchronization signal generating units generates a first group synchronization signal according to the first group quasi-synchronization signal.
  • the to-be-synchronized group indicating unit may include a register; the register may include at least M register bits, and the M register bits correspond one-to-one to the M node groups.
  • among the M register bits, the register bit corresponding to the first node group to be synchronized may be configured as a first value, and the register bits corresponding to the node groups other than the first node group to be synchronized may be configured as a second value.
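  • as a rough illustration of the register configuration described above, the following sketch encodes the M register bits as a Python integer. Taking the first value as 1 and the second value as 0, numbering the node groups from 0, and the function name `build_sync_register` are all assumptions made for this example; the text does not fix any of them.

```python
def build_sync_register(groups_to_sync, m):
    """Return an M-bit register value with one bit per node group.

    Assumptions for illustration: bit i corresponds to node group i
    (0-based), the "first value" is 1 (group is to be synchronized)
    and the "second value" is 0 (group is ignored).
    """
    value = 0
    for g in groups_to_sync:
        if not 0 <= g < m:
            raise ValueError(f"group index {g} is outside 0..{m - 1}")
        value |= 1 << g
    return value


# Only node group 0 is to be synchronized out of M = 4 groups.
reg = build_sync_register([0], m=4)
print(format(reg, "04b"))  # prints 0001
```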
  • the first synchronization signal being used to instruct the K nodes in the first node group to start synchronization may include: the first synchronization signal is used to instruct the K nodes in the first node group to start computing at the same time, or to start data transmission at the same time.
  • the control module 3 may be integrated in the chip.
  • the chip may include multiple cores and control modules 3.
  • the chip may include the synchronization signal generation circuit described above and N processing nodes.
  • the N processing nodes are divided into M processing node groups, where N is an integer greater than 1, and M is less than or equal to N.
  • the N processing nodes include RISC-V cores.
  • the chip may also include N first communication hardware lines, and the N first communication hardware lines are used to transmit preparation signals sent from the N processing nodes to the corresponding group preparation signal generating unit.
  • the chip may also include N second communication hardware lines, and the N second communication hardware lines are used to transmit the synchronization signal sent from the synchronization signal generating unit to the corresponding processing node.
  • the chip may also include a control unit, which is used to change the settings of the registers.
  • the control unit may be used to control the execution and distribution of tasks of each processing node in the processing node group.
  • FIG. 2 is a functional block diagram showing the synchronous parallel device S according to an embodiment of the present disclosure.
  • the processing module 1 may include at least one processing group 10, and the processing group 10 may have multiple processors 101.
  • each processor 101 in the processing group 10 may have an idle state and a non-idle state, and generate a ready signal when the processor 101 enters the idle state.
  • when each processor 101 in the processing group 10 receives a group synchronization signal, each processor 101 in the processing group 10 enters a non-idle state.
  • the processing module 1 may be a chip with a multi-core architecture that is divided into multiple processing groups 10, and each processor 101 in each processing group 10 has an independent in-core memory. In this case, the processor 101 can use its own in-core memory to store and access data temporarily.
  • each processing group 10 may also have a local memory.
  • the processing group 10 can use the local memory to complete data access without having to exchange data with the external memory through the Fabric, so the processing efficiency of the processing group can be significantly improved.
  • the multiple processors 101 on the chip as the processing module 1 are grouped, and the number of divided processing groups refers to the maximum number of groups on a chip. There is no particular limitation on the number of processing groups. The number can be pre-designed, for example, in the chip design process according to the expected application. In addition, in some examples, the number of processors 101 in each processing group may be the same. In other examples, the number of processors 101 in each processing group may also be different.
  • each processing group 10 may have at least one processor 101.
  • the number of processors 101 in each processing group 10 may be preset.
  • the chip used as the processing module 1 may be a chip based on the RISC-V multi-core architecture.
  • each processor 101 in the chip may be an extended RISC-V core supporting the RISC-V general instruction set.
  • the chip can be flexibly programmed based on common RISC-V basic instructions and extended instructions.
  • processing module 1 involved in this embodiment will be described in further detail with reference to the specific example in FIG. 2.
  • the processing module 1 includes m processing groups, namely, a processing group 10a1, a processing group 10a2, ..., a processing group 10am.
  • the m-th processing group 10am includes n processors, namely processor 101am1, processor 101am2, ..., processor 101amn.
  • a specific description will be given below by taking the processing group 10a1 in FIG. 2 as an example.
  • each processor 101 receives a computing task (here, the processor 101a11, the processor 101a12, ..., the processor 101a1n).
  • the processors 101a11 to 101a1n are connected to the synchronization signal generating module 2 independently of each other. Specifically, the processors 101a11 to 101a1n may be connected to the synchronization signal generating module 2 via independent hardware lines.
  • when the synchronization signal generation module 2 has received the preparation signals sent by all the processors 101 (processor 101a11, processor 101a12, ..., processor 101a1n) in the processing group 10a1, it sends the group synchronization signal, based on the received preparation signals, to each processor (processor 101a11, processor 101a12, ..., processor 101a1n) in the processing group 10a1. After receiving the group synchronization signal, each processor in the processing group 10a1 enters a non-idle state and starts processing its assigned computing task.
  • each processor can read and write data in the local memory of the processing group 10a1 and store temporary data there during processing. After the calculation is completed, the processor sends the calculation result to the control module 3 through the Fabric.
  • the local memory may be dedicated to the processors 101 of a single processing group 10.
  • the data of the processor 101 can be temporarily stored in the memory, and the calculation speed of the processor 101 can be improved.
  • the processors 101 in the processing group 10 can read and write data to the local memory independently of each other.
  • multiple processing groups may also share a piece of physical memory.
  • the local memory or the physical memory used may be flash memory, hard disk type memory, micro multimedia card type memory, card type memory (such as SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), replay protected memory block (RPMB), magnetic storage, magnetic disk, or optical disk.
  • the local memory or the physical memory used preferably adopts random access memory or static random access memory.
  • the local memory may also be a network storage device on the network.
  • the processing node may perform operations such as accessing the network storage device over the Internet.
  • the processor 101 may generate and send a ready signal to indicate that it is in the idle state when it completes calculation and transmission and enters the idle state. In other examples, the processor 101 may generate and send a ready signal at predetermined time intervals while in the idle state. In addition, in some examples, the ready signal may be generated and sent continuously while the processor 101 is in the idle state.
  • each processor 101 in the processing group 10 may enter the non-idle state from the idle state after receiving the group synchronization signal, and start to process the obtained computing task.
  • the non-idle state may also include a transmission state that follows the calculation state.
  • the processor 101 can enter the transmission state to receive and send data after the calculation is completed. Therefore, the processor 101 can send the calculated data and receive the calculation task.
  • when the processor 101 completes the calculation and transmission tasks, it enters the idle state, and a processor 101 in the idle state can receive new calculation tasks at any time.
  • the transmission state can also be set before the calculation state.
  • the period in which all the processors 101 in a processing group 10 change from the idle state to the non-idle state after receiving a group synchronization signal, lasting until the next group synchronization signal arrives, is sometimes referred to as a "superstep".
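  • the state changes described above can be sketched as a small state machine. This is only an illustrative model: the class and method names are hypothetical, the ready signal is modelled as being held for as long as the processor is idle (one of the options mentioned above), and calculation is assumed to precede transmission.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    COMPUTE = "compute"
    TRANSMIT = "transmit"


class Processor:
    """Toy model of one processor 101 (hypothetical class)."""

    def __init__(self):
        self.state = State.IDLE
        self.supersteps = 0

    @property
    def ready(self):
        # The ready signal is asserted for as long as the processor is idle.
        return self.state is State.IDLE

    def on_group_sync(self):
        # A group synchronization signal moves an idle processor into the
        # non-idle state and opens a new superstep.
        if self.state is State.IDLE:
            self.state = State.COMPUTE
            self.supersteps += 1

    def finish_compute(self):
        if self.state is State.COMPUTE:
            self.state = State.TRANSMIT

    def finish_transmit(self):
        # After calculation and transmission the processor is idle again.
        if self.state is State.TRANSMIT:
            self.state = State.IDLE


p = Processor()
p.on_group_sync()
p.finish_compute()
p.finish_transmit()
print(p.state.value, p.ready, p.supersteps)  # prints: idle True 1
```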
  • the processor 101 may include one or more different types of processors.
  • a processor 101 can be a central processing unit (CPU), a tensor processing unit (TPU) or a graphics processing unit (GPU), etc., or it can be a combination of a central processing unit (CPU) and a graphics processing unit (GPU).
  • the processor 101 may also be a customized chip, for example, a chip that supports RISC-V general instruction set and extended instruction set.
  • FIG. 3 is a functional block diagram showing the synchronization signal generation module 2 of the synchronization parallel device S according to the embodiment of the present disclosure.
  • the synchronization signal generation module 2 may include a group preparation signal generation unit 21 and a synchronization signal generation unit 22.
  • the synchronization signal generating module 2 may be connected to all the processors 101 in the processing module 1 independently of each other.
  • the synchronization signal generating module 2 can be connected to all the processors 101 in the processing module 1 through dedicated, independent hardware lines. In this case, the signal transmissions between the individual processors 101 and the synchronization signal generation module 2 do not interfere with each other, and they do not pass through the Fabric, which can improve the overall operating efficiency.
  • the synchronization signal generating module 2 may be a logic circuit for generating a group synchronization signal.
  • the synchronization signal generating module 2 can be implemented by a field-programmable gate array (FPGA).
  • the group synchronization signal is generated by applying logical operations such as "or", "and", or "not", or combinations thereof, to the preparation signals input to the synchronization signal generating module 2.
  • the group synchronization signal may be a pulse signal or a level signal.
  • the synchronization signal generation module 2 includes a group preparation signal generation unit and a synchronization signal generation unit.
  • each group preparation signal generation unit (group preparation signal generation unit 21a1, group preparation signal generation unit 21a2, ..., group preparation signal generation unit 21am) receives the preparation signals sent by all the processors 101 of its corresponding processing group (processing groups 10a1 to 10am), generates a group preparation signal, and sends it to the synchronization signal generating unit 22; the synchronization signal generating unit 22 then generates a group synchronization signal and sends it to each processor in the corresponding processing group.
  • the group preparation signal generating unit 21a1 will be described as an example.
  • the group preparation signal generating unit 21a1 receives the preparation signals from all processors (processor 101a11, processor 101a12, ..., processor 101a1n), generates a group preparation signal based on all the received preparation signals, and sends the group preparation signal to the synchronization signal generating unit 22.
  • the synchronization signal generating unit 22 generates a group synchronization signal based on the received respective group preparation signals and sends the group synchronization signal to each processor in the corresponding processing group.
  • when the synchronization signal generation module 2 sends the group synchronization signal to each processor (for example, the processor 101a11, the processor 101a12, ..., the processor 101a1n), these processors process their assigned computing tasks synchronously.
  • FIG. 4 is a functional block diagram showing the group preparation signal generating unit 21 of the synchronous parallel device S according to the embodiment of the present disclosure.
  • the group preparation signal generating unit 21 and each processor 101 in the processing group 10 can be connected independently of each other.
  • the processors 101a11 to 101a1n may be connected to the group preparation signal generating unit 21 via independent hardware lines, respectively.
  • the group preparation signal generating unit 21 may generate a group preparation signal based on the preparation signal generated by each processor 101 in the processing group 10.
  • the group preparation signal generating unit 21 may generate a group preparation signal after receiving preparation signals of all processors 101 in the processing group 10 (for example, processors 101a11, processors 101a12, ..., processors 101a1n). In this case, it can be known from the group preparation signal whether all the processors 101 in the processing group 10 have entered an idle state, thereby ensuring that all the processors 101 in the processing group 10 can perform synchronous and parallel operations.
  • the group preparation signal generating unit 21 may correspond to the processing group 10, and the respective processors 101 in the corresponding processing group 10 are independently connected to the group preparation signal generating unit 21.
  • that the group preparation signal generation unit 21 corresponds to the processing group 10 means that one group preparation signal generation unit 21 corresponds to one processing group 10, and that the number of group preparation signal generation units 21 equals the number of processing groups 10.
  • the group preparation signal generation unit 21 can simultaneously receive and process the preparation signals of the respective processors 101 in the corresponding processing group 10, so that the synchronization signal generation module 2 can simultaneously receive and process the preparation signals from all the processing groups 10 in the processing module 1.
  • each processor (processor 101a11, processor 101a12, ..., processor 101a1n) in the processing group 10a1 generates a preparation signal (preparation signal a11, preparation signal a12, ..., preparation signal a1n), and each signal is sent independently, through its own hardware line, to the group preparation signal generation unit 21a1 corresponding to the processing group 10a1. After the group preparation signal generation unit 21a1 has received the preparation signals of all processors in the processing group 10a1, that is, preparation signal a11, preparation signal a12, ..., preparation signal a1n, it generates the group preparation signal a1 to indicate that all the processors 101 in the processing group 10a1 are in the idle state. At this time, all the processors 101 in the processing group 10a1 are waiting for synchronization.
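  • the behaviour of a group preparation signal generating unit can be sketched as a logical AND over the preparation signals of its group. The text only states that logical operations such as "or", "and", or "not" are used, so the plain AND below, the level-signal model (each signal is simply True or False), and the function name are assumptions for illustration.

```python
def group_preparation_signal(ready_signals):
    """Return True only when every processor of the group is ready.

    Assumption: preparation signals are level signals modelled as bools,
    and the unit combines them with a plain logical AND.
    """
    ready_signals = list(ready_signals)
    if not ready_signals:
        raise ValueError("a processing group has at least one processor")
    return all(ready_signals)


# Preparation signals a11, a12, a13 of a three-processor group 10a1.
a1_inputs = [True, True, False]
print(group_preparation_signal(a1_inputs))  # prints False

a1_inputs[2] = True  # the last processor becomes idle
print(group_preparation_signal(a1_inputs))  # prints True
```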
  • the group preparation signal generating unit 21 may be directly connected to each processor 101 in the processing group 10 through independent dedicated hardware lines (first communication hardware lines), and the synchronization signal generating unit 22 may be directly connected to the respective processors 101 in the processing group 10 through independent dedicated hardware lines.
  • each processor 101 in the processing group 10 can send and receive signals through a dedicated hardware line, thereby reducing interference between signals and improving the efficiency of the processor 101 in sending and receiving signals.
  • the group preparation signal may be generated by applying logical operations such as "or", "and", or "not", or combinations thereof, to the preparation signals input to the synchronization signal generation module 2.
  • the group preparation signal may be a pulse signal or a level signal.
  • FIG. 5 is a functional block diagram showing the synchronization signal generating unit 22 of the synchronous parallel device S according to the embodiment of the present disclosure.
  • the synchronization signal generation unit 22 and each processor 101 in the processing group 10 can be connected independently of each other, and the synchronization signal generation unit 22 can receive the group preparation signal generated by the group preparation signal generation unit 21 and generate the group synchronization signal according to the group preparation signal.
  • the synchronization signal generating unit 22 may be connected to the group preparation signal generating unit 21. In this way, the synchronization signal generating unit 22 can receive the signals from all the group preparation signal generating units 21.
  • the group preparation signal generating units 21 may be connected to the synchronization signal generating unit 22 independently of each other. In other examples, the group preparation signal generating units 21 may be connected to the synchronization signal generating unit 22 via an internal Fabric.
  • the group preparation signal a1, the group preparation signal a2, ..., the group preparation signal am are sent to the filters (masking units) 221a1 to 221am.
  • the filter is sometimes called a shield.
  • Each filter 221 (filter 221a1, filter 221a2, ..., filter 221am) will receive each group preparation signal such as group preparation signal a1, group preparation signal a2, ..., group preparation signal am.
  • registers (to-be-synchronized group indicating units) 222a1, 222a2, ..., 222am respectively control the corresponding filters (filter 221a1, filter 221a2, ..., filter 221am); according to the computing task allocated by the control module 3, the valid bits of each register determine which group preparation signals are valid for its filter.
  • each filter (filter 221a1, filter 221a2, ..., filter 221am) generates a start signal (quasi-synchronization signal) after receiving the valid group preparation signals and sends it to the corresponding synchronization signal generator (group synchronization signal generating unit), that is, synchronization signal generator 223a1, synchronization signal generator 223a2, ..., synchronization signal generator 223am; each synchronization signal generator then sends the group synchronization signal to each processor of its group.
  • FIG. 6 is a functional block diagram showing a synchronization signal generating unit 22 of the synchronization parallel device S according to the embodiment of the present disclosure.
  • the synchronization signal generating unit 22 may further include a filter 221 for receiving and filtering the group preparation signal and a register 222 for controlling whether the group preparation signal received by the filter 221 is valid.
  • the filter 221 determines whether to generate a start signal according to the group preparation signal and the state of the register 222.
  • the register 222 can filter the received group preparation signal by controlling the filter 221, and thus, the synchronization signal generating unit 22 can determine whether to generate a group synchronization signal through the filter 221.
  • the group preparation signal generated by the group preparation signal generating unit 21 may be sent to each filter 221.
  • the filter 221 may filter the received group preparation signals and generate a start signal when the received signal is the one sent by the corresponding group preparation signal generating unit 21, so that the filters 221 can screen the group preparation signals independently of each other and generate their start signals independently.
  • the filter 221 may correspond to a plurality of group preparation signal generating units 21, and the start signal is generated when the filter 221 receives the corresponding plurality of group preparation signals.
  • the corresponding group preparation signal generation units 21 are connected to the group synchronization signal generation unit 223 independently of each other.
  • the start signal is generated by applying logical operations such as "or", "and", or "not", or combinations thereof, to the group preparation signals input to the filter 221.
  • the start signal may be a pulse signal or a level signal.
  • the group preparation signal a1, the group preparation signal a2, ..., the group preparation signal am are each sent to the filter 221a1.
  • according to the setting of the register 222a1 (for example, determined by the control module 3 according to the calculation task), the filter 221a1 generates a start signal after receiving a valid group preparation signal a1 and sends it to the corresponding synchronization signal generator 223a1, which then sends the group synchronization signal to each processor in the group.
  • FIG. 7 is a functional block diagram showing the register 222 of the synchronous parallel device S according to the embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram showing part of the signals of the synchronous parallel device S involved in the embodiment of the present disclosure.
  • the register 222 may have a flag bit corresponding to at least the group preparation signal generating unit 21.
  • when a flag bit is valid, the filter 221 accepts the group preparation signal generated by the corresponding group preparation signal generating unit 21. In this way, the group preparation signals that the filter 221 accepts can be selected as necessary.
  • the register 222 may have a flag bit corresponding to the group preparation signal generating unit 21.
  • the register 222 has m flag bits (flag bit t1, flag bit t2, ..., flag bit tm): flag bit t1 corresponds to the group preparation signal a1, flag bit t2 corresponds to the group preparation signal a2, ..., and flag bit tm corresponds to the group preparation signal am.
  • the flag bit t1 and the flag bit t2 of the register 222a1 are valid flag bits.
  • for the filter 221a1 controlled by the register 222a1 to generate a start signal, it needs to receive the group preparation signal a1 of the group preparation signal generating unit 21a1 and the group preparation signal a2 of the group preparation signal generating unit 21a2.
  • the flag bit of the register 222 is set to control whether the group preparation signal received by the filter 221 is valid, and then the time when the group synchronization signal is generated is controlled.
  • when a plurality of flag bits of the register 222 are set to be valid, the filter 221 generates a start signal only after receiving the group preparation signals of the plurality of group preparation signal generating units 21 corresponding to those flag bits. In this case, the filter 221 acts as a barrier mechanism in the generation of the group synchronization signal. The filter 221 can therefore be controlled by setting the flag bits of the register 222, which in turn controls when the group synchronization signal is generated.
  • for example, when the flag bit t1 and the flag bit t2 of the register 222a1 are valid flag bits, and the flag bit t1 and the flag bit t2 of the register 222a2 are also set as valid flag bits, the filter 221a1 and the filter 221a2 both need to receive the group preparation signal a1 from the group preparation signal generating unit 21a1 and the group preparation signal a2 from the group preparation signal generating unit 21a2 before they can generate a start signal.
  • multiple filters 221 can thus generate start signals at the same time, so that multiple group synchronization signals are generated at the same time and multiple processing groups 10 enter the non-idle state at the same time.
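  • the cross-group barrier described above can be sketched as follows. The register contents, the dictionary layout, and the rule that a register with no valid flag bit never produces a start signal are assumptions made for this example; the text does not specify that case.

```python
def start_signal(group_ready, flag_bits):
    """Start signal of one filter 221.

    group_ready: list of bools for group preparation signals a1..am.
    flag_bits:   list of bools for flag bits t1..tm of the register.
    Only signals whose flag bit is valid are considered; all of them
    must be asserted before the start signal is produced.
    """
    if len(group_ready) != len(flag_bits):
        raise ValueError("one flag bit is needed per group preparation signal")
    if not any(flag_bits):
        return False  # assumption: no valid flag bit means no start signal
    return all(r for r, f in zip(group_ready, flag_bits) if f)


# Hypothetical settings: registers 222a1 and 222a2 both mark t1 and t2
# as valid, so groups 1 and 2 form one barrier; 222a3 only watches t3.
registers = {
    "222a1": [True, True, False],
    "222a2": [True, True, False],
    "222a3": [False, False, True],
}


def start_signals(group_ready):
    return {name: start_signal(group_ready, flags)
            for name, flags in registers.items()}


print(start_signals([True, False, True]))  # only the third filter fires
print(start_signals([True, True, True]))   # all three filters fire
```

  • because the first two registers hold the same flag bits, their filters always produce the start signal under exactly the same condition, which is what lets the two processing groups leave the idle state together.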
  • the register 222 may be one or more selected from multi-function registers, pointer registers, index registers, special registers, segment registers, control registers, debug registers, task registers, floating point registers, multimedia registers, and single-instruction multiple-data (SIMD) registers.
  • the register 222 may be a control register.
  • the start signal is produced by applying logical operations such as "or", "and", or "not", or combinations thereof, to the group preparation signals input to the filter 221 and the flag bit signals of the register 222.
  • the flag signal may be a pulse signal or a level signal.
  • the group preparation signal a1, the group preparation signal a2, ..., the group preparation signal am are sent to the filter 221a1; when two flag bits in the register 222a1 (assumed to be those corresponding to the group preparation signals a1 and a2) are valid, the filter 221a1 masks the group preparation signals other than a1 and a2 and generates the start signal a1 after receiving the group preparation signal a1 and the group preparation signal a2.
  • the synchronization signal generator 223a1 sends the group synchronization signal a1 to each processor 101.
  • FIG. 8 shows how the input and output level signals of the example in FIG. 7 change over time.
  • the filter 221a1 receives the group preparation signal a1, the group preparation signal a2, ..., the group preparation signal am, and these group preparation signals respectively correspond to the flag bits in the register 222a1 (the flag bit t1, the flag bit t2, ..., the flag bit tm).
  • the filter 221a1 generates the start signal a1 according to whether the flag bit of the register 222a1 is valid.
  • the filter 221a1 accepts only the group preparation signal a1 and the group preparation signal a2 and masks the other group preparation signals.
  • the filter 221a1 generates a start signal a1 based on the group preparation signal a1 and the group preparation signal a2.
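  • the masking behaviour and its timing can be sketched with bit masks. In this illustrative model, bit 0 stands for a1/t1, bit 1 for a2/t2, and bit 2 for a3/t3; the sample trace and the handling of an all-zero register are assumptions, not values taken from FIG. 8.

```python
def filter_output(group_ready_bits, flag_bits):
    """Start signal of a filter, using integers as bit vectors.

    The start signal is asserted when every group preparation signal
    selected by a valid flag bit is asserted; unselected signals are
    masked out. Assumption: an all-zero register never fires.
    """
    return flag_bits != 0 and (group_ready_bits & flag_bits) == flag_bits


FLAGS_222A1 = 0b011  # t1 and t2 valid, t3 masked

# One sample of (a3, a2, a1) per time step, as level signals.
trace = [0b100, 0b101, 0b101, 0b111]
start_a1 = [filter_output(sample, FLAGS_222A1) for sample in trace]
print(start_a1)  # prints [False, False, False, True]
```

  • the masked signal a3 is high from the first sample but has no effect; the start signal only goes high in the last sample, when both a1 and a2 are high.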
  • the group synchronization signal generated by the synchronization signal generation unit 22 may be sent to each processor 101 in the processing group 10 at the same time. As a result, it can be ensured that each processor 101 in the processing group 10 starts calculations at the same time.
  • the synchronization signal generating unit 22 may further include a synchronization signal generator 223 connected to the filter 221 and receiving a start signal.
  • the synchronization signal generator 223 generates a group synchronization signal according to the start signal and sends it to each processor 101 in the processing group 10; the synchronization signal generator 223 and each processor 101 in the processing group 10 are connected independently of each other, for example, through second communication hardware lines.
  • the synchronization signal generator 223 is independently connected only to each processor 101 in the corresponding processing group 10, thereby ensuring that the generated group synchronization signal is accurately sent to each of the processors 101 in the corresponding processing group 10 and increasing the transmission rate of the group synchronization signal.
  • each filter 221 is connected to a synchronization signal generator 223. Further, since the filter 221 corresponds to the processing group 10, the synchronization signal generator 223 corresponding to the filter 221 corresponds to the processing group 10.
  • the synchronization signal generator 223 may be connected to each processor 101 in the corresponding processing group 10 independently of each other. In this case, the group synchronization signal can be directly sent to each processor 101 in the processing group 10 corresponding to the synchronization signal generator 223, thereby further reducing the signal transmission delay.
  • the group synchronization signal is generated by applying logical operations such as "or", "and", or "not", or combinations thereof, to the start signal input to the synchronization signal generator 223.
  • the group synchronization signal may be a pulse signal or a level signal.
  • the flag bit of the register 222 may be controlled by the control module 3. In this way, the flag bit of the register 222 can be controlled by the control module 3 to control the state of the register 222.
  • control module 3 may communicate with the processing module 1 and the synchronization signal generation module 2 through the Fabric. In this case, the control module 3 can exchange data with the processing module 1 and the synchronization signal generating module 2 through the Fabric. In some examples, the processor 101 in the processing module 1 may send the calculation result through the Fabric after the calculation is completed.
  • In other examples, the control module 3 may also include a wireless communication unit. In this case, the control module 3 can send and receive data through wireless signals.
  • In some examples, the control module 3 may be a top-level micro control unit (MCU) of the chip, a main control circuit (Host) located outside the chip, another chip, or another program application (Server).
  • FIG. 9 is a flowchart showing a synchronous parallel method related to an embodiment of the present disclosure.
  • As shown in FIG. 9, the synchronous parallel control method of the synchronous parallel device S includes the following steps: when a first node in a node group enters an idle state, the first node sends a preparation signal to the group preparation signal generating unit 21 to which it belongs (step S100); in response to all nodes in the node group having sent a preparation signal, the group preparation signal generating unit 21 generates a signal to be started (step S200); the synchronization signal generating unit 22 generates a synchronization signal according to the signal to be started (step S300); and all nodes in the node group start synchronization in response to the received synchronization signal (step S400).
  • In the synchronous parallel control method of the present disclosure, each processor (node) 101 in a processing group (node group) 10, on entering the idle state, independently sends a preparation signal to the group preparation signal generating unit 21 in the synchronization signal generation module 2 that corresponds to the processing group 10. Once the group preparation signal generating unit 21 has received the preparation signals of all processors 101 in its processing group 10, it generates a group preparation signal. The synchronization signal generating unit 22 receives the group preparation signal, generates the corresponding group synchronization signal, and sends it to all processors 101 in the processing group 10 so that they start synchronization. The synchronization signal generating unit 22 can therefore process multiple group preparation signals at the same time, and the group preparation signal generating unit 21 and the synchronization signal generating unit 22 can transmit signals independently.
  • In step S100, when a processor 101 in the processing group 10 enters the idle state, that processor sends a preparation signal. For how the processing group 10 and its processors 101 are configured, refer to the description of the processing module 1 above; it is not repeated here.
  • In some examples, a processor 101 performs calculation or transmission when it starts synchronization and enters the idle state after completing it. In this case, a processor 101 that has finished its calculation and transmission waits for the other processors 101 that are still calculating or transmitting, so that all processors 101 start synchronization together when the next group synchronization signal arrives.
  • In step S200, the group preparation signal generating unit 21 receives the preparation signals sent by the processors 101 in the processing group 10, and generates a group preparation signal once it has received the preparation signals from all processors 101 in that processing group. For how the group preparation signal generating unit 21 is configured, refer to its description above; it is not repeated here.
  • In steps S300 and S400, the synchronization signal generating unit 22 receives the group preparation signal, generates the group synchronization signal according to it, and sends the group synchronization signal to each processor 101 in the processing group 10. Each processor 101 in the processing group 10 enters synchronization on receiving the group synchronization signal. For how the synchronization signal generating unit 22 is configured, refer to its description above; it is not repeated here.
  • In some examples, computing tasks are assigned to the processors 101 in the processing group 10, and those processors perform the computation. Each processor 101 in the processing group 10 can thus complete its computation locally.
  • In some examples, when the control module 3 controls the register so that a flag bit of the register is valid, the filter that has this register accepts the group preparation signal generated by the group preparation signal generating unit corresponding to that flag bit. The validity of the group preparation signals accepted by the filter can therefore be controlled by controlling the register.
  • In addition, in some examples, the present disclosure provides a computing device that includes a processor and a memory; the processor executes computer instructions stored in the memory so that the computing device performs the synchronous parallel control method described in the present disclosure.
  • In some examples, the present disclosure also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, it implements the steps of the synchronous parallel control method described in the present disclosure.
  • In some examples, the present disclosure also provides a computer program product that includes computer instructions; when the computer instructions are executed by a computing device, the computing device can perform the synchronous parallel control method described in the present disclosure.
  • In the examples provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device examples described above are only illustrative: the division into units is only a division by logical function, and an actual implementation may divide them differently. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.
  • The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution in this example.
  • In addition, the functional units in the examples of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
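Steps S100 to S400 described above can be illustrated with a small software model. The following is only a sketch: the disclosure describes dedicated hardware lines, whereas the class name `GroupPrepUnit`, the function `synchronize`, and the node labels used here are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GroupPrepUnit:
    """Hypothetical software model of one group preparation signal generating unit 21."""
    node_ids: frozenset
    ready: set = field(default_factory=set)

    def receive_preparation(self, node_id):
        # S100: an idle node reports on its own dedicated line.
        if node_id not in self.node_ids:
            raise ValueError(f"node {node_id} is not in this group")
        self.ready.add(node_id)

    def signal_to_be_started(self):
        # S200: asserted only once every node in the group has reported.
        return self.ready == set(self.node_ids)


def synchronize(unit):
    """S300/S400: return the nodes released by the synchronization signal."""
    if not unit.signal_to_be_started():
        return []            # no synchronization signal is generated yet
    released = sorted(unit.ready)
    unit.ready.clear()       # the nodes leave the idle state
    return released


unit = GroupPrepUnit(frozenset({"101a11", "101a12", "101a13"}))
unit.receive_preparation("101a11")
unit.receive_preparation("101a12")
first_try = synchronize(unit)      # one node is still busy
unit.receive_preparation("101a13")
second_try = synchronize(unit)     # all nodes are idle, the group is released
print(first_try, second_try)
```

The first call releases nothing because one node has not yet reported; the second call releases the whole group at once, which is the barrier behaviour the method relies on.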

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multi Processors (AREA)

Abstract

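A synchronization signal generating circuit for a multi-core architecture, a chip, and a synchronization method and device. The synchronization signal generating circuit generates synchronization signals for M node groups, each node group including at least one node, where M is an integer greater than or equal to 1. The circuit includes a synchronization signal generating unit (22) and M group preparation signal generating units (21a1-21am), which correspond one-to-one to the M node groups. A first group preparation signal generating unit (21a1) among the M group preparation signal generating units (21a1-21am) is connected to K nodes in a first node group to be synchronized and generates a first signal to be started (a1) for that node group, where K is an integer greater than or equal to 1. The outputs of the M group preparation signal generating units (21a1-21am) are connected to the synchronization signal generating unit (22), which generates a first synchronization signal (a1) according to the first signal to be started (a1); the first synchronization signal (a1) instructs the K nodes in the first node group to start synchronization.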

Description

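Synchronization signal generating circuit, chip, and synchronization method and device for a multi-core architecture
Technical Field
The present invention relates to a synchronization signal generating circuit based on a multi-core architecture, a chip, and related methods and devices.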
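Background
With the development of science and technology, human society is rapidly entering the intelligent era. An important feature of this era is that people obtain more and more kinds of data in ever larger volumes, while demanding ever higher processing speed. Chips are the cornerstone of data processing and fundamentally determine the ability to process data. To improve the data processing capability of chips, besides raising chip processing speed and building customized special-purpose chips, parallel computing models across chips are also a current research focus.
Among existing parallel computing models, the BSP model (Bulk Synchronous Parallel Computing Model) is an overall synchronous parallel computing model. It can be used in system-level applications, for example a computer cluster of multiple servers performing parallel computation, and in chip-level applications, for example a multi-core (many-core) chip performing parallel computation. The BSP model effectively avoids deadlock, hides the specific interconnection network topology, and simplifies the communication protocol. Its barrier synchronization allows global synchronization to be implemented in hardware at a controllable coarse granularity, which provides an effective way to execute tightly coupled synchronous parallel algorithms.
However, in current BSP computing models, each node must send messages to the controller over a bus or network-on-chip, which incurs considerable delay. When there are many nodes, this significantly increases the load on the bus or network-on-chip, and the controller must process the message sent by every node. Signal transfer between the controller and the nodes is therefore inefficient and slow, which increases the error of the overall system. To reduce this error, the software controlling each node is often made more complex, which lowers overall system efficiency further.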
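Summary of the Invention
The present invention has been made in view of the above state of the prior art, and its object is to provide a synchronous parallel device based on a multi-core architecture, and a control method thereof, that can reduce overall delay, improve signal transfer efficiency, and reduce the transmission burden.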
为此,本公开的第1方面提供了一种同步信号产生电路,其特征在于:所述同步信号产生电路用于为M个节点组产生同步信号,所述节点组中包括至少一个节点,所述M为大于等于1的整数;所述同步信号产生电路包括:同步信号生成单元和M个组准备信号生成单元;所述M个组准备信号生成单元与所述M个节点组一一对应;所述M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,所述K为大于等于1的整数;所述M个组准备信号生成单元的输出端与所述同步信号生成单元相连接;所述同步信号生成单元根据所述第一待启动信号生成第一同步信号,所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步。
在本公开第1方面所涉及的同步信号产生电路中,M个组准备信号生成单元与M个节点组一一对应,并且第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接,第一组准备信号生成单元用于为待同步的第一节点组生成第一待启动信号,同步信号生成单元根据第一待启动信号生成第一同步信号,第一节点组内的K个节点开始同步,由此,同步信号生成单元能够同时对多个待启动信号进行处理,以及M个组准备信号生成单元和同步信号生成单元能够独立地进行信号的传递。
另外,在本公开的第1方面所涉及的同步信号产生电路中,可选地,所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,包括:所述第一准备信号生成单元用于根据所述待同步的第一节点组中的全部K个节点的准备信号生成所述第一待启动信号。在这种情况下,能够根据第一待启动信号得知第一节点组中的全部K个节点是否全部进入空闲状态,由此,能够确保第一节点组中的全部K个节点能够同步并行运算。
另外,在本公开的第1方面所涉及的同步信号产生电路中,可选地,所述同步信号生成单元包括:M个屏蔽单元、M个待同步组指示 单元和M个组同步信号生成单元;所述M个待同步组指示单元分别与所述M个屏蔽单元相连接;所述M个屏蔽单元中的每个屏蔽单元的输入端与所述M个组准备信号生成单元的输出端相连接;所述M个屏蔽单元的输出端分别与所述M个组同步信号生成单元中对应的组同步信号生成单元相连接;所述M个屏蔽单元中的第一屏蔽单元根据连接在其上的第一待同步组指示单元的指示,输出第一组的准同步信号;所述M个组同步信号产生单元中的第一组同步信号生成单元根据所述第一组的准同步信号生成所述第一组的同步信号。在这种情况下,第一屏蔽单元根据连接在其上的第一待同步组指示单元的指示,屏蔽其它组的待启动信号并输出第一组的准同步信号,第一组同步信号生成单元根据第一组的准同步信号生成第一组的同步信号,由此,能够通过第一屏蔽单元判断是否生成第一组的同步信号。
另外,在本公开的第1方面所涉及的同步信号产生电路中,可选地,所述待同步组指示单元包括寄存器;所述寄存器包括至少M个寄存器位,所述M个寄存器位与所述M个节点组一一对应,所述M个寄存器位中与所述待同步的第一节点组对应的寄存器位被配置为第一值,所述M个寄存器位中与所述M个节点组中除所述待同步的第一节点组之外的节点组对应的寄存器位被配置为第二值。由此,能够根据需要选择配置为第一值的寄存器位。
另外,在本公开的第1方面所涉及的同步信号产生电路中,可选地,所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步,包括:所述第一同步信号用于指示所述第一节点组内的所述K个节点同时开始计算,或同时开始传输数据。由此,能够确第一节点组内的所述K个节点同时开始计算或传输数据。
此外,本公开的第2方面提供了一种芯片,包括上述第1方面的同步信号产生电路,以及N个处理节点,所述N个处理节点被分为M个处理节点组,其中,所述N为大于1的整数,M小于等于N。
在本公开第2方面所涉及的芯片中,N个处理节点被分为M个处理节点组,由此,同步信号产生电路能够对M个处理节点组一一对应,并能够对M个处理节点组进行信号的接收与发送。
另外,在本公开的第2方面所涉及的芯片中,可选地,还包括N 个第一通信硬件线路,所述N个第一通信硬件线路用于传输从所述N个处理节点发送至对应的组准备信号生成单元的准备信号。在这种情况下,组准备信号生成单元与对应的处理节点组中的各个处理节点分别独立地进行连接,由此,能够确保各个处理节点生成的准备信号准确地发送至对应的组准备信号生成单元,且提高了准备信号的传输速率。
另外,在本公开的第2方面所涉及的芯片中,可选地,还包括N个第二通信硬件线路,所述N个第二通信硬件线路用于传输从同步信号生成单元发送至对应处理节点的同步信号。在这种情况下,同步信号生成单元与对应的处理节点组中的各个处理节点分别独立地进行连接,由此,能够确保同步信号生成单元生成的同步信号准确地发送至对应的处理节点组中的各个处理节点,且提高了同步信号的传输速率。
另外,在本公开的第2方面所涉及的芯片中,可选地,还包括控制单元,所述控制单元用于改变寄存器的设置。由此,能够通过控制单元控制寄存器的寄存器位的值从而控制屏蔽单元。
另外,在本公开的第2方面所涉及的芯片中,可选地,还包括控制单元,所述控制单元用于控制处理节点组中的各个处理节点的任务的执行和分配。由此,能够通过控制单元对处理节点组中的各个处理节点的任务进行分配。
另外,在本公开的第2方面所涉及的芯片中,可选地,所述N个处理节点包括RISC-V核。由此,能够基于通用RISC-V基础指令和扩展指令对芯片进行灵活的编程。
此外,本公开的第3方面提供了一种基于多核架构的同步并行控制方法,其是基于上述第1方面任一项所述的同步并行装置的同步并行控制方法,包括:当所述节点组中的第一节点进入空闲状态时,所述第一节点向所述第一节点所在的组准备信号生成单元发送准备信号;响应于所述节点组中所有节点均发送了准备信号,所述组准备信号生成单元生成待启动信号;同步信号生成单元根据所述待启动信号生成同步信号;所述节点组中的所有节点响应于接收到的所述同步信号,开始同步。
在本公开的第3方面所涉及的同步并行控制方法中,节点组中的 第一节点在进入空闲状态时向第一节点所在的组准备信号生成单元发送准备信号,当组准备信号生成单元接收到与之对应的节点组中全部节点的准备信号后生成待启动信号,再由同步信号生成单元生成对应的同步信号,节点组中的所有节点响应于接收到的同步信号,使得所有节点开始同步,由此,同步信号生成单元能够同时对多个待启动信号处理,以及组准备信号生成单元和同步信号生成单元能够独立地进行信号的传递。
此外,本公开的第4方面提供了一种基于多核架构的同步并行装置,其特征在于,包括:处理模块,其具有N个处理节点,所述N个处理节点被分为M个处理节点组,其中,所述N为大于1的整数,M小于等于N;以及同步信号发生模块,其包括同步信号生成单元和M个组准备信号生成单元;所述M个组准备信号生成单元与所述M个节点组一一对应;所述M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,所述K为大于等于1的整数;所述M个组准备信号生成单元的输出端与所述同步信号生成单元相连接;所述同步信号生成单元根据所述第一待启动信号生成第一同步信号,所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步。
在本公开的第4方面所涉及的同步并行装置中,M个组准备信号生成单元与M个节点组一一对应,并且第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接,第一组准备信号生成单元用于为待同步的第一节点组生成第一待启动信号,同步信号生成单元根据第一待启动信号生成第一同步信号,第一节点组内的K个节点开始同步,由此,同步信号生成单元能够同时对多个待启动信号进行处理,以及M个组准备信号生成单元和同步信号生成单元能够独立地进行信号的传递。
另外,在本公开的第4方面所涉及的同步并行装置中,可选地,所述处理节点为处理电路、处理芯片和服务器中的至少一种。由此,能够根据实际应用情况选择。
此外,本公开的第5方面提供了一种计算设备,其包括处理器和 存储器,所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行上述所描述的同步并行控制方法。
另外,本公开的第6方面提供了一种计算机可读存储介质,其存储有计算机程序,并且当所述计算机程序被处理器执行时实现上述所描述的同步并行控制方法的步骤。
此外,本公开的第7方面提供了一种计算机程序产品,其包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行上述所描述的同步并行控制方法。
根据本公开,能够提供一种能够降低整体延迟,提高信号传递效率,减小总线负担的基于多核架构的同步并行装置、方法及系统。
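Brief Description of the Drawings
Embodiments of the present disclosure will now be explained in further detail, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram showing a data processing scenario of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 2 is a functional block diagram showing the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 3 is a functional block diagram showing the synchronization signal generation module of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 4 is a functional block diagram showing one group preparation signal generating unit of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 5 is a functional block diagram showing the synchronization signal generating unit of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 6 is a functional block diagram showing one synchronization signal generating unit of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 7 is a functional block diagram showing the register of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 8 is a schematic diagram showing some of the signals of the synchronous parallel device according to an embodiment of the present disclosure.
FIG. 9 is a flowchart showing the synchronous parallel method according to an embodiment of the present disclosure.
Description of reference numerals:
1…processing module, 10…processing group, 101…processor, 2…synchronization signal generation module, 21…group preparation signal generating unit, 22…synchronization signal generating unit, 221…filter, 222…register, 223…synchronization signal generator, 3…control module.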
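Detailed Description
In the following, the present invention is described in further detail with reference to the accompanying drawings and specific embodiments. In the drawings, identical components or components having the same function are given the same reference signs, and repeated description of them is omitted.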
图1是示出了本公开的实施方式所涉及的同步并行装置S的用于处理数据应用场景示意图。
本公开的实施方式所涉及的基于多核架构的同步并行装置S可以包括处理模块1、同步信号发生模块2和控制模块(有时也称“控制单元”)3。如图1所示,控制模块3获取需要处理的计算任务(例如海量图片处理、大数据处理等)。接着,控制模块3可以将计算任务分配至各个处理节点(例如:服务器、处理器101或处理芯片等),并对同步信号发生模块2进行配置,当芯片上的处理节点接收到来自控制模块的计算任务后根据同步信号发生模块的同步信号,处理各自的任务,并在处理结束时生成各自的准备信号。此外,同步信号发生模块2可以接收各个处理节点所发出的准备信号,接收到准备信号后,根据同步信号发生模块2的配置情况,同步信号发生模块2生成同步信号,并发送至处理节点,使得处理节点在接收到同步信号后开始处理下一个计算任务。
在同步并行装置S中,处理模块1可以包括至少一个处理组(节点组)10,处理组10可以具有多个处理器(节点)101(有时也称“核”),处理器101可以具有空闲状态和非空闲状态,并且在处理器101进入空闲状态时生成准备信号(参见稍后描述的图2)。另外,同步信号发生模块2可以包括组准备信号生成单元21和同步信号生成单元22,其中,组准备信号生成单元21与处理组10中的各个处理器101可以彼此独立地硬件连接,并且可以基于由该处理组10中的各个处理器101 所生成的准备信号产生组准备信号(待启动信号),同步信号生成单元22与处理组10中的各个处理器101可以彼此独立地进行连接,并且能够接收由组准备信号生成单元21产生的组准备信号并根据组准备信号生成组同步信号(同步信号),其中,当处理组10中各个处理器101接收组同步信号时该处理组10中的各个处理器101进入非空闲状态,所述非空闲状态包括计算状态和传输状态,对于BSP同步来说,非空闲状态可以先进行计算再进行传输,或者先传输再计算,计算和传输在一个同步周期内完成。
空闲状态是指处理模块中的处理组中的处理节点在计算完所接收到的计算任务并将处理结果传输至下一处理核后,没有任务可以执行时的状态。在一些示例中,当处理器101处于空闲状态时可以发出低电平信号,而当处理器101处于非空闲时发出高电平电信(参见稍后描述的图8)。
在本公开的实施方式所涉及的同步并行装置S中,处理组10中的处理器101在进入空闲状态时彼此独立地向同步信号发生模块2中与该处理组10对应的组准备信号生成单元21发送准备信号,当组准备信号生成单元21接收到与之对应的处理组10中全部处理器101的准备信号后生成组准备信号,再由同步信号生成单元22接收组准备信号并生成对应的组同步信号,最后发送至该处理组10中的全部处理器101,使得处理器101进入非空闲状态,由此,同步信号生成单元22能够同时对多个组准备信号处理,以及组准备信号单元和同步信号生成单元22能够独立地与同步并行装置S进行信号的传递。
另外,如上所述,同步并行装置S还可以包括用于分配计算任务的控制模块3。控制模块3可以与处理模块1(稍后描述)和同步信号发生模块2(稍后描述)通信连接,并且控制模块3与寄存器222(稍后描述)通信连接,寄存器222的状态由控制模块3控制。由此,控制模块3用于对处理模块1分配计算任务,并且对寄存器222的状态进行控制。
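In the synchronous parallel device S, the control module can work out from the estimated amount of computation how many processing groups need to take part in processing an obtained computing task, and then, according to the computing capability of each processor in each processing group, divide the computing task into a corresponding number of parts and send them to the corresponding processors. This ensures that the processors in the participating processing groups spend roughly the same time in the non-idle state, so that the computing resources of the processing module are used effectively and waste of computing resources is reduced.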
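One possible way to divide a task according to processor capability is a proportional split. The sketch below is an assumption for illustration only: the disclosure does not specify the splitting rule, and the function name `split_by_capacity` and the use of integer capacity weights are hypothetical.

```python
def split_by_capacity(total_items, capacities):
    """Split total_items across processors in proportion to integer capacities."""
    if total_items < 0 or not capacities or any(c <= 0 for c in capacities):
        raise ValueError("need a non-negative total and positive capacities")
    total_cap = sum(capacities)
    shares = [total_items * c // total_cap for c in capacities]
    # Hand out the rounding remainder one item at a time, largest capacity first.
    leftover = total_items - sum(shares)
    order = sorted(range(len(capacities)), key=lambda i: -capacities[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares


shares = split_by_capacity(1000, [1, 1, 2])
print(shares)  # [250, 250, 500]
```

With shares proportional to capability, each processor needs roughly the same time per superstep, so fewer processors sit idle waiting for the group synchronization signal.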
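In some examples, the control module 3 can read and write the register 222 through a programming language, thereby changing the validity of the flag bit (sometimes also called a "register bit") in the register 222 that corresponds to a processing group. The register 222 can thus be controlled by the control module 3 (described later). As examples, the programming language may be C, assembly language, Verilog, or another hardware language.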
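The register access described here amounts to a read-modify-write on individual flag bits. The following sketch models that in Python rather than in C or Verilog; the class `FlagRegister`, its method names, and the bit layout (bit i-1 holds the flag for group i) are assumptions made for illustration and are not taken from the disclosure.

```python
class FlagRegister:
    """Hypothetical model of an m-bit register 222; bit i-1 is the flag for group i."""

    def __init__(self, width):
        self.width = width
        self.value = 0

    def _bit(self, group):
        if not 1 <= group <= self.width:
            raise ValueError(f"group must be in 1..{self.width}")
        return 1 << (group - 1)

    def set_valid(self, group):
        self.value |= self._bit(group)     # read-modify-write: set the bit

    def set_invalid(self, group):
        self.value &= ~self._bit(group)    # read-modify-write: clear the bit

    def is_valid(self, group):
        return bool(self.value & self._bit(group))


reg = FlagRegister(width=4)
reg.set_valid(1)
reg.set_valid(2)
reg.set_invalid(2)
print(format(reg.value, "04b"))  # 0001
```

Only the addressed bit changes on each write, so the control module can enable or disable one processing group without disturbing the flags of the others.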
如上所述,在传统的BSP运算中,各处理节点要通过Fabirc,例如:总线或者片上网络发送消息给控制器,具有较大的延时,当处理节点数量较多的时候,会显著加大总线或者片上网络的负载,且控制器需要对每一个节点发送的消息进行处理,因此,控制器与节点之间的信号传递效率较低,延时较大,增大了整体系统的误差。在传统的BSP运算中,为了降低这种误差,需要增加控制各节点的软件的复杂性,导致整体系统的效率进一步降低。而且,空闲状态的处理器101要将准备信号通过总线或片上网络发送至控制器,当该处理组中的各个处理器101同时将准备信号发送至控制器时,会增大总线或者片上网络的负载,带来较大延时,且控制器需要单独对各个处理器101的准备信号进行处理,进一步加大了延时。
在本公开的实施方式所涉及的同步并行装置S中,通过增加同步信号发生模块2,使得同步信号的生成发生在同步信号发生模块2内,同步信号发生模块2通过专有且独立的线路(如通信硬件线路)与各个处理器101直接连接,无需经由或占用数据传输的总线或者片上网络。由此,能够提高处理器101之间同步的信号传输效率,降低总线或片上网络的负载和芯片的功耗。此外,同步信号发生模块2通过设置有组准备信号生成单元21和同步信号生成单元22,使得同步信号发生模块2能够同时对所有的处理组10中的各个处理器101的准备信号进行处理并准确地向需要同步计算的处理组发送组同步信号(稍后描述)。
在一些示例中,处理节点可以为处理电路、处理器、处理芯片和服务器中的至少一种。由此,能够根据实际应用情况选择不同的处理 方式。在一些示例中,处理器101可以为中央处理单元(CPU)、数字信号处理器(DSP)、张量处理单元(TPU)、图像处理单元(GPU)、片上可编程逻辑阵列(FPGA)或专用定制芯片等。在一些示例中,处理芯片可以是集成有多核的处理器101。例如,处理芯片可以为基于RISC-V多核架构的芯片。具体而言,该芯片中的每一个处理器101都是一个扩展的RISC-V核,并且能够支持RISC-V的通用指令集。由此,能够基于通用RISC-V基础指令和扩展指令对芯片进行灵活的编程。另外,在一些示例中,服务器也可以是具有运算能力的本地服务器、云端服务器或者分布在不同的物理位置而通过网络连接的服务器群等。
在一些示例中,同步并行装置S可以包括:处理模块,其具有N个处理节点,N个处理节点被分为M个处理节点组,其中,N为大于1的整数,M小于等于N;以及同步信号发生模块,其包括同步信号生成单元和M个组准备信号生成单元;M个组准备信号生成单元与M个节点组一一对应;M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;第一组准备信号生成单元用于为待同步的第一节点组生成第一待启动信号,K为大于等于1的整数;M个组准备信号生成单元的输出端与同步信号生成单元相连接;同步信号生成单元根据第一待启动信号生成第一同步信号,第一同步信号用于指示第一节点组内的K个节点开始同步。
在一些示例中,同步信号发生模块2可以以同步信号发生电路的形式实现。该同步信号产生电路可以包括:同步信号生成单元和M个组准备信号生成单元;M个组准备信号生成单元与M个节点组一一对应;M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;第一组准备信号生成单元用于为待同步的第一节点组生成第一待启动信号,K为大于等于1的整数;M个组准备信号生成单元的输出端与同步信号生成单元相连接;同步信号生成单元根据第一待启动信号生成第一同步信号,第一同步信号用于指示第一节点组内的K个节点开始同步。
另外,在一些示例中,第一组准备信号生成单元用于为待同步的第一节点组生成第一待启动信号,包括:第一准备信号生成单元用于根据待同步的第一节点组中的全部K个节点的准备信号生成第一待启 动信号。
另外,在一些示例中,同步信号生成单元可以包括:M个屏蔽单元、M个待同步组指示单元和M个组同步信号生成单元;M个待同步组指示单元分别与M个屏蔽单元相连接;M个屏蔽单元中的每个屏蔽单元的输入端与M个组准备信号生成单元的输出端相连接;M个屏蔽单元的输出端分别与M个组同步信号生成单元中对应的组同步信号生成单元相连接;M个屏蔽单元中的第一屏蔽单元根据连接在其上的第一待同步组指示单元的指示,输出第一组的准同步信号;M个组同步信号产生单元中的第一组同步信号生成单元根据第一组的准同步信号生成第一组的同步信号。
另外,在一些示例中,待同步组指示单元可以包括寄存器;寄存器可以包括至少M个寄存器位,M个寄存器位与M个节点组一一对应,M个寄存器位中与待同步的第一节点组对应的寄存器位可以被配置为第一值,M个寄存器位中与M个节点组中除待同步的第一节点组之外的节点组对应的寄存器位被配置为第二值。
另外,在一些示例中,第一同步信号可以用于指示第一节点组内的K个节点开始同步,包括:第一同步信号用于指示第一节点组内的K个节点同时开始计算,或同时开始传输数据。
此外,在一些示例中,当处理模块1为芯片时,可选地,控制模块3可以集成于芯片内。在这种情况下,芯片可以包括多个核和控制模块3。具体而言,该芯片可以包括上述描述的同步信号产生电路、以及N个处理节点,N个处理节点被分为M个处理节点组,其中,N为大于1的整数,M小于等于N。在一些示例中,N个处理节点包括RISC-V核。
另外,在一些示例中,该芯片还可以包括N个第一通信硬件线路,N个第一通信硬件线路用于传输从N个处理节点发送至对应的组准备信号生成单元的准备信号。另外,在一些示例中,该芯片还可以包括N个第二通信硬件线路,N个第二通信硬件线路用于传输从同步信号生成单元发送至对应处理节点的同步信号。
另外,在一些示例中,该芯片还可以包括控制单元,控制单元用于改变寄存器的设置。在另一些示例中,控制单元可以用于控制处理 节点组中的各个处理节点的任务的执行和分配。
图2是示出了本公开的实施方式所涉及的同步并行装置S的功能框图示意图。如图2所示,如上所述,处理模块1可以包括至少一个处理组10,处理组10可以具有多个处理器101。另外,处理组10中的各个处理器101可以具有空闲状态和非空闲状态,并且在处理器101进入空闲状态时生成准备信号。当处理组10中各个处理器101接收组同步信号时该处理组10中的各个处理器101进入非空闲状态。
一些示例中,处理模块1可以是具有多核架构的芯片,并划分为多个处理组10,各个处理组10中的各个处理器101具有独立的核内内存。在这种情况下,处理器101可以利用自身的核内内存进行数据的临时存取。
可选的,各个处理组10还可以具有本地内存(有时也称“局部内存”)。在这种情况下,处理组10可以利用本地内存完成数据的存取,无需通过Fabric向外部存储器交互数据,因此能够明显地提高处理组的处理效率。
另外,对作为处理模块1的芯片上的多个处理器101进行分组,所划分的处理组的个数是指一个芯片上最大分组数,处理组的个数没有什么特别限制,处理组的个数可以是预先设计好的,例如在芯片设计过程中根据预期应用设计好。另外,在一些示例中,各个处理组中处理器101的个数可以相同。在另一些示例中,各个处理组中处理器101的个数也可以不同。
在一些示例中,每个处理组10可以具有至少一个处理器101。此外,每一个处理组10中的处理器101的个数可以是预先设定的。
在一些示例中,作为处理模块1的芯片可以为基于RISC-V多核架构的芯片。此时,该芯片中的各个处理器101可以为支持RISC-V的通用指令集的扩展RISC-V核。在这种情况下,能够基于通用RISC-V基础指令和扩展指令对芯片进行灵活的编程。
以下,结合图2的具体例子对本实施方式所涉及的处理模块1进行进一步详细描述。
在图2所示的同步并行装置S中,处理模块1包括m个处理组,即处理组10a1、处理组10a2、……、处理组10am。其中,第m个处 理组10am包括n个处理器,即处理器101am1、处理器101am2、……、处理器101amn。下面,以图2中的处理组10a1为例进行具体说明。
如图2所示,在处理组10a1中,各个处理器101接收计算任务(这里是处理器101a11、处理器101a12、……、处理器101a1n)。其中,处理器101a11至处理器101a1n彼此独立地连接至同步信号发生模块2,具体而言,处理器101a11至处理器101a1n可以分别经由独立的硬件线路分别连接至同步信号发生模块2。当同步信号发生模块2接收到处理组10a1中的全部处理器101(处理器101a11、处理器101a12、……、处理器101a1n)发送的准备信号,同步信号发生模块2基于所接收的各个准备信号组同步信号并发送至处理组10a1中的各个处理器(处理器101a11、处理器101a12、……、处理器101a1n)。在收到组同步信号后,处理组10a1中的各个处理器(处理器101a11、处理器101a12、……、处理器101a1n)进入非空闲状态开始处理所分配的计算任务。
在一些示例中,各个处理器(处理器101a11、处理器101a12、……、处理器101a1n)可以与处理组10a1的本地内存进行数据读写并暂时存储处理过程中的临时数据,在计算完毕后,再通过Fabric将计算结果发送至控制模块3。
在一些示例中,本地内存可以被设置为仅用于一个处理组10中的各个处理器101的存储器。由此,能够将处理器101的数据暂时地存放在存储器中,提高处理器101的计算速度。在另一些示例中,处理组10中的处理器101可以彼此独立地向本地内存读写数据。另外,在一些示例中,多个处理组也可以共用一块物理内存。
在一些示例中,本地内存或所使用的物理内存可以为闪速(flash)存储器、硬盘类型存储器、微型多媒体卡型存储器、卡式存储器(例如SD或XD存储器)、随机存取存储器(random access memory,RAM)、静态随机存取存储器(static RAM,SRAM)、只读存储器(read only memory,ROM)、电可擦除可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、可编程只读存储器(programmable ROM,PROM)、回滚保护存储块(replay protected memory block,RPMB)、磁存储器、磁盘或光盘。出于从运算速率的角度考虑,本地内存或所使用的物理内存优选采用随机存取存储器或 静态随机存取存储器。
在另一些示例中,对于处理节点为服务器的情况而言,本地内存也可以是网络上的网络存储设备,在这种情况下处理节点可以对在因特网上的存储器执行存取等操作。
在一些示例中,处理器101完成计算和传输进入空闲状态时可以生成并发送准备信号以表示处理器101正处于空闲状态。在另一些示例中,处理器101可以在处于空闲状态时每隔预定时间间隔生成并发送准备信号。另外,在一些示例中,当处理器101处于空闲状态时可以持续地生成并发送准备信号。
在一些示例中,处理组10中的各个处理器101在接收到组同步信号后可以由空闲状态进入非空闲状态,并开始对获得的计算任务进行处理。
在一些示例中,计算状态之后还包括传输状态。在这种情况下,处理器101在计算完成后可以进入传输状态进行数据的接收与发送,由此,处理器101能够将计算得出的数据进行发送并接收计算任务。在另一些示例中,当处理器101完成计算和传输任务后便进入空闲状态,处于空闲状态的处理器101能够随时接收新的计算任务。在另一些示例中,传输状态也可以设置在计算状态之前。
在一些示例中,各个处理组10中的全部处理器101在接收到组同步信号由空闲状态进入非空闲状态直至下一个组同步信号到来的这段时间有时也被称为“超步”。
在一些示例中,处理器101可以包括一个或多个不同类型的处理器。例如,一个处理器101可以是中央处理单元(CPU)、张量处理单元(TPU)或图形处理单元(GPU)等,也可以是组合了一个中央处理单元(CPU)和一个图形处理器(GPU)或组合了一个张量处理单元(TPU)和一个图形处理单元(GPU)的处理器。另外,处理器101也可以是定制化的芯片,例如是支持RISC-V的通用指令集和扩展指令集的芯片。
图3是示出了本公开的实施方式所涉及的同步并行装置S的同步信号发生模块2的功能框图。如图3所示,在本实施方式中,同步信号发生模块2可以包括组准备信号生成单元21和同步信号生成单元22。 在一些示例中,同步信号发生模块2可以与处理模块1中的所有(全部)处理器101彼此独立地进行连接。换言之,同步信号发生模块2可以通过专有且独立的硬件线路与处理模块1中的全部处理器101进行连接。在这种情况下,各个处理器101与同步信号发生模块2之间的信号传递互不干扰,而且各个处理器101与同步信号发生模块2之间不经过Fabric,由此能够提高整体的运行效率。
在一些示例中,同步信号发生模块2可以是用于产生组同步信号的逻辑电路。例如,在一个例子中,同步信号发生模块2可以由现场可编程逻辑阵列(FPGA)来实现。在一些示例中,组同步信号经由输入同步信号发生模块2的准备信号经过“或(or)”,“与(and)”或者“取反(not)”及其组合等逻辑运算后产生。在一些示例中,组同步信号可以为脉冲信号或电平信号。
以下,结合图3和图4的具体例子对本实施方式所涉及的同步信号发生模块2进行进一步详细描述。
在图3所示的同步信号发生模块2中,同步信号发生模块2包括组准备信号生成单元和同步信号生成单元。首先,各个组准备信号生成单元(组准备信号生成单元21a1、组准备信号生成单元21a2、……、组准备信号生成单元21am)分别用于接收对应处理组(处理组10a1至10am)中的全部处理器101所发出的准备信号,并生成组准备信号并发送至同步信号生成单元22,再由同步信号生成单元22生成组同步信号发送至对应的处理组中的各个处理器。下面,以组准备信号生成单元21a1为例进行说明。
在同步信号发生模块2中,组准备信号生成单元21a1接收来自于全部处理器(处理器101a11、处理器101a12、……、处理器101a1n)的各个准备信号,基于所接收的全部准备信号生成组准备信号,并将组准备信号发送到同步信号生成单元22。同步信号生成单元22基于所接收的各个组准备信号,生成组同步信号并将组同步信号发送至对应的处理组中的各个处理器。如上所述,当同步信号发生模块2将组同步信号发送到各个处理器(例如处理器101a11、处理器101a12、……、处理器101a1n)时,使各个处理器101即处理器101a11、处理器101a12、……、处理器101a1n同步处理所分配的计算任务。
图4是示出了本公开的实施方式所涉及的同步并行装置S的组准备信号生成单元21的功能框图。如图4所示,在本实施方式中,组准备信号生成单元21与处理组10中的各个处理器101(例如处理器101a11、处理器101a12、……、处理器101a1n)可以彼此独立地进行连接,具体而言,处理器101a11至处理器101a1n可以分别经由独立的硬件线路分别连接至组准备信号生成单元21。接着,组准备信号生成单元21可以基于由该处理组10中的各个处理器101所生成的准备信号产生组准备信号。
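In some examples, the group preparation signal generating unit 21 may generate the group preparation signal after receiving the preparation signals of all processors 101 in the processing group 10 (for example processor 101a11, processor 101a12, ..., processor 101a1n). In this case, the group preparation signal indicates whether all processors 101 in the processing group 10 have entered the idle state, which ensures that all processors 101 in the processing group 10 can compute synchronously in parallel.
In some examples, the group preparation signal generating unit 21 may correspond to a processing group 10, with each processor 101 in that processing group connected to the unit independently of the others. Here, correspondence means that one group preparation signal generating unit 21 corresponds to one processing group 10 and that the number of group preparation signal generating units 21 equals the number of processing groups 10. In this case, each group preparation signal generating unit 21 can receive and process the preparation signals of all processors 101 in its processing group 10 at the same time, so the synchronization signal generation module 2 can receive and process signals from all processing groups 10 in the processing module 1 simultaneously.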
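As shown in FIG. 4, taking processing group 10a1 and group preparation signal generating unit 21a1 as an example, the processors in processing group 10a1 (processor 101a11, processor 101a12, ..., processor 101a1n) generate preparation signal a11, preparation signal a12, ..., preparation signal a1n respectively. Each signal is sent over its own independent hardware line to the group preparation signal generating unit 21a1 corresponding to processing group 10a1. After receiving the preparation signals of all processors in processing group 10a1, that is, preparation signals a11, a12, ..., a1n, the group preparation signal generating unit 21a1 generates the group preparation signal a1 to indicate that all processors 101 in processing group 10a1 are in the idle state. At this point, all processors 101 in processing group 10a1 are waiting for synchronization.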
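Functionally, the group preparation signal a1 behaves like a logical AND over the preparation lines a11 to a1n. The sketch below assumes active-high level signals (1 means the processor is idle and ready); the disclosure also allows other polarities and pulse signals, and the function name `group_preparation_signal` is hypothetical.

```python
def group_preparation_signal(preparation_signals):
    """Return 1 only when every per-processor preparation line is 1."""
    if not preparation_signals:
        raise ValueError("a processing group has at least one processor")
    return int(all(preparation_signals))


a1_lines = [1, 1, 0]          # a11, a12 ready; a13 still busy
before = group_preparation_signal(a1_lines)
a1_lines[2] = 1               # the last processor becomes idle
after = group_preparation_signal(a1_lines)
print(before, after)  # 0 1
```

Because each processor drives its own line, the reduction needs no message handling by a controller; the output changes as soon as the last line is asserted.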
在一些示例中,组准备信号生成单元21可以与处理组10中的各个处理器101通过彼此独立的专用硬件线路(第一通信硬件线路)直接连接,并且同步信号生成单元22与处理组10中的各个处理器101通过彼此独立的专用硬件线路直接连接。在这种情况下,处理组10中的各个处理器101能够通过专用硬件线路发送和接收信号,由此,能够降低信号之间的干扰,提高处理器101发送和接收信号的效率。
在一些示例中,组准备信号可以经由输入同步信号发生模块2的准备信号经过“或(or)”,“与(and)”或者“取反(not)”及其组合等逻辑运算后产生。在一些示例中,组准备信号可以为脉冲信号或电平信号。
以下,结合图5和图6的具体例子对本实施方式的同步信号生成单元22进行进一步详细描述。
图5是示出了本公开的实施方式所涉及的同步并行装置S的同步信号生成单元22的功能框图。如图5所示,在本实施方式中,同步信号生成单元22与处理组10中的各个处理器101可以彼此独立地进行连接,并且能够接收由组准备信号生成单元21产生的组准备信号并根据组准备信号生成组同步信号。
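In some examples, the synchronization signal generating unit 22 may be connected to the group preparation signal generating units 21. The synchronization signal generating unit 22 can thus receive the signals from all group preparation signal generating units 21.
In some examples, the group preparation signal generating units 21 may each be connected to the synchronization signal generating unit 22 independently of one another. In other examples, the group preparation signal generating units 21 may be connected to the synchronization signal generating unit 22 via an internal Fabric.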
如图5所示,组准备信号a1、组准备信号a2、……、组准备信号am被发送至筛选器(屏蔽单元)221a1至221am。这里,筛选器有时也称为屏蔽器。各个筛选器221(筛选器221a1、筛选器221a2、……、筛选器221am)均会收到组准备信号a1、组准备信号a2、……、组准备信号am等各个组准备信号。另外,寄存器(待同步组指示单元)222a1、寄存器222a2、……、寄存器222am分别控制对应的筛选器(筛选器221a1、筛选器221a2、……、筛选器221am),并根据控制模块3所分配的计算任务的有效位确定筛选器中哪些组准备信号有效。各个筛选 器(筛选器221a1、筛选器221a2、……、筛选器221am)在接收到有效的组准备信号后分别生成启动信号(准同步信号),并发送至对应的同步信号生成器(组同步信号生成单元)(同步信号生成器223a1、同步信号生成器223a2、……、同步信号生成器223am),再由各个同步信号生成器(同步信号生成器223a1、同步信号生成器223a2、……、同步信号生成器223am)分别将组同步信号发送至各个处理器。
图6是示出了本公开的实施方式所涉及的同步并行装置S的一个同步信号生成单元22的功能框图。如图6所示,在一些示例中,同步信号生成单元22还可以包括用于接收并筛选组准备信号的筛选器221和用于控制筛选器221接收的组准备信号是否有效的寄存器222,筛选器221根据组准备信号和寄存器222的状态判断是否生成启动信号。在这种情况下,寄存器222能够通过控制筛选器221从而对接收的组准备信号进行筛选,由此,同步信号生成单元22能够通过筛选器221进行判断是否生成组同步信号。
在一些示例中,如上所述,组准备信号生成单元21所生成的组准备信号可以发送至各个筛选器221。在这种情况下,筛选器221可以对收到的组准备信号进行筛选,当接收到的信号为对应的组准备信号生成单元21所发出的信号时生成启动信号,由此,筛选器221可以彼此独立地筛选组准备信号,并独立地生成启动信号。
在另一些示例中,筛选器221可以与多个组准备信号生成单元21相对应,当筛选器221接收到对应的多个组准备信号时生成启动信号。
另外,在一些示例中,相对应的组准备信号生成单元21彼此独立地与组同步信号生成单元223进行连接。
在一些示例中,启动信号经由输入筛选器221的组准备信号经过“或(or)”,“与(and)”或者“取反(not)”及其组合等逻辑运算后产生。在一些示例中,启动信号可以为脉冲信号或电平信号。
如图6所示,以筛选器221a1和222a1为例,各个组准备信号a1、组准备信号a1、……、组准备信号am被发送至筛选器221a1。筛选器221a1根据寄存器222a1的设置(例如由控制模块3根据计算任务决定),在接收到有效的组准备信号a1后分别生成启动信号,并发送至对应的同步信号生成器(同步信号生成器223a1、同步信号生成器223a2、……、 同步信号生成器223am),再由同步信号生成器223a1将组同步信号发送至这一组内的各个处理器。
图7是示出了本公开的实施方式所涉及的同步并行装置S的寄存器222的功能框图。图8是示出了本公开的实施方式所涉及的同步并行装置S的部分信号示意图。
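As shown in FIG. 7, the register 222 may have at least flag bits corresponding to the group preparation signal generating units 21. When the flag bit corresponding to a group preparation signal generating unit 21 is set valid, the filter 221 accepts the group preparation signal generated by that unit. The group preparation signals that the filter 221 accepts can thus be selected as needed.
In some examples, the register 222 may have flag bits corresponding to the group preparation signal generating units 21. For example, the register 222 has m flag bits (flag bit t1, flag bit t2, ..., flag bit tm), where flag bit t1 corresponds to group preparation signal a1, flag bit t2 corresponds to group preparation signal a2, ..., and flag bit tm corresponds to group preparation signal am. When one flag bit is set valid, the filter 221 reads the valid flag bit and masks the other, invalid group preparation signals, so that it generates the start signal when it receives the valid group preparation signal.
In addition, when multiple flag bits of the registers 222 are set valid, multiple groups perform synchronous parallel computation together. For example, suppose flag bits t1 and t2 of register 222a1 are valid. For the filter 221a1 controlled by register 222a1 to generate the start signal, it must then receive both group preparation signal a1 from group preparation signal generating unit 21a1 and group preparation signal a2 from group preparation signal generating unit 21a2. In this case, after the processors in processing group 10a1 have entered the idle state, they must wait until all processors in processing group 10a2 have also entered the idle state before the start signal, and then the group synchronization signal, can be generated. Setting the flag bits of the register 222 therefore controls which group preparation signals the filter 221 treats as valid, and in turn when the group synchronization signal is generated.
In some examples, when multiple flag bits of the register 222 are set valid, the filter 221 generates the start signal after receiving the group preparation signals of the multiple group preparation signal generating units 21 corresponding to those flag bits. The filter 221 thereby acts as the barrier mechanism in the generation of the group synchronization signal, and the generation time of the group synchronization signal can be controlled by controlling the flag bits of the register 222 and thus the filter 221.
As described above, while flag bits t1 and t2 of register 222a1 are valid, flag bits t1 and t2 of register 222a2 are likewise set valid. Both filter 221a1 and filter 221a2 then need to receive group preparation signal a1 from group preparation signal generating unit 21a1 and group preparation signal a2 from group preparation signal generating unit 21a2 before they can generate their start signals. Setting the flag bits of the registers 222 in this way lets multiple filters 221 generate start signals at the same time, so that multiple synchronization signal generating units 22 generate group synchronization signals simultaneously and multiple processing groups 10 enter the non-idle state simultaneously.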
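The interaction between several registers and filters can be sketched as a masked AND per filter. This is an illustrative model only: the function `start_signal`, the list encoding of the flag bits, and the choice to treat an all-zero mask as "never start" are assumptions, not details given in the disclosure.

```python
def start_signal(mask, group_prep):
    """Model of a filter 221: start only when every unmasked group is ready.

    mask and group_prep are equal-length lists of 0/1; mask[i] == 1 means
    the flag bit for group i+1 is valid.
    """
    if len(mask) != len(group_prep):
        raise ValueError("mask and group_prep must have the same length")
    selected = [p for m, p in zip(mask, group_prep) if m]
    # Assumption: a filter with no valid flag bit never generates a start signal.
    return int(bool(selected) and all(selected))


masks = {"221a1": [1, 1, 0], "221a2": [1, 1, 0], "221a3": [0, 0, 1]}
prep = [1, 0, 1]                      # group a2 is still busy
early = {name: start_signal(m, prep) for name, m in masks.items()}
prep[1] = 1                           # group a2 becomes idle
late = {name: start_signal(m, prep) for name, m in masks.items()}
print(early)  # {'221a1': 0, '221a2': 0, '221a3': 1}
print(late)   # {'221a1': 1, '221a2': 1, '221a3': 1}
```

Filters 221a1 and 221a2 share the same mask, so they fire in the same step and groups a1 and a2 are released together, while group a3 synchronizes on its own schedule.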
在一些示例中,寄存器222可以选自多功能寄存器、指针寄存器、变址寄存器、专用寄存器、段寄存器、控制寄存器、调试寄存器、任务寄存器、浮点寄存器、多媒体寄存器、单指令流多数据流寄存器中的一种或多种。优选地,寄存器222可以为控制寄存器。
在一些示例中,启动信号经由输入筛选器221的组准备信号和寄存器222标记位信号经过“或(or)”,“与(and)”或者“取反(not)”及其组合等逻辑运算后产生。在一些示例中,标记位信号可以为脉冲信号或电平信号。
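The filter according to this embodiment is described in further detail below with reference to FIG. 7 and FIG. 8.
As shown in FIG. 7, taking filter 221a1 as an example, group preparation signals a1, a2, ..., am are sent to filter 221a1. When two flag bits in register 222a1 (assumed to correspond to group preparation signal generating units 21a1 and 21a2) are set valid, filter 221a1 masks all group preparation signals other than a1 and a2 and, after receiving group preparation signals a1 and a2, generates start signal a1 and sends it to the corresponding synchronization signal generator 223a1, which then issues group synchronization signal a1 to each processor 101.
FIG. 8 shows how the level signals at the inputs and outputs shown in FIG. 7 change over time. In the example of FIG. 8, filter 221a1 receives group preparation signals a1, a2, ..., am together with the flag bits (flag bit t1, flag bit t2, ..., flag bit tm) in register 222a1 that correspond to those signals. Filter 221a1 generates start signal a1 according to which flag bits of register 222a1 are valid. For example, if flag bits t1 and t2 of register 222a1 are valid and the other flag bits t3, t4, ..., tm are all invalid, filter 221a1 accepts only group preparation signals a1 and a2 and masks the other group preparation signals. Filter 221a1 then generates start signal a1 based on group preparation signals a1 and a2.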
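The timing behaviour described for FIG. 8 can be approximated by evaluating the masked AND at each time step. The traces below are invented sample data, not values read from FIG. 8, and the function `start_trace` and the active-high level encoding are assumptions made for this sketch.

```python
def start_trace(flags, prep_traces):
    """Per-tick start signal: AND over the group lines whose flag bit is valid."""
    ticks = len(next(iter(prep_traces.values())))
    selected = [name for name, valid in flags.items() if valid]
    return [
        int(bool(selected) and all(prep_traces[name][t] for name in selected))
        for t in range(ticks)
    ]


flags = {"a1": 1, "a2": 1, "a3": 0}       # t1 and t2 valid, t3 invalid
prep_traces = {
    "a1": [0, 1, 1, 1, 0],
    "a2": [0, 0, 0, 1, 0],
    "a3": [1, 1, 0, 0, 0],                # masked, so it has no effect
}
trace = start_trace(flags, prep_traces)
print(trace)  # [0, 0, 0, 1, 0]
```

The start signal rises only at the tick where both a1 and a2 are high, and the masked line a3 never influences it, which matches the masking behaviour described for filter 221a1.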
在一些示例中,由同步信号生成单元22所生成的组同步信号可以被同时发送至处理组10中的各个处理器101。由此,能够确保处理组10中的各个处理器101同时开始运算。
在一些示例中,同步信号生成单元22还可以包括与筛选器221连接并接收启动信号的同步信号生成器223,同步信号生成器223根据启动信号生成组同步信号,并发送至处理组10中的各个处理器101,同步信号生成器223与处理组10中的各个处理器101彼此独立地进行连接,例如通过第二通信硬件线路进行连接。在这种情况下,同步信号生成器223仅与对应的处理组10中的各个处理器101分别独立地进行连接,由此,能够确保生成的组同步信号准确地发送至对应的处理组10中的各个处理器101,且提高了组同步信号的传输速率。
在一些示例中,每个筛选器221均连接有一个同步信号生成器223。进一步地,由于筛选器221与处理组10对应,因此与筛选器221对应的同步信号生成器223与处理组10对应。
在一些示例中,同步信号生成器223可以与对应的处理组10中的各个处理器101彼此独立地进行连接。在这种情况下,组同步信号能够直接发送至与同步信号生成器223对应的处理组10中的各个处理器101,由此,能够进一步降低信号传输的延时。
在一些示例中,组同步信号经由输入同步信号生成器223的启动信号经过“或(or)”,“与(and)”或者“取反(not)”及其组合等逻辑运算后产生。在一些示例中,组同步信号可以为脉冲信号或电平信号。
在一些示例中,寄存器222的标记位可以由控制模块3控制。由此,能够通过控制模块3控制寄存器222的标记位从而控制寄存器222的状态。
在一些示例中,控制模块3可以通过Fabirc与处理模块1和同步信号发生模块2进行通信连接。在这种情况下,控制模块3能够通过Fabirc与处理模块1和同步信号发生模块2进行数据交换。在一些示例 中,处理模块1中的处理器101在计算完成后可以通过Fabirc发送计算结果。
在另一些示例中,控制模块3还可以包括无线通信单元。在这种情况下,控制模块3可以通过无线信号发送和接收数据。
在一些示例中,控制模块3可以为芯片的顶层微控制单元(MCU)、位于芯片外的主控电路(Host)、其他芯片或者其他程序应用(Server)。
以下,结合图9分析基于上述多核架构的同步并行装置S的同步并行控制方法。图9是示出了本公开的实施方式所涉及的同步并行方法的流程图。
如图9所示,基于上述多核架构的同步并行装置S的同步并行控制方法包括以下步骤:当节点组中的第一节点进入空闲状态时,第一节点向第一节点所在的组准备信号生成单元21发送准备信号(步骤S100);响应于节点组中所有节点均发送了准备信号,组准备信号生成单元21生成待启动信号(步骤S200);同步信号生成单元22根据待启动信号生成同步信号(步骤S300);节点组中的所有节点响应于接收到的同步信号,开始同步(步骤S400)。
在本公开所涉及的同步并行控制方法中,处理组(节点组)10中的处理器(节点)101在进入空闲状态时彼此独立地向同步信号发生模块2中与该处理组10对应的组准备信号生成单元21发送准备信号,当组准备信号生成单元21接收到与之对应的处理组10中全部处理器101的准备信号后生成组准备信号,再由同步信号生成单元22接收组准备信号并生成对应的组同步信号,最后发送至该处理组10中的全部处理器101,使得处理器101开始同步,由此,同步信号生成单元22能够同时对多个组准备信号处理,以及组准备信号单元和同步信号生成单元22能够独立地进行信号的传递。
在步骤S100中,当处理组10中的各个处理器101进入空闲状态时,使该处理组10中的各个处理器101发出准备信号。这里,处理组10和处理组10中的各个处理器101的设置方式可以具体参见上述处理模块1的描述,这里不再赘述。
在一些示例中,当处理器101开始同步时该处理器101进行计算或传输,并且在完成计算后进入空闲状态。在这种情况下,完成计算 和传输后的处理器101可以等待其他计算或传输中的处理器101,由此,能够使得处理器101在下一个组同步信号到来时同时开始同步。
在步骤S200中,通过组准备信号生成单元21接收处理组10中各个处理器101发出的准备信号,并且当组准备信号生成单元21接收到该处理组10中全部处理器101所发出的准备信号时,使组准备信号生成单元21生成组准备信号。这里,组准备信号生成单元21的设置方式具体参见上述组准备信号生成单元21的描述,这里不再赘述。
在步骤S300和步骤S400中,通过同步信号生成单元22接收组准备信号,并且根据组准备信号生成组同步信号并发送给处理组10中的各个处理器101,当该处理组10中的各个处理器101接收到组同步信号时进入同步。这里,同步信号生成单元22的设置方式具体参见上述同步信号生成单元22的描述,这里不再赘述。
在一些示例中,将计算任务分配给处理组10中的各个处理器101,并且使用该处理组10中的各个处理器101进行运算。由此,处理组10中的各个处理器101能够在本地完成运算。
在一些示例中,当所述控制模块3控制所述寄存器以使该寄存器的标记位有效时,具有该寄存器的筛选器接收与该标记位对应的组准备信号生成单元所生成的组准备信号。由此,能够通过控制寄存器进而控制筛选器接收的组准备信号的有效性。
另外,在一些示例中,本公开还提供一种计算设备,该计算设备包括处理器和存储器,该处理器执行所述存储器存储的计算机指令,使得所述计算设备执行上述本公开所描述的并行控制方法。
另外,在一些示例中,本公开还提供一种计算机可读存储介质,其存储有计算机程序,并且当该计算机程序被处理器执行时实现上述本公开所描述的同步并行控制方法的步骤。
此外,在一些示例中,本公开还提供一种计算机程序产品,其包括计算机指令,当该计算机指令被计算设备执行时,所述计算设备可以执行上述本公开所描述的同步并行控制方法。
在上述示例中,对各个示例的描述都各有侧重,某个示例中没有详述的部分,可以参见其它示例的相关描述。
需要说明的是,对于前述的各方法示例,为了简单描述,有时将 其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤有可能采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的示例均属于优选的例子,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个示例中,应该理解到,所公开的装置,可通过其它的方式来实现。例如,以上所描述的装置示例仅仅是示意性的,例如上述单元的划分,也仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本示例方案的目的。
另外,在本申请各示例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。

Claims (17)

  1. 一种同步信号产生电路,其特征在于:所述同步信号产生电路用于为M个节点组产生同步信号,所述节点组中包括至少一个节点,所述M为大于等于1的整数;
    所述同步信号产生电路包括:同步信号生成单元和M个组准备信号生成单元;
    所述M个组准备信号生成单元与所述M个节点组一一对应;
    所述M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,所述K为大于等于1的整数;
    所述M个组准备信号生成单元的输出端与所述同步信号生成单元相连接;
    所述同步信号生成单元根据所述第一待启动信号生成第一同步信号,所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步。
  2. 根据权利要求1所述的同步信号产生电路,其特征在于:所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,包括:
    所述第一准备信号生成单元用于根据所述待同步的第一节点组中的全部K个节点的准备信号生成所述第一待启动信号。
  3. 根据权利要求1或2所述的同步信号产生电路,其特征在于:所述同步信号生成单元包括:M个屏蔽单元、M个待同步组指示单元和M个组同步信号生成单元;
    所述M个待同步组指示单元分别与所述M个屏蔽单元相连接;
    所述M个屏蔽单元中的每个屏蔽单元的输入端与所述M个组准备信号生成单元的输出端相连接;
    所述M个屏蔽单元的输出端分别与所述M个组同步信号生成单元中对应的组同步信号生成单元相连接;
    所述M个屏蔽单元中的第一屏蔽单元根据连接在其上的第一待同 步组指示单元的指示,输出第一组的准同步信号;
    所述M个组同步信号产生单元中的第一组同步信号生成单元根据所述第一组的准同步信号生成所述第一组的同步信号。
  4. 根据权利要求3所述的同步信号产生电路,其特征在于:所述待同步组指示单元包括寄存器;
    所述寄存器包括至少M个寄存器位,所述M个寄存器位与所述M个节点组一一对应,所述M个寄存器位中与所述待同步的第一节点组对应的寄存器位被配置为第一值,所述M个寄存器位中与所述M个节点组中除所述待同步的第一节点组之外的节点组对应的寄存器位被配置为第二值。
  5. 根据权利要求1-4任一项所述的同步信号产生电路,其特征在于:所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步,包括:所述第一同步信号用于指示所述第一节点组内的所述K个节点同时开始计算,或同时开始传输数据。
  6. 一种芯片,包括如权利要求1-5所述的同步信号产生电路,以及N个处理节点,所述N个处理节点被分为M个处理节点组,其中,所述N为大于1的整数,M小于等于N。
  7. 根据权利要求6所述的芯片,其特征在于:还包括N个第一通信硬件线路,所述N个第一通信硬件线路用于传输从所述N个处理节点发送至对应的组准备信号生成单元的准备信号。
  8. 根据权利要求7所述的芯片,其特征在于:还包括N个第二通信硬件线路,所述N个第二通信硬件线路用于传输从同步信号生成单元发送至对应处理节点的同步信号。
  9. 根据权利要求6-8任一项所述的芯片,其特征在于:还包括控制单元,所述控制单元用于改变寄存器的设置。
  10. 根据权利要求6-9任一项所述的芯片,其特征在于:还包括控制单元,所述控制单元用于控制处理节点组中的各个处理节点的任务的执行和分配。
  11. 根据权利要求6-10任一项所述的芯片,其特征在于:所述N个处理节点包括RISC-V核。
  12. 一种基于多核架构的同步并行控制方法,基于权利要求1-5 任一项所述的同步信号产生电路的同步并行控制方法,其特征在于,包括:
    当所述节点组中的第一节点进入空闲状态时,所述第一节点向所述第一节点所在的组准备信号生成单元发送准备信号;
    响应于所述节点组中所有节点均发送了准备信号,所述组准备信号生成单元生成待启动信号;
    同步信号生成单元根据所述待启动信号生成同步信号;
    所述节点组中的所有节点响应于接收到的所述同步信号,开始同步。
  13. 一种基于多核架构的同步并行装置,其特征在于,包括:
    处理模块,其具有N个处理节点,所述N个处理节点被分为M个处理节点组,其中,所述N为大于1的整数,M小于等于N;以及
    同步信号发生模块,其包括同步信号生成单元和M个组准备信号生成单元;所述M个组准备信号生成单元与所述M个节点组一一对应;所述M个组准备信号生成单元中的第一组准备信号生成单元与待同步的第一节点组中的K个节点相连接;所述第一组准备信号生成单元用于为所述待同步的第一节点组生成第一待启动信号,所述K为大于等于1的整数;所述M个组准备信号生成单元的输出端与所述同步信号生成单元相连接;所述同步信号生成单元根据所述第一待启动信号生成第一同步信号,所述第一同步信号用于指示所述第一节点组内的所述K个节点开始同步。
  14. 根据权利要求13所述的同步并行装置,其特征在于:
    所述处理节点为处理电路、处理器、处理芯片和服务器中的至少一种。
  15. 一种计算设备,其特征在于:
    包括处理器和存储器,所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行权利要求12所述的同步并行控制方法。
  16. 一种计算机可读存储介质,其特征在于:
    存储有计算机程序,并且当所述计算机程序被处理器执行时实现包括权利要求12所述的同步并行控制方法的步骤。
  17. 一种计算机程序产品,其特征在于:
    包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行权利要求12所述的同步并行控制方法。
PCT/CN2020/096390 2019-08-23 2020-06-16 多核架构的同步信号产生电路、芯片和同步方法及装置 WO2021036421A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20856466.6A EP3989038A4 (en) 2019-08-23 2020-06-16 MULTICORE SYNCHRONOUS SIGNAL GENERATION CIRCUIT, CHIP, AND SYNCHRONIZATION METHOD AND DEVICE
US17/587,770 US12072730B2 (en) 2019-08-23 2022-01-28 Synchronization signal generating circuit, chip and synchronization method and device, based on multi-core architecture

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910785053.1 2019-08-23
CN201910785053.1A CN112416053B (zh) 2019-08-23 2019-08-23 多核架构的同步信号产生电路、芯片和同步方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/587,770 Continuation US12072730B2 (en) 2019-08-23 2022-01-28 Synchronization signal generating circuit, chip and synchronization method and device, based on multi-core architecture

Publications (1)

Publication Number Publication Date
WO2021036421A1 true WO2021036421A1 (zh) 2021-03-04

Family

ID=74684093

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096390 WO2021036421A1 (zh) 2019-08-23 2020-06-16 多核架构的同步信号产生电路、芯片和同步方法及装置

Country Status (4)

Country Link
US (1) US12072730B2 (zh)
EP (1) EP3989038A4 (zh)
CN (1) CN112416053B (zh)
WO (1) WO2021036421A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023212200A1 (en) * 2022-04-29 2023-11-02 Tesla, Inc. Enhanced global flags for synchronizing coprocessors in processing system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100177828A1 (en) * 2009-01-12 2010-07-15 Maxim Integrated Products, Inc. Parallel, pipelined, integrated-circuit implementation of a computational engine
CN103377032A (zh) * 2012-04-11 2013-10-30 浙江大学 Fine-grained parallel processing device for scientific computing based on a heterogeneous multi-core chip
US20160364835A1 (en) * 2015-06-10 2016-12-15 Mobileye Vision Technologies Ltd. Image processor and methods for processing an image
CN106547237A (zh) * 2016-10-24 2017-03-29 华中光电技术研究所(中国船舶重工集团公司第七七研究所) Navigation solution device based on a heterogeneous multi-core architecture
WO2019126921A1 (en) * 2017-12-25 2019-07-04 Intel Corporation Pre-memory initialization multithread parallel computing platform

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5083265A (en) * 1990-04-17 1992-01-21 President And Fellows Of Harvard College Bulk-synchronous parallel computer
US5434995A (en) * 1993-12-10 1995-07-18 Cray Research, Inc. Barrier synchronization for distributed memory massively parallel processing systems
JP3532037B2 (ja) * 1996-07-31 2004-05-31 富士通株式会社 Parallel computer
US6216174B1 (en) * 1998-09-29 2001-04-10 Silicon Graphics, Inc. System and method for fast barrier synchronization
JP5549694B2 (ja) * 2012-02-23 2014-07-16 日本電気株式会社 Massively parallel computer, synchronization method, and synchronization program
FR3021429B1 (fr) * 2014-05-23 2018-05-18 Kalray Hardware synchronization barrier between processing elements
US9760410B2 (en) * 2014-12-12 2017-09-12 Intel Corporation Technologies for fast synchronization barriers for many-core processing
GB2569269B (en) * 2017-10-20 2020-07-15 Graphcore Ltd Synchronization in a multi-tile processing arrangement
US20220070801A1 (en) * 2019-01-16 2022-03-03 Nec Corporation Monitoring system and synchronization method
JP2021043737A (ja) * 2019-09-11 2021-03-18 富士通株式会社 Barrier synchronization system, barrier synchronization method, and parallel information processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3989038A4 *

Also Published As

Publication number Publication date
EP3989038A4 (en) 2022-08-24
EP3989038A1 (en) 2022-04-27
US12072730B2 (en) 2024-08-27
US20220147097A1 (en) 2022-05-12
CN112416053A (zh) 2021-02-26
CN112416053B (zh) 2023-11-17

Similar Documents

Publication Publication Date Title
US9971635B2 (en) Method and apparatus for a hierarchical synchronization barrier in a multi-node system
US7984448B2 (en) Mechanism to support generic collective communication across a variety of programming models
US20220070087A1 (en) Sync Network
US8661440B2 (en) Method and apparatus for performing related tasks on multi-core processor
US8339869B2 (en) Semiconductor device and data processor
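JP2006518058A (ja) Pipeline accelerator for improved computing architecture, related system, and method
JPWO2008155806A1 (ja) Barrier synchronization method, device, and multi-core processor
CN116541227B (zh) Fault diagnosis method and device, storage medium, electronic device, and BMC chip
US11327813B2 (en) Sync group selection
WO2021036421A1 (zh) Synchronization signal generating circuit, chip and synchronization method and device, based on multi-core architecture
JP2008059192A (ja) Simulator for hardware/software co-verification
US20080126472A1 (en) Computer communication
US6678749B2 (en) System and method for efficiently performing data transfer operations
US20240231956A9 (en) Apparatus and method for synchronizing participants of a simulation
CN112860622B (zh) Processing system and system on chip
JPH11306149A (ja) Parallel arithmetic processing device and method therefor
US12073262B2 (en) Barrier synchronization between host and accelerator over network
CN113568665B (zh) Data processing device
CN117112466B (zh) Data processing method, apparatus, device, storage medium, and distributed cluster
JPH1185673A (ja) Method and device for controlling a shared bus
Essig et al. On-demand instantiation of co-processors on dynamically reconfigurable FPGAs
CN113721703A (zh) Clock synchronization control device, system, and control method in a multi-way CPU system
JP2003099397A (ja) Data processing system
CN114003116A (zh) Reset circuit, system, method, electronic device, and storage medium
CN117581218A (zh) Embedded processor supporting fixed-function kernels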

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20856466; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2020856466; Country of ref document: EP; Effective date: 20220124)
NENP Non-entry into the national phase (Ref country code: DE)