WO2021243490A1 - Processor, processing method, and related device - Google Patents

Processor, processing method, and related device (一种处理器、处理方法及相关设备)

Info

Publication number: WO2021243490A1
Authority: WO (WIPO, PCT)
Prior art keywords: unit, instruction, calculation, graph, graph calculation
Application number: PCT/CN2020/093627
Other languages: English (en), French (fr), Chinese (zh)
Inventors: 周昔平, 周若愚, 朱凡, 孙文博
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to JP2022573476A (JP7495030B2)
Priority to EP20939004.6A (EP4152150B1)
Priority to BR112022024535A (BR112022024535A2)
Priority to PCT/CN2020/093627 (WO2021243490A1)
Priority to CN202080101335.6A (CN115668142A)
Publication of WO2021243490A1
Priority to US18/070,781 (US20230093393A1)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30087 Synchronisation or serialisation instructions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the technical field of directed graph computing, and in particular to a processor, a processing method, and related devices.
  • The superscalar central processing unit (CPU) architecture implements instruction-level parallelism within a single processor core: it improves performance by issuing multiple instructions per cycle, and resolves the dependencies between the parallelized instructions through hardware logic units. This technique achieves higher CPU throughput at the same CPU frequency.
  • In the prior art, the SEED architecture combines a graph computing accelerator and a superscalar processor into a hybrid architecture. A set of caches is shared between the graph computing accelerator and the superscalar processor, and a communication bus connects them: register values that are live on entry (liveIn) are passed from the superscalar processor to the graph computing accelerator, and after the accelerator finishes computing, the values that are live on exit (liveOut) are passed back over a separate communication bus; when the graph operation completes, the result is written back into the registers of the superscalar processor.
  • The SEED architecture thus works in accelerator mode, that is, the graph computing accelerator and the general-purpose processor are completely independent systems. The graph computing accelerator has its own independent data and instruction input channels, while the superscalar processor communicates with it through a message channel or shared memory.
  • As a result, the communication delay between the superscalar processor and the graph computing accelerator is relatively large, and the SEED architecture cannot further improve the parallelism between the two pieces of hardware (the graph computing accelerator and the superscalar processor). It therefore cannot improve the operating efficiency and overall performance of the architecture, which greatly reduces the usability of the graph computing accelerator.
  • In view of this, the embodiments of the present invention provide a processor, a processing method, and related devices that realize the function of accelerating the operation of a general-purpose processor through a graph computing mode.
  • In a first aspect, an embodiment of the present invention provides a processor including a processor core; the processor core includes an instruction scheduling unit, a graph computing flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit.
  • The instruction scheduling unit is configured to allocate the general computing instructions among the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and to allocate the graph computing control instructions among the decoded instructions to be executed to the graph computing flow unit; a general computing instruction instructs the execution of a general computing task, and a graph computing control instruction instructs the execution of a graph computing task.
  • The at least one general-purpose arithmetic unit is configured to execute the general computing instructions, and the graph computing flow unit is configured to execute the graph computing control instructions.
  • The embodiment of the present invention provides a processor that realizes the function of accelerating the operation of a general-purpose processor through a graph computing mode; the design comprises a hardware part and a software part.
  • In terms of hardware, the embodiment of the present invention adds a graph computing flow unit to the processor core (Core) and places it, together with the other general-purpose arithmetic units (such as the arithmetic logic unit and the floating-point unit), at the execute-pipeline stage of the processor, so that the processor can execute instructions by invoking the graph computing flow unit alone, or by invoking the graph computing flow unit and the other general-purpose arithmetic units concurrently, thereby achieving graph computing acceleration.
  • In terms of software, the embodiment of the present invention extends the instruction set of the general-purpose processor with instructions dedicated to graph computing acceleration (for example, the graph computing control instructions), and at the instruction-scheduling stage the instruction scheduling unit in the processor core dispatches the graph computing control instructions directly to the graph computing flow unit for execution.
  • Because the instruction scheduling unit in the core can be connected to the graph computing flow unit and communicate with it directly, the graph computing control instructions are dispatched directly to the graph computing flow unit without going through other message channels or memory reads and writes, which greatly reduces the communication delay; at the same time, because the graph computing flow unit sits inside the processor core, synchronous or asynchronous operation between it and the other arithmetic units can be controlled, which improves the parallelism and computing efficiency of the processor.
  • In other words, this application integrates the graph computing architecture into a general-purpose processor as an execution unit inside the general-purpose processor core, which at the execute-pipeline stage can execute graph computing tasks independently or concurrently with the other general-purpose arithmetic units, realizing the function of the graph computing flow unit and one or more general-purpose arithmetic units cooperating within the same processor to perform computing tasks efficiently.
  • In a possible implementation, the processor core further includes an instruction fetch unit configured to acquire the target program to be executed, and an instruction decoding unit configured to decode the target program to obtain the decoded instructions to be executed.
  • In this implementation, the processor also includes a memory unit outside the processor core that stores the target program to be executed. The instruction fetch unit in the core obtains the target program from the memory unit, and the instruction decoding unit in the core decodes it into instructions that the execution units in the processor (such as the general-purpose arithmetic units and the graph computing flow unit) can recognize, after which the decoded instructions can be dispatched to the corresponding execution units for execution.
  • In a possible implementation, the processor core further includes a result write-back unit, and the graph computing flow unit and the at least one general-purpose arithmetic unit are each connected to the result write-back unit. The at least one general-purpose arithmetic unit is further configured to send the first execution result of the general computing task, that is, the result obtained by executing the general computing instructions, to the result write-back unit; the graph computing flow unit is further configured to send the second execution result of the graph computing task, that is, the result obtained by executing the graph computing control instructions, to the result write-back unit; and the result write-back unit is configured to write part or all of the first execution result and the second execution result back to the instruction scheduling unit.
  • The result write-back unit can temporarily store the results computed by each general-purpose arithmetic unit or by the graph computing flow unit, and write some or all of those results back to the instruction scheduling unit for use when the instruction scheduling unit schedules dependent instructions.
  • The result write-back unit can also reorder the results produced by out-of-order execution, for example reordering the results of instructions according to the order in which the instructions were fetched, and committing an instruction only after the instructions ahead of it have been executed, thereby completing the operation result of the whole instruction stream.
  • Because the instruction scheduling unit in the processor core has the authority and the means to obtain the relevant operating state of the graph computing flow unit (that is, the intermediate or final results temporarily stored in the result write-back unit), it can better control and access the graph computing flow unit, and can control synchronous or asynchronous operation between it and the other execution units, which improves the parallelism and operating efficiency of the processor.
  • In a possible implementation, the processor further includes a memory unit; the graph computing flow unit includes N compute nodes; and the graph computing control instructions include a start-composition instruction that carries a target address in the memory unit. The graph computing flow unit is specifically configured to receive the start-composition instruction and to read composition block information from the memory unit according to the target address, where the composition block information includes the operation method of each of the N compute nodes and the connection and sequence information between the N compute nodes.
  • In other words, the start-composition instruction directs the graph computing flow unit to read, from the target address in the memory unit outside the processor core, the stored composition block information: the computation method of each compute node in the graph computing flow unit, and the dependency relationships among the compute nodes, that is, how the result of one compute node feeds the input of an associated compute node (the two compute nodes corresponding to an edge of the graph).
  • Based on this information, the graph computing flow unit can complete the computation of one complete composition block. Such a composition block may be one of, or all of, the composition blocks in the graph computation; that is, a complete graph computing task may consist of one composition block or of multiple composition blocks split from it.
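  • For illustration only, the composition block information might be organized as in the following minimal sketch (the Python form and the field names are assumptions for exposition, not the patent's binary layout):

```python
# Illustrative layout of composition block information as described above:
# the operation method of each of the N compute nodes, plus the connection
# (edge) information between nodes. Field names are assumptions.

composition_block = {
    "nodes": {                  # node id -> operation method of that node
        0: "add",
        1: "sub",
        2: "mul",
    },
    "edges": [                  # (source node, destination node, input slot)
        (0, 2, "left"),         # result of node 0 -> left input of node 2
        (1, 2, "right"),        # result of node 1 -> right input of node 2
    ],
}
```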
  • In a possible implementation, the graph computing control instructions include a parameter-transfer instruction that carries the identifiers of M compute nodes and the input parameters corresponding to those identifiers, where the M compute nodes are some or all of the N nodes. The graph computing flow unit is configured to receive the parameter-transfer instruction and to feed the input parameters corresponding to the identifiers of the M compute nodes into the M compute nodes respectively.
  • The parameter-transfer instruction thus delivers the initial input parameters that multiple compute nodes need for the computation of a composition block. Once these compute nodes have obtained their parameters from outside the graph computing flow unit, the graph computing flow unit satisfies the conditions for starting the graph computing task, and the graph computation can begin.
  • In a possible implementation, the connection and sequence information between the N compute nodes includes the source node and destination node corresponding to each of L edges. The graph computing flow unit is specifically configured to: monitor whether the input parameters required by each of the N compute nodes are ready; for a target compute node whose input parameters are ready, feed the input parameters of the target compute node into the operation method corresponding to the target compute node to obtain its result; and, according to the source node and destination node corresponding to each of the L edges, pass the result of each edge's source node as an input parameter to the corresponding destination node.
  • A compute node can start computing once its inputs are ready. While some compute nodes (such as the source nodes of edges) obtain their initial input parameters from outside the graph computing flow unit, other compute nodes (such as the destination nodes of edges) may need to wait until the compute nodes they are associated with (their source nodes) finish, and then take those results as their own input parameters. The start time of each compute node may therefore differ, but every compute node can start computing as soon as its operation method and its input parameters (which may include a left input parameter, a right input parameter, or a condition parameter) are ready.
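  • The following minimal sketch (illustrative only; the class and function names are assumptions, and the real behaviour is implemented in hardware) simulates the firing rule described above: a node computes once all its inputs are ready, and its result is forwarded along each edge to the destination node's input slot:

```python
# Sketch (not the patented hardware) of the dataflow firing rule described
# above: a node fires once its operation and all of its inputs are ready,
# and its result is forwarded along each edge to the destination node.

import operator

class ComputeNode:
    def __init__(self, op, n_inputs=2):
        self.op = op                      # operation method of this node
        self.inputs = [None] * n_inputs   # left/right input slots
        self.result = None

    def ready(self):
        return self.result is None and all(v is not None for v in self.inputs)

def run_composition_block(nodes, edges, initial_params):
    """nodes: {node_id: ComputeNode}; edges: [(src_id, dst_id, dst_slot)];
    initial_params: {(node_id, slot): value}, as delivered by the
    parameter-transfer instruction."""
    for (nid, slot), value in initial_params.items():
        nodes[nid].inputs[slot] = value
    fired = True
    while fired:                          # monitor until no node can fire
        fired = False
        for nid, node in nodes.items():
            if node.ready():
                node.result = node.op(*node.inputs)
                for src, dst, slot in edges:       # forward along edges
                    if src == nid:
                        nodes[dst].inputs[slot] = node.result
                fired = True
    return {nid: n.result for nid, n in nodes.items()}

# (a+b) * (c-d): nodes 0 and 1 are on different paths and could fire in
# parallel in hardware; node 2 fires once both results have arrived.
nodes = {0: ComputeNode(operator.add), 1: ComputeNode(operator.sub),
         2: ComputeNode(operator.mul)}
edges = [(0, 2, 0), (1, 2, 1)]
params = {(0, 0): 3, (0, 1): 4, (1, 0): 10, (1, 1): 2}
print(run_composition_block(nodes, edges, params))  # {0: 7, 1: 8, 2: 56}
```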
  • In a possible implementation, the graph computing control instructions include a start-graph-computation instruction. The graph computing flow unit is specifically configured to: after receiving the start-graph-computation instruction, check whether the composition block information it has read matches the address of the pre-started composition block, and whether the input parameters of the M compute nodes have been entered; if the composition block information matches the pre-started composition block address and the input parameters of the M compute nodes have been entered, start executing the graph computing task.
  • The start-graph-computation instruction thus triggers the graph computing flow unit to perform the relevant checks before computation begins (for example, checking that the composition block information is correct and that the initial input parameters are in place). Once these checks pass, the graph computing flow unit determines that composition is complete and can start the graph computing task.
  • In a possible implementation, the instruction scheduling unit is further configured to put the processor core into a blocked state after the graph computing flow unit receives the start-graph-computation instruction and before the graph computing task is completed. Further optionally, the instruction scheduling unit is also configured to take the processor core out of the blocked state after the graph computing flow unit completes the graph computing task.
  • In other words, the processor core can start the graph computing function in a synchronous manner (the graph computing flow unit and the other general-purpose arithmetic units execute tasks serially): while the graph computing flow unit executes the graph computing task, the pipeline of the processor core is blocked, and the blocked state is not exited until the graph computing flow unit completes the task. This guarantees that during this period only the graph computing flow unit executes while the other arithmetic units are temporarily idle, thereby reducing the power consumption of the processor.
  • This instruction realizes the switching of the computing mode between the other arithmetic units in the processor core and the graph computing flow unit, and is applicable to programs that run synchronously.
  • In a possible implementation, the instruction scheduling unit is further configured to send a synchronize-execution-result instruction to the graph computing flow unit, and to put the processor core into a blocked state after the graph computing flow unit receives the synchronize-execution-result instruction and before it completes the graph computing task. Further optionally, the instruction scheduling unit is also configured to take the processor core out of the blocked state after the graph computing flow unit completes the graph computing task.
  • In other words, the processor core can also start the graph computing function in an asynchronous manner (the graph computing flow unit and the other general-purpose arithmetic units execute tasks in parallel): while the graph computing flow unit executes the graph computing task, the pipeline of the processor core is not blocked, and the other arithmetic units run normally. Only when the processor sends a synchronize-execution-result instruction to the graph computing flow unit through the instruction scheduling unit (for example, when the operations of the other arithmetic units need to depend on the graph computing result), and the graph computing flow unit has not yet completed the graph computing task, does the pipeline of the processor core begin to block; it exits the blocked state once the graph computing flow unit completes the task and feeds back the execution result. This guarantees that when the other arithmetic units need the execution result of the graph computing flow unit, they wait for it before continuing to run, while preserving the parallelism of the processor core.
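  • The following minimal sketch (illustrative only; the names and the use of software threads are assumptions, since the real mechanism is a hardware pipeline block) contrasts the two start modes: in synchronous mode the core blocks for the whole graph task, while in asynchronous mode it blocks only if the synchronize-execution-result request arrives before the task has finished:

```python
# Sketch of the synchronous vs. asynchronous start modes described above.
# All names are illustrative; the real behaviour is a hardware pipeline
# block, modeled here with a software event for exposition.

import threading

class GraphFlowUnit:
    def __init__(self):
        self.done = threading.Event()

    def start_graph(self, task, blocking):
        self.done.clear()
        worker = threading.Thread(target=lambda: (task(), self.done.set()))
        worker.start()
        if blocking:              # synchronous mode: core blocks right away
            self.done.wait()

    def sync_result(self):
        # asynchronous mode: the core blocks here only if the graph task
        # is still running when its result is first needed
        self.done.wait()

gfu = GraphFlowUnit()
gfu.start_graph(lambda: print("graph task running"), blocking=False)
print("other arithmetic units keep running")  # overlaps with the graph task
gfu.sync_result()                             # block until the result is ready
```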
  • In a possible implementation, the processor core further includes a result write-back unit that contains multiple registers, and the graph computing flow unit and the at least one general-purpose arithmetic unit are each connected to the result write-back unit. The graph computing control instructions include a parameter-return instruction that carries the identifiers of K compute nodes and the registers corresponding to those identifiers; the graph computing flow unit is specifically configured to control the results of the K compute nodes to be sent to the corresponding registers in the result write-back unit.
  • That is, some compute nodes need to output their results to the result write-back unit after the final computation of the composition block is completed. According to the identifiers of the K compute nodes carried in the parameter-return instruction of the received graph computing control instructions, the graph computing flow unit controls the final results of the K compute nodes to be output, as the result of the whole composition block, to the result write-back unit, so that subsequent execution units can perform further computation based on these results.
  • In a possible implementation, the general computing instructions include general arithmetic-logic instructions, and the at least one general-purpose arithmetic unit includes an arithmetic logic unit ALU configured to receive the general arithmetic-logic instructions sent by the instruction scheduling unit and to perform the logic operations. Optionally, the general computing instructions include memory read/write instructions, and the at least one general-purpose arithmetic unit includes a load/store unit LSU configured to receive the memory read/write instructions sent by the instruction scheduling unit and to perform the memory read and write operations.
  • That is, the at least one general-purpose arithmetic unit may include an arithmetic logic unit, mainly used for logic operations on its inputs, and a load/store unit, used for memory read and write operations. These units sit at the execute-pipeline stage together with the graph computing flow unit and jointly complete the various kinds of computing tasks produced by decoding in the CPU; they can execute in parallel, serially, or partly in parallel, so as to complete the computing tasks of the processor more efficiently.
  • In a possible implementation, the graph computing control instructions include a data read/write instruction that carries a memory read/write address; the graph computing flow unit is further configured to read data from, or write data to, the load/store unit LSU according to the memory read/write address in the data read/write instruction.
  • In other words, the graph computing flow unit in the processor core can reuse the functions of the load/store unit in the processor core, reading or writing data through the LSU according to the read/write addresses carried in the relevant data read/write instructions.
  • In a possible implementation, the at least one general-purpose arithmetic unit further includes a floating-point unit FPU, and the graph computing task includes floating-point operations; the graph computing flow unit is further configured to send the floating-point data to the FPU for computation and to receive the computation result fed back by the FPU. Optionally, the at least one general-purpose arithmetic unit further includes a vector unit SIMD, and the graph computing task includes vector operations; the graph computing flow unit is further configured to send the vector data to the SIMD unit for computation and to receive the computation result fed back by the SIMD unit.
  • That is, the general-purpose arithmetic units in the embodiment of the present invention may also include a floating-point unit FPU and/or a vector unit SIMD, where the floating-point unit performs floating-point tasks that require higher data precision and the vector unit performs single-instruction multiple-data operations. Both sit at the execute-pipeline stage together with the graph computing flow unit, and a data transmission channel exists between the graph computing flow unit and these general-purpose arithmetic units.
  • When the graph computing flow unit, while processing a graph computing task, encounters computing sub-tasks involving floating-point operations or single-instruction multiple-data operations, it can send them over the corresponding data transmission channel to the matching general-purpose arithmetic unit for computation, without having to duplicate the corresponding processing units inside the graph computing flow unit, thereby greatly saving hardware area and overhead.
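  • The following minimal sketch (illustrative names and operation encodings; not the patent's instruction format) shows the idea of the shared data channel: when a node's operation is a floating-point or vector type, the graph computing flow unit forwards the operands to the existing FPU or SIMD unit instead of duplicating that hardware internally:

```python
# Sketch of the offload idea described above: the graph computing flow unit
# forwards floating-point and vector node operations to the core's existing
# FPU / SIMD units over a data channel. All names are illustrative.

def fpu_execute(op, a, b):
    return {"fadd": a + b, "fmul": a * b}[op]

def simd_execute(op, va, vb):
    # element-wise operation over two vectors (single instruction,
    # multiple data), reusing the scalar FPU per lane for illustration
    return [fpu_execute("f" + op[1:], x, y) for x, y in zip(va, vb)]

def gfu_execute_node(op, a, b):
    if op.startswith("f"):        # floating-point: delegate to the FPU
        return fpu_execute(op, a, b)
    if op.startswith("v"):        # vector: delegate to the SIMD unit
        return simd_execute(op, a, b)
    return {"add": a + b, "mul": a * b}[op]   # handled inside the GFU

print(gfu_execute_node("fmul", 1.5, 2.0))        # 3.0, via the FPU
print(gfu_execute_node("vadd", [1, 2], [3, 4]))  # [4, 6], via the SIMD unit
```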
  • In a second aspect, an embodiment of the present invention provides a processing method applied to a processor; the processor includes a processor core, and the processor core includes an instruction scheduling unit, a graph computing flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit. The method includes:
  • allocating, through the instruction scheduling unit, the general computing instructions among the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and the graph computing control instructions among the decoded instructions to be executed to the graph computing flow unit, where a general computing instruction instructs the execution of a general computing task and a graph computing control instruction instructs the execution of a graph computing task;
  • executing the general computing instructions through the at least one general-purpose arithmetic unit; and
  • executing the graph computing control instructions through the graph computing flow unit.
  • In a possible implementation, the processor core further includes an instruction fetch unit and an instruction decoding unit; the method further includes:
  • acquiring the target program to be executed through the instruction fetch unit, and decoding the target program through the instruction decoding unit to obtain the decoded instructions to be executed.
  • In a possible implementation, the processor core further includes a result write-back unit, and the graph computing flow unit and the at least one general-purpose arithmetic unit are each connected to the result write-back unit; the method further includes:
  • writing part or all of the first execution result of the general computing task and the second execution result of the graph computing task back to the instruction scheduling unit through the result write-back unit.
  • In a possible implementation, the processor further includes a memory unit; the graph computing flow unit includes N compute nodes; and the graph computing control instructions include a start-composition instruction that carries a target address in the memory unit.
  • Executing the graph computing control instructions through the graph computing flow unit includes: receiving, by the graph computing flow unit, the start-composition instruction, and reading the composition block information from the memory unit according to the target address, where the composition block information includes the operation method of each of the N compute nodes and the connection and sequence information between the N compute nodes.
  • In a possible implementation, the graph computing control instructions include a parameter-transfer instruction that carries the identifiers of M compute nodes and the input parameters corresponding to those identifiers, where the M compute nodes are some or all of the N nodes; executing the graph computing control instructions through the graph computing flow unit includes:
  • receiving, by the graph computing flow unit, the parameter-transfer instruction, and feeding the input parameters corresponding to the identifiers of the M compute nodes into the M compute nodes respectively.
  • In a possible implementation, the connection and sequence information between the N compute nodes includes the source node and destination node corresponding to each of L edges; executing the graph computing control instructions through the graph computing flow unit includes:
  • monitoring, by the graph computing flow unit, whether the input parameters required by each of the N compute nodes are ready; for a target compute node whose input parameters are ready, feeding the input parameters of the target compute node into the operation method corresponding to the target compute node to obtain its result; and, according to the source node and destination node corresponding to each of the L edges, passing the result of each edge's source node as an input parameter to the corresponding destination node.
  • In a possible implementation, the graph computing control instructions include a start-graph-computation instruction; executing the graph computing control instructions through the graph computing flow unit to obtain the execution result of the graph computing task includes: after the graph computing flow unit receives the start-graph-computation instruction, checking whether the composition block information that has been read matches the address of the pre-started composition block and whether the input parameters of the M compute nodes have been entered, and if so, starting to execute the graph computing task.
  • In a possible implementation, the method further includes: putting the processor core into a blocked state, through the instruction scheduling unit, after the graph computing flow unit receives the start-graph-computation instruction and before the graph computing task is completed.
  • In a possible implementation, the method further includes: sending a synchronize-execution-result instruction to the graph computing flow unit through the instruction scheduling unit, and putting the processor core into a blocked state after the graph computing flow unit receives the synchronize-execution-result instruction and before it completes the graph computing task.
  • In a possible implementation, the method further includes: taking the processor core out of the blocked state after the graph computing flow unit completes the graph computing task.
  • In a possible implementation, the processor core further includes a result write-back unit containing multiple registers, and the graph computing flow unit and the at least one general-purpose arithmetic unit are each connected to the result write-back unit; the graph computing control instructions include a parameter-return instruction that carries the identifiers of K compute nodes and the registers corresponding to those identifiers. Executing the graph computing control instructions through the graph computing flow unit to obtain the execution result of the graph computing task includes:
  • controlling, by the graph computing flow unit, the results of the K compute nodes to be sent to the corresponding registers in the result write-back unit.
  • In a possible implementation, the general computing instructions include general arithmetic-logic instructions, and the at least one general-purpose arithmetic unit includes an arithmetic logic unit ALU; executing the general computing instructions through the at least one general-purpose arithmetic unit includes: receiving, by the ALU, the general arithmetic-logic instructions sent by the instruction scheduling unit, and performing the logic operations.
  • In a possible implementation, the general computing instructions include memory read/write instructions, and the at least one general-purpose arithmetic unit includes a load/store unit LSU; executing the general computing instructions through the at least one general-purpose arithmetic unit to obtain the execution result of the general computing task includes: receiving, by the LSU, the memory read/write instructions sent by the instruction scheduling unit, and performing the memory read and write operations.
  • In a possible implementation, the graph computing control instructions include a data read/write instruction that carries a memory read/write address; the method further includes:
  • reading data from, or writing data to, the load/store unit LSU by the graph computing flow unit according to the memory read/write address in the data read/write instruction.
  • In a possible implementation, the at least one general-purpose arithmetic unit further includes a floating-point unit FPU, and the graph computing task includes floating-point operations; the method further includes: sending, by the graph computing flow unit, the floating-point data to the FPU for computation, and receiving the computation result fed back by the FPU.
  • In a possible implementation, the at least one general-purpose arithmetic unit further includes a vector unit SIMD, and the graph computing task includes vector operations; the method further includes: sending, by the graph computing flow unit, the vector data to the SIMD unit for computation, and receiving the computation result fed back by the SIMD unit.
  • The present application further provides a semiconductor chip, which may include the processor provided in any implementation of the first aspect above.
  • The present application further provides a semiconductor chip, which may include the processor provided in any implementation of the first aspect above, an internal memory coupled to the processor, and an external memory.
  • The present application further provides a system-on-chip (SoC) chip.
  • The SoC chip includes the processor provided in any implementation of the first aspect above, and an internal memory and an external memory coupled to the processor.
  • The SoC chip may consist of the chip alone, or may include the chip and other discrete devices.
  • The present application further provides a chip system, which includes the multi-core processor provided in any implementation of the first aspect above.
  • In a possible implementation, the chip system further includes a memory for storing the program instructions and data that are necessary for, or related to, the operation of the multi-core processor.
  • The chip system may consist of chips, or may include chips and other discrete devices.
  • The present application further provides a processing apparatus that has the function of implementing any of the processing methods of the second aspect above.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more modules corresponding to the above function.
  • The present application further provides a terminal; the terminal includes a processor, and the processor is the processor provided in any implementation of the first aspect above.
  • The terminal may also include a memory coupled to the processor, which stores the program instructions and data necessary for the terminal.
  • The terminal may also include a communication interface through which the terminal communicates with other devices or a communication network.
  • The present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the flow of the processing method described in any implementation of the second aspect above.
  • An embodiment of the present invention further provides a computer program; the computer program includes instructions that, when executed by a processor, enable the processor to perform the flow of the processing method described in any implementation of the second aspect above.
  • FIG. 1 is a schematic structural diagram of a processor provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of another processor provided by an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of another processor provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a process of comprehensive compilation and execution of source code provided by an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a calculation model of a graph computing flow unit provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a graph computing flow control instruction provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of an abstract model of computing nodes in a composition block provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of an abstract model of graph computing flow instructions provided by an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of abstracting code into a data flow diagram provided by an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a processing method provided by an embodiment of the present invention.
  • The term "component" used in this specification denotes a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • A component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer.
  • Both an application running on a computing device and the computing device itself can be components.
  • One or more components may reside in a process and/or a thread of execution, and a component may be located on one computer and/or distributed between two or more computers.
  • These components can execute from various computer-readable media having various data structures stored on them.
  • A component may communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet by way of the signal).
  • A graph (Graph) is an abstract data structure used to represent the association relationships between objects, described using vertices (Vertex) and edges (Edge): vertices represent objects, and edges represent the relationships between objects.
  • The superscalar processor architecture refers to a class of parallel operation that implements instruction-level parallelism within a single processor core. This technique achieves higher CPU throughput at the same CPU frequency.
  • SIMD (Single Instruction, Multiple Data): a single instruction operates on multiple data streams.
  • Instruction pipeline: to improve the efficiency with which the processor executes instructions, the handling of an instruction is divided into several small steps, each completed by a dedicated circuit. For example, suppose an instruction is executed in three stages, fetch, decode, and execute, each taking one machine cycle. Without pipelining, the instruction needs three machine cycles to execute; with pipelining, as soon as this instruction finishes its "fetch" and enters "decode", the next instruction can begin its "fetch", which improves the execution efficiency of the instruction stream.
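  • As a worked illustration of the three-stage example above (not taken from the patent): n instructions flowing through a k-stage pipeline take about k + (n - 1) cycles, instead of the k * n cycles needed without pipelining.

```python
# Worked illustration of the 3-stage pipeline example above:
# without pipelining, n instructions of k stages take k * n cycles;
# with pipelining, a new instruction enters every cycle once the pipe
# is full, so the total is k + (n - 1) cycles.

def cycles(n_instructions, k_stages=3, pipelined=True):
    if pipelined:
        return k_stages + (n_instructions - 1)
    return k_stages * n_instructions

print(cycles(1, pipelined=False), cycles(1))    # 3 3: no gain for one instr.
print(cycles(10, pipelined=False), cycles(10))  # 30 12: large gain for many
```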
  • An execution unit (Execution Unit, EU) is responsible for the execution of instructions; in effect, it has both the function of a controller and the function of an arithmetic unit.
  • A register file, also known as a register stack, is an array of multiple registers in the CPU, usually implemented with fast static random access memory (SRAM). This kind of RAM has dedicated read ports and write ports and can access different registers concurrently through multiple channels.
  • An integrated circuit (IC) is a kind of miniature electronic device or component.
  • Using a certain process, the transistors, resistors, capacitors, inductors, and other components and wiring required by a circuit are interconnected, fabricated on one or several small semiconductor wafers or dielectric substrates, and then encapsulated in a package, becoming a micro-structure with the required circuit function; that is, an IC chip is an integrated circuit formed by integrating a large number of microelectronic components (transistors, resistors, capacitors, and so on) onto a single chip.
  • Control-flow architecture (Control-Flow Architecture): the core idea is that instructions drive the operations, that is, the processor reads instructions in order according to the execution sequence of the program and then, according to the control information contained in each instruction, fetches the data and processes it.
  • The problem such a control-flow architecture faces is how to keep instructions driving execution without stalls while meeting the target clock frequency, so as to improve processor performance.
  • Technologies such as superscalar execution, very long instruction words, dynamic scheduling algorithms, and instruction prefetching have been developed to improve processor performance, but these technologies still carry a high performance overhead.
  • The data flow architecture (Dataflow Architecture) arose to solve the above problems.
  • The data flow architecture describes the dependencies between instructions explicitly at the instruction-set level, exposing the parallelism between instructions directly to the hardware.
  • A data flow architecture can be abstracted as a directed graph composed of N nodes, where a connection between two nodes represents a data flow; once all the inputs of a node are ready (Ready), the node can perform its operation and pass the result to the next node. Nodes that are not on the same path of the same graph can therefore run concurrently, which improves the parallelism of processing.
  • In practice, a traditional data flow architecture also needs to support control flow; the combination (data flow + control flow) is collectively referred to as a graph architecture (Graph Architecture).
  • It should be noted that the control flow in a general-purpose processor architecture mainly refers to execution instructions for general operations, whereas the control flow of a graph computing architecture mainly refers to the various graph computing control instructions within the graph (such as switch, gate, and predicate instructions).
  • Therefore, this application proposes to integrate the graph computing architecture (data flow + control flow) into the general-purpose processor architecture as an execution unit inside the processor core, namely the graph computing flow unit (Graph Flow Unit, GFU) of this application, which executes computing tasks synchronously or asynchronously with the other execution units. Further, this application designs, on the basis of the control-flow architecture of the general-purpose processor, the processor's general computing functions and the functions that control the operation of the graph computing flow unit, and designs the computing functions inside the graph computing flow unit on the basis of the (data flow + control flow) architecture suited to graph computing.
  • In this way, general computing tasks are still computed in the control-flow manner, while graph computing tasks (for example, hot loops and hot instruction sequences) are computed in the (data flow + control flow) manner, thereby realizing the acceleration of a general-purpose processor by the graph computing architecture.
  • Because the graph computing flow unit sits inside the processor core, it can communicate directly with the other functional modules or execution units in the core without going through other message channels or memory reads and writes, which greatly reduces the communication delay; at the same time, the processor core can better control and access the graph computing flow unit, and can therefore control synchronous or asynchronous operation between it and the other hardware units, improving the parallelism and computing efficiency of the processor.
  • Furthermore, because repetitive instruction sequences can be executed repeatedly inside the graph computing architecture (that is, the graph computing flow unit), the frequency and bandwidth with which the processor core fetches instructions from the memory unit are reduced, and the overhead of dependency checks between instructions, of jump prediction, and of register accesses is reduced as well; this makes effective use of the computing resources of the graph computing flow unit and further improves the operating efficiency and performance of the processor.
  • The processor architecture provided in this application can schedule the instructions suited to execution on the graph computing architecture onto the graph computing flow unit in the processor core, and dispatch the instructions not suited to the graph computing architecture onto the other general-purpose arithmetic units in the processor core; moreover, the processor can invoke the GFU alone or invoke the GFU and the other execution units concurrently, thereby solving the high switching overhead and poor parallelism of prior-art graph acceleration processor architectures such as SEED.
  • The embodiments of the present invention also provide a pipeline structure suited to the above processor architecture. The life cycle of an instruction in this pipeline structure may comprise: fetch pipeline → decode pipeline → schedule (issue) pipeline → execute pipeline → memory-access pipeline → write-back pipeline. That is, the pipeline structure divides the execution of an instruction into at least the following six stages.
  • Instruction fetch (Instruction Fetch) refers to the process of reading an instruction from memory.
  • Instruction decode (Instruction Decode) refers to the process of translating the instruction fetched from memory.
  • Instruction dispatch and issue (Instruction Dispatch and Issue) reads the registers to obtain the operands and, according to the type of the instruction, sends it to the corresponding execution unit (EU) for execution.
  • Instruction execute (Instruction Execute): after the instruction is decoded, the kind of computation it requires is known and the required operands have been read from the general register file; the instruction is then executed according to its type to complete the computation task. Instruction execution refers to the actual operation of the instruction: if it is an addition instruction, the operands are added; if it is a subtraction instruction, the subtraction is performed; if it is a graph computation instruction, the graph computation is performed.
  • Memory access (Memory Access) refers to the process by which memory-access instructions read data from memory or write data into memory, mainly executing load/store instructions.
  • Write back (Write Back) refers to the process of writing the result of instruction execution back into the general register file. For an ordinary arithmetic instruction, the result value comes from the computation in the "execute" stage; for a memory-read instruction, it comes from the data read from memory in the "memory access" stage.
  • Every instruction in the processor goes through the above operation steps, but different operation steps of multiple instructions can be executed at the same time, which speeds up the instruction flow as a whole and shortens program execution time.
  • It should be noted that the foregoing processor architecture and processor pipeline structure are only some exemplary implementations provided by the embodiments of the present invention; the processor architecture and processor pipeline structure of the embodiments of the present invention include, but are not limited to, the above implementations.
  • FIG. 1 is a schematic structural diagram of a processor provided by an embodiment of the present invention.
  • The processor 10 can be located in any electronic device, for example a computer, a mobile phone, a tablet, a personal digital assistant, a smart wearable device, a smart car, or a smart home appliance.
  • The processor 10 may specifically be a chip or a chipset, or a circuit board carrying a chip or chipset; the chip, chipset, or circuit board can work under the necessary software drive.
  • Specifically, the processor 10 may include at least one processor core 101, and the processor core 101 may include an instruction scheduling unit 1011, a graph computing flow unit 1012 connected to the instruction scheduling unit 1011, and at least one general-purpose arithmetic unit 1013.
  • The instruction scheduling unit 1011 runs at the issue pipeline stage of the processor core 101 and completes the scheduling and dispatch of the instructions to be executed; the graph computing flow unit 1012 and the at least one general-purpose arithmetic unit 1013 both run, as execution units (EU, also called functional units, FU) of the processor 10, at the execute pipeline stage (Execute Stage) and complete the various types of computing tasks.
  • Concretely, the processor 10 can, through the instruction scheduling unit 1011, allocate the graph computing tasks among the instructions to be executed directly to the graph computing flow unit 1012 for execution, realizing the acceleration of the general-purpose processor through the graph computing mode, and dispatch the general computing tasks among the instructions to the at least one general-purpose arithmetic unit 1013 for execution, realizing the general computing functions.
  • Depending on the computing task, the processor 10 may invoke only the graph computing flow unit 1012, invoke only the at least one general-purpose arithmetic unit 1013, or invoke the graph computing flow unit 1012 and the at least one general-purpose arithmetic unit 1013 at the same time to execute tasks in parallel.
  • It is understandable that the instruction scheduling unit 1011, the graph computing flow unit 1012, and the at least one general-purpose arithmetic unit 1013 can be connected by a bus or in other ways for direct communication, and the connection relationship shown in FIG. 1 does not constitute a limitation on the connection relationship between them.
  • FIG. 2 is a schematic structural diagram of another processor provided by an embodiment of the present invention.
  • The processor 10 may include multiple processor cores (take F as an example, F being an integer greater than 1), such as processor core 101, processor core 102, processor core 103, ..., processor core 10F.
  • The processor cores may be homogeneous or heterogeneous, that is, the structures of the processor cores (102, 103, ..., 10F) and of the processor core 101 may be the same or different, which is not specifically limited here.
  • Optionally, the processor core 101 may serve as the master processing core and the processor cores (102, 103, ..., 10F) as slave processing cores, and the master processing core and the (F-1) slave processing cores may be located on one or more chips (ICs).
  • The master processing core 101 and the (F-1) slave processing cores may be coupled and communicate over a bus or in other ways, which is not specifically limited here.
  • It should be noted that the pipeline structure may differ from core to core; the pipeline structure referred to in this application is the pipeline structure of the processor core 101, and does not specifically limit the pipeline structures of the other processor cores.
  • FIG. 3 is a schematic structural diagram of another processor provided by an embodiment of the present invention.
  • Further, the processor core 101 may include an instruction fetch unit 1015 and an instruction decoding unit 1016, which run at the fetch pipeline stage and the decode pipeline stage respectively and complete the corresponding instruction-fetch and instruction-decode functions.
  • The at least one general-purpose arithmetic unit 1013 may specifically include one or more of a load/store unit (LSU) 1013A, a floating-point unit (FPU) 1013B, a vector unit (SIMD) 1013C, and an arithmetic logic unit (ALU) 1013D.
  • The above general-purpose arithmetic units (1013A, 1013B, 1013C, 1013D) and the graph computing flow unit 1012 are all connected to the instruction scheduling unit 1011 and run, as execution units (EU) of the processor, at the execute pipeline stage. These execution units each receive the different types of instructions dispatched by the instruction scheduling unit 1011 and then, based on their different hardware structures, execute the types of computing tasks they are suited to.
  • Further, the processor 10 includes a memory unit 1017 outside the processor core 101, and the load/store unit (LSU) reads and writes data from and to the memory unit 1017, running at the memory-access pipeline stage; optionally, the processor core 101 further includes a result write-back unit 1014, which runs at the write-back pipeline stage and is responsible for writing the computation results of instructions back to the destination registers.
  • The memory unit 1017 is usually a volatile memory that loses its contents when power is removed; it may also be called memory (Memory) or main memory.
  • The memory unit 1017 serves as the storage medium for the temporary data of the operating system or other running programs in the processor 10.
  • The operating system running on the processor 10 transfers the data to be computed from the memory unit 1017 to the processor core 101 for computation, and the processor core 101 transmits the results back after the computation completes.
  • The memory unit 1017 may include one or more of dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), level-1 cache (L1 Cache), level-2 cache (L2 Cache), and level-3 cache (L3 Cache).
  • It should be noted that FIG. 1, FIG. 2, and FIG. 3 are only some exemplary implementations provided by the embodiments of the present invention, and the structure of the processor in the embodiments of the present invention includes, but is not limited to, the above implementations.
  • the functions specifically implemented by the processor 10 may include the following:
  • the instruction acquisition unit 1015 acquires the target program to be executed from the memory unit 1017; the instruction decoding unit 1016 decodes the target program according to a predetermined instruction format to obtain the decoded instruction to be executed.
  • the instruction scheduling unit 1011 receives decoded instructions to be executed.
  • the instructions to be executed include general calculation instructions and graph calculation control instructions; the general calculation instructions are used to instruct the execution of general calculation tasks, and the graph calculation control instructions are used to instruct the execution of graph calculation tasks.
  • The instruction scheduling unit 1011 sends the general calculation instructions to the at least one general calculation unit and sends the graph calculation control instructions to the graph calculation flow unit. The at least one general calculation unit 1013 receives and executes the general calculation instructions to obtain the execution results of the general calculation tasks; the graph calculation flow unit 1012 receives and executes the graph calculation control instructions to obtain the execution results of the graph calculation tasks. The at least one general calculation unit 1013 also sends the first execution result of the general calculation task to the result write-back unit 1014; the graph calculation flow unit 1012 also sends the second execution result of the graph calculation task to the result write-back unit 1014. The result write-back unit 1014 stores the first execution result and the second execution result, and writes part or all of the first execution result and the second execution result back to the instruction scheduling unit 1011.
  • FIG. 4 is a schematic diagram of a source code comprehensive compilation and execution process provided by an embodiment of the present invention.
  • a complex instruction sequence (such as an instruction sequence with complex correlations, indirect jumps, or interrupts), or an instruction sequence that is used only once, can be compiled in the general operation mode; repetitive instruction sequences, such as loops or repeatedly called functions (whose correlations can be complex or simple, but which usually need to be executed repeatedly), are compiled in the graph computing flow mode.
  • graph computing flow mode compilation refers to abstracting the logic involved in the code into a graph architecture: operations such as dependency checking, jumping, and prediction originally performed by the processor are all completed in the program compilation stage (that is, by the graph computing flow mode compiler), which generates binary machine instructions under the graph architecture. Because the instructions under the graph architecture contain the input/output relationships between the instructions, the GFU in the processor can greatly reduce the logical judgment between instructions at run time, which greatly saves overhead in the CPU core and yields good performance with low power consumption.
  • the object file is an .o file.
  • the above object file (such as the .o file) is mainly linked with the library to create an executable file. It is understandable that the compilation stages corresponding to 1, 2, and 3 above can be completed on a device (such as a server or a compiler) other than the device where the processor 10 is located, pre-compiled on the device where the processor 10 is located, or compiled at execution time on the device where the processor 10 is located; this is not specifically limited here.
  • the processor 10 will perform a series of operations such as instruction loading, instruction prefetching, instruction predecoding, and branch prediction.
  • the target program (for example, including code segment, data segment, BSS segment or stack, etc.) is loaded into the memory unit 1017.
  • the instruction fetching unit 1015 can fetch the above-mentioned target program from the memory unit 1017 by fetching one instruction at a time over multiple fetches, and each instruction then enters the instruction decoding unit 1016 from the instruction fetching unit 1015 for decoding.
  • the instruction decoding unit 1016 splits and interprets the instruction to be executed according to a predetermined instruction format to obtain micro-operation instructions, which are the decoded instructions to be executed in this application, and sends them to the instruction scheduling unit 1011.
  • After the instruction scheduling unit 1011 receives the decoded instructions to be executed, the instructions are distributed to the respective execution units (Execution Units) for calculation according to the type of each instruction, for example dispatched to the general operation unit 1013 or the graph calculation flow unit 1012. Since the graph calculation flow unit 1012 is set inside the processor core 101 of the processor 10, the instruction scheduling unit 1011 can directly connect and communicate with the graph calculation flow unit 1012, and thereby directly dispatch the identified graph calculation control instructions to the graph calculation flow unit 1012 without communicating through other message channels or memory read/write methods, which greatly reduces the communication delay.
  • the general calculation instructions and graph calculation control instructions in this application can be identified by different identification bits (the identification bits can be added in the above compilation stage); that is, different types of instructions can correspond to different instruction IDs, so that the instruction scheduling unit 1011 can identify them according to the instruction ID.
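  • As a rough software analogy of this dispatch-by-identifier scheme (a minimal sketch only; the tag values, field names, and unit names below are assumptions for illustration and are not defined by this application), the following Python fragment routes decoded instructions either to the graph calculation flow unit or to a general arithmetic unit according to a type identifier attached at compile time:

      # Sketch of type-tag-based dispatch; tag values and unit names are
      # illustrative assumptions, not defined by this application.
      GENERAL_CALC = 0  # identification bit for general calculation instructions
      GRAPH_CTRL = 1    # identification bit for graph calculation control instructions

      def dispatch(decoded_instructions, general_units, graph_flow_unit):
          """Route each decoded instruction to the execution unit for its type."""
          for inst in decoded_instructions:
              if inst["type_id"] == GRAPH_CTRL:
                  # The GFU sits inside the core, so dispatch is a direct
                  # hand-off with no message channel or memory round-trip.
                  graph_flow_unit.append(inst)
              else:
                  general_units[inst["unit"]].append(inst)

      # Usage: one instruction of each class.
      units = {"ALU": [], "LSU": []}
      gfu = []
      dispatch([{"type_id": GENERAL_CALC, "unit": "ALU", "op": "add"},
                {"type_id": GRAPH_CTRL, "op": "gfb", "addr": 0x600960}],
               units, gfu)
      print(units, gfu)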
  • the graph calculation flow unit 1012 receives and executes graph calculation control instructions to obtain the execution result of the graph calculation task; one or more general calculation units 1013 receive and execute the general calculation instructions to obtain the execution result of the general calculation task.
  • the graph computing flow unit 1012 and the general arithmetic unit 1013 can execute instructions in parallel or serially, depending on the logical relationship between the instructions executed by these execution units in the target program; this is not specifically limited in the embodiment of the present invention.
  • both the graph calculation flow unit 1012 and the general arithmetic unit 1013 can send the calculation result to the result write-back unit 1014, and the result write-back unit 1014 can feed back part or all of the calculation result to the instruction scheduling unit 1011.
  • the first execution result or the second execution result can also be written directly into the memory unit 1017, or written into the memory unit 1017 through the memory read/write unit 1013A, so that the relevant execution unit (such as the graph calculation flow unit 1012 or the memory read/write unit 1013A) can obtain the required parameters from the corresponding storage location.
  • since the graph calculation flow unit 1012 is provided in the processor core 101 of the processor 10, the processor core 101 has the authority and conditions to obtain the relevant calculation states of the graph calculation flow unit 1012 and the other general calculation units 1013 (for example, the above-mentioned first execution result and second execution result), and can therefore control them to run synchronously or asynchronously, which improves the operating efficiency of the processor.
  • the graph computing flow unit 1012, like the other general arithmetic units, receives the liveIn data on the registers sent from the instruction scheduling unit 1011 (including, for example, instruction issue and reservation stations), and passes the inputs to the corresponding computing nodes in the graph computing flow unit 1012. Likewise, the graph calculation flow unit 1012 writes the liveOut output data back to the result write-back unit 1014 (including, for example, registers and the reorder buffer), thereby writing the outputs of the graph into the corresponding output registers and into the instructions on the reservation stations that depend on the graph.
  • FIG. 5 is a schematic diagram of a calculation model of a graph calculation flow unit provided by an embodiment of the present invention.
  • the theoretical calculation model of Graphflow in this application can be abstracted into N fully connected calculation nodes (corresponding to the vertices of a graph). One instruction can be placed in each node, which performs one operation and passes the result to itself or to other nodes.
  • the Graphflow theoretical calculation model can be divided into two repeatedly switching stages: a composition stage and an execution stage.
  • each node in the composition block (1-b in Figure 5) is configured with an operation instruction and at most two target nodes. Assuming that N equals 16, 1-b in Figure 5 has 16 computing nodes, numbered 0, 1, 2, 3, 4, ..., 15. Once composition is completed (1-b in Figure 5), the operations and connections of each node are solidified (read only).
  • the arithmetic instruction in computing node 0 is an add instruction, that is, an addition operation;
  • the arithmetic instruction in computing node 2 is an sll instruction, that is, a shift operation;
  • the arithmetic instruction in computing node 3 is an xor instruction, that is, an exclusive OR operation.
  • the calculation results of computing node 0 and computing node 1 are used as the inputs of a downstream computing node to perform the ld operation (that is, a load operation);
  • the calculation results of computing node 2 and computing node 3 are used as the inputs of a downstream computing node to perform the add operation (that is, an addition operation), and so on; the calculation processes of the other computing nodes are not described one by one.
  • once the input (LiveIn) is received from the external module, the data flow starts. All computing nodes run in parallel. For each node (1-d in Figure 5), as long as its inputs arrive, it can perform its calculation and pass the result to the next computing node; if the inputs have not arrived, it stays in an idle waiting state. The operation continues until the data stream reaches the end node (tm).
  • for some computing nodes (such as computing nodes 0, 1, 2, and 3), the startup data needs to be input from the external memory unit 1017 (1-e in Figure 5), while the other computing nodes (such as computing nodes 5, 6, 8, 9, 10, 11, 12, 13, 14, 15) need to obtain, internally, the calculation results output by the computing nodes connected to them before they can perform their operations and pass their results on to the computing nodes associated with them.
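  • The firing rule just described, where a node runs as soon as its inputs arrive and forwards its result along its edges, can be pictured with a toy Python simulator. The node operations and graph shape below are invented for illustration and only loosely follow Figure 5; a real GFU would, among other things, fire nodes in hardware beats rather than in a software loop:

      # Toy dataflow simulator: each node fires once both inputs are
      # present, then forwards its result to its target nodes' input slots.
      import operator

      OPS = {"add": operator.add, "sll": lambda a, b: a << b, "xor": operator.xor}

      def run_graph(nodes, live_in):
          # nodes: {id: {"op": name, "dests": [(target_id, "l" or "r")]}}
          # live_in: {(node_id, slot): value} inputs arriving from outside the GFU
          inputs = dict(live_in)
          done, results = set(), {}
          while len(done) < len(nodes):
              progressed = False
              for nid, node in nodes.items():
                  if nid in done:
                      continue
                  l, r = inputs.get((nid, "l")), inputs.get((nid, "r"))
                  if l is None or r is None:
                      continue  # inputs not yet arrived: node stays idle
                  results[nid] = OPS[node["op"]](l, r)
                  for tid, slot in node["dests"]:
                      inputs[(tid, slot)] = results[nid]  # pass result downstream
                  done.add(nid)
                  progressed = True
              if not progressed:
                  raise RuntimeError("deadlock: some node never receives its inputs")
          return results

      # Nodes 0 and 1 get LiveIn data and both feed node 2 (shapes made up).
      nodes = {0: {"op": "add", "dests": [(2, "l")]},
               1: {"op": "sll", "dests": [(2, "r")]},
               2: {"op": "xor", "dests": []}}
      print(run_graph(nodes, {(0, "l"): 1, (0, "r"): 2, (1, "l"): 1, (1, "r"): 3}))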
  • when the instruction scheduling unit 1011 in the processor 10 schedules graph calculation control instructions to the graph calculation flow unit 1012 to execute graph calculation tasks, a variety of control instructions with different functions are involved, which instruct the graph calculation flow unit 1012 to perform the corresponding graph calculation functions.
  • the graph calculation control instructions provided in this application mainly include a start composition instruction, a parameter transfer instruction, a start graph calculation instruction, and a return parameter instruction. The features and functions of these instructions are described in detail below:
  • the processor 10 further includes a memory unit 1017; the graph calculation flow unit 1012 includes N computing nodes; the graph calculation control instruction includes a start composition instruction, and the start composition instruction carries a target address in the memory unit 1017. The graph calculation flow unit 1012 receives the start composition instruction and reads the composition block information from the memory unit 1017 according to the target address, where the composition block information includes the calculation method of each of the N computing nodes, and the connection and sequence information between the N computing nodes.
  • that is, the graph calculation control instruction received by the graph calculation flow unit is specifically a start composition instruction, which instructs the graph calculation flow unit to read, according to the target address carried in the instruction, the composition block information stored in the memory unit 1017 outside the processor core 101. The composition block information includes the calculation method corresponding to each of the N calculation nodes in the graph calculation flow unit, and the dependency relationships between the N calculation nodes, that is, the relationships between the calculation results and input conditions of associated calculation nodes (the two calculation nodes corresponding to an edge in the graph), corresponding to the N fixed stream instructions in the graph calculation model of Figure 5 above.
  • based on the composition block information, the graph calculation flow unit 1012 can complete the calculation of a complete composition block. It should be noted that the above-mentioned composition block may be one of, or all of, the composition blocks in the graph calculation; that is, a complete graph calculation task may include one composition block or multiple split composition blocks.
  • FIG. 6 is a schematic diagram of a graph calculation flow control instruction provided by an embodiment of the present invention.
  • the start composition instruction is gfb 0x600960, where gfb is the operation code and 0x600960 is the operand, a target address in the memory unit 1017. Based on this instruction, the graph calculation flow unit 1012 can obtain the composition block information corresponding to address 0x600960 from the memory unit 1017 to start composition. That is, the instruction scheduling unit 1011 sends a pointer to the address of the graph calculation instructions to be executed to the graph calculation flow unit 1012, which reads the composition block information from the memory unit 1017.
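  • In software terms, a gfb-style start-composition step might look like the following sketch (the block layout and all field names are assumptions for illustration; the patent does not fix a binary format for the composition block):

      # Illustrative model of a composition block fetched by a gfb-style
      # start-composition instruction; field names are assumptions.
      composition_blocks = {
          0x600960: {
              "ops": {0: "add", 1: "add", 2: "sll", 3: "xor"},  # per-node operation
              "edges": [(0, 4, "l"), (1, 4, "r")],              # (src, dst, input slot)
          }
      }

      def start_composition(gfu_state, target_address):
          """gfb <address>: configure (solidify) the GFU nodes and connections."""
          block = composition_blocks[target_address]   # read from the memory unit
          gfu_state["ops"] = dict(block["ops"])        # read-only once composition is done
          gfu_state["edges"] = list(block["edges"])
          gfu_state["block_address"] = target_address  # checked later before execution
          return gfu_state

      print(start_composition({}, 0x600960))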
  • the graph computing control instruction includes a parameter transfer instruction, and the parameter transfer instruction carries the identifiers of M computing nodes and the input parameters corresponding to the identifiers of the M computing nodes.
  • the M computing nodes are some or all of the N computing nodes; the graph computing flow unit is configured to receive the parameter transfer instruction and input the input parameters corresponding to the identifiers of the M computing nodes into the M computing nodes respectively.
  • the parameter transfer instructions include gfmov x0, 1r, which means that the parameter value in register x0 is used as the right input parameter of computing node 1, and gfmov x1, 10l, which means that the parameter value in register x1 is used as the left input parameter of computing node 10, and so on; they are not listed one by one here.
  • that is, the graph calculation control instruction received by the graph calculation flow unit includes a parameter transfer instruction, and the parameter transfer instruction contains the initial input parameters required by multiple computing nodes in the calculation process of a composition block (for example, computing nodes 0, 1, 2, and 3 in FIG. 5 above). After these computing nodes obtain the corresponding parameters from outside the graph computing flow unit, the graph computing flow unit meets the conditions for starting to execute the graph computing task, that is, graph calculation can begin.
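  • Continuing the same software model, a gfmov-style parameter transfer can be sketched as follows (the register names and the input-buffer layout are illustrative assumptions):

      # gfmov xN, <node><slot>: copy a register value into a node's input buffer.
      registers = {"x0": 7, "x1": 42}
      node_inputs = {}  # {(node_id, "l" or "r"): value}

      def gfmov(reg, node_id, slot):
          """Place the value of `reg` into the left or right input of `node_id`."""
          node_inputs[(node_id, slot)] = registers[reg]

      gfmov("x0", 1, "r")   # like 'gfmov x0, 1r' in the example above
      gfmov("x1", 10, "l")  # like 'gfmov x1, 10l'
      print(node_inputs)    # once all M nodes are fed, graph execution may start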
  • the graph calculation control instruction includes a start graph calculation instruction; after receiving the start graph calculation instruction, the graph calculation flow unit 1012 determines whether the current composition has been completed, and if so, starts to execute the graph computing task. Specifically, in a possible implementation manner, after the graph calculation flow unit 1012 receives the start graph calculation instruction, it checks whether the composition block information it has read is consistent with the pre-started composition block address, and determines whether the input parameters of the M computing nodes have all been input; if they are consistent and have been input, it starts to execute the graph computing task.
  • the processor 10 controls the graph calculation flow unit 1012 through the above-mentioned start graph calculation instruction to start executing graph calculation tasks, specifically including the following two control methods:
  • After the graph calculation flow unit 1012 receives the start graph calculation instruction, it determines whether the current composition has been completed; if it is completed, it starts to execute the graph calculation task. Further, after the graph calculation flow unit 1012 receives the start graph calculation instruction and before it completes the graph calculation task, the instruction scheduling unit 1011 controls the processor core 101 to enter the blocked state, and after the graph calculation flow unit 1012 completes the graph calculation task, it controls the processor core 101 to exit the blocked state.
  • the processor 10 may start the execution phase of the graph calculation flow unit 1012 through a gfe (graph flow execute) instruction. If the composition of the graph calculation flow unit 1012 is not completed, gfe waits for composition to complete before starting the graph calculation flow unit 1012. During the execution stage of the graph calculation flow unit 1012, the other units of the processor core 101 are in a power-gated (Power Gate) state and perform no other operations; the only running unit is the interrupt and exception unit of the processor core 101. Therefore, the processor core 101 enters a blocked state after executing gfe. If the composition is wrong, or the execution is wrong, gfe generates a corresponding exception. Only when the execution of the graph calculation flow unit 1012 ends can the CPU instructions after gfe continue to execute, including the return-parameter instruction gfmov.
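  • A minimal software sketch of this synchronous start-and-block behavior follows; the function names, the exception model, and the rebuild step are assumptions for illustration, not the patent's definition of gfe:

      # Synchronous start (gfe-like): the core pipeline blocks until the graph
      # finishes. All names and the error model are illustrative assumptions.
      def gfe(core, gfu, expected_block_address, execute_graph, rebuild):
          if gfu.get("block_address") != expected_block_address:
              rebuild(gfu, expected_block_address)   # recompose before executing
          core["blocked"] = True                     # pipeline stalls; other units idle
          try:
              return execute_graph(gfu)              # runs until the end node (gfterm)
          finally:
              core["blocked"] = False                # commit gfe, restart the pipeline

      # Usage with stand-in callables for the GFU behavior.
      core, gfu = {"blocked": False}, {"block_address": 0x600960}
      result = gfe(core, gfu, 0x600960,
                   execute_graph=lambda g: "graph result",
                   rebuild=lambda g, a: g.update(block_address=a))
      print(result, core)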
  • the graph calculation flow unit 1012 is triggered to check whether the GFU composition is completed and whether the previously started composition block address <GBB_address> is the same as that of the composition block to be executed. If the composition block is inconsistent, the composition unit needs to be restarted to rebuild the graph; if the composition is the same, graph calculation can be started immediately.
  • the instruction can block the pipeline of the processor core 101 until the entire graph calculation is completed, so the other arithmetic units of the processor core 101 cannot operate on instructions after the gflow instruction. This instruction can be used to switch between the general operation mode and the graph calculation mode.
  • This instruction can also be used when the processor, in order to reduce energy consumption, operates using only the GFU.
  • the data flow and control flow in the graph flow downwards according to the program definition.
  • when the graph flow reaches the end node gfterm of the graph, the calculation of the graph ends.
  • gfterm causes the processor instruction gflow to enter the commit phase and restarts the pipeline of the processor core 101.
  • the processor core can start the graph calculation function in a synchronous manner (that is, the graph calculation flow unit and the other general arithmetic units execute tasks serially). While the graph calculation flow unit executes the graph computing task, the pipeline of the processor core is blocked and does not exit the blocked state until the graph calculation flow unit completes the graph calculation task, ensuring that during this period only the graph calculation flow unit operates and the other operation units cannot, thereby reducing the power consumption of the CPU.
  • This instruction can switch the calculation mode between other arithmetic units in the processor and the graph calculation flow unit, and can be applied to a program of synchronous operation.
  • After the graph calculation flow unit 1012 receives the start graph calculation instruction, it determines whether the current composition has been completed; if it is completed, it starts to execute the graph calculation task. Further, the instruction scheduling unit 1011 also sends a synchronize-execution-result instruction to the graph calculation flow unit 1012; after the graph calculation flow unit 1012 receives the synchronize-execution-result instruction and before the graph calculation task is completed, the processor core 101 is controlled to enter the blocked state, and after the graph calculation flow unit 1012 completes the graph calculation task, the processor core 101 is controlled to exit the blocked state.
  • the processor 10 may asynchronously start the execution phase of the graph calculation flow unit 1012 through a gff (graph flow fork) instruction. If the composition of the graph calculation flow unit 1012 is not completed, gff waits for composition to complete before starting the graph calculation flow unit 1012. While gff starts the execution of the graph calculation flow unit 1012, the other arithmetic units of the processor core 101 can perform other operations, so gff does not occupy resources in the ROB. After the asynchronous execution, the processor synchronizes the execution result of the graph calculation flow unit 1012 through the gfj (graph flow join) instruction. Only when the execution of Graphflow ends can the CPU instructions after gfj continue to execute, including the return-parameter instruction gfmov.
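  • This fork/join behavior can be pictured with a worker thread standing in for the graph calculation flow unit; the analogy and every name in this Python sketch are illustrative assumptions rather than the patent's hardware semantics:

      # Asynchronous start (gff) and result synchronization (gfj) modeled
      # with a worker thread standing in for the GFU; purely an analogy.
      import threading

      class GraphFlowUnit:
          def __init__(self):
              self._thread = None
              self.result = None

          def gff(self, graph_task):
              """Fork: start graph execution; the CPU pipeline is NOT blocked."""
              self._thread = threading.Thread(
                  target=lambda: setattr(self, "result", graph_task()))
              self._thread.start()

          def gfj(self):
              """Join: return at once if done, otherwise block until the graph ends."""
              self._thread.join()
              return self.result

      gfu = GraphFlowUnit()
      gfu.gff(lambda: sum(range(1000)))  # graph runs while the core does other work
      print("core keeps executing other instructions...")
      print("graph result:", gfu.gfj())  # instructions after gfj may now proceed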
  • two new CPU instructions are added to the instruction set to start the parallel operation of the GFU and the other operation units of the processor core 101: the instruction gffork <GBB_address> and the instruction gfjoin <GBB_address>.
  • the gffork instruction first checks whether the GFU composition is completed and whether the previously activated composition block address <GBB_address> is consistent with the composition block to be executed. If the composition block is inconsistent, the composition unit needs to be restarted to rebuild the graph; if the composition is the same, graph calculation can be started immediately.
  • the gffork instruction does not block the pipeline of the CPU, so other modules of the CPU can be executed asynchronously with the graph calculation.
  • when gfjoin is executed, if the graph calculation has already completed, gfjoin returns immediately; if the graph calculation is still not completed, gfjoin blocks the CPU pipeline until the graph calculation is completed.
  • the processor core can start the graph calculation function in an asynchronous manner (that is, tasks can be executed in parallel between the graph calculation flow unit and the other general arithmetic units). While the graph calculation flow unit executes the graph calculation task, the pipeline of the processor core is not blocked and the other arithmetic units can operate normally, until the processor sends a synchronize-execution-result instruction to the graph calculation flow unit through the instruction scheduling unit (for example, when the calculations of other arithmetic units require the execution result of the graph calculation flow unit). If the graph calculation flow unit has not yet completed the graph calculation task at that point, the pipeline of the processor is blocked, and the blocked state is not exited until the graph calculation flow unit completes the graph calculation task and feeds back the execution result. This ensures that when other computing units need the execution result of the graph calculation flow unit, they wait for the graph calculation flow unit to feed back that result before continuing execution, thereby improving the parallelism of the processor core.
  • This instruction enables a parallel computing mode between the other computing units in the processor and the graph computing flow unit, and is applicable to asynchronously operating programs.
  • the embodiment of the present invention also provides an implementation manner of triggering the graph calculation through the judgment of the graph calculation flow unit 1012 itself.
  • the composition block information includes the operation method of each of the N computing nodes and the connection and sequence information between the N computing nodes, where the connection and sequence information includes the source node and the destination node corresponding to each of L edges. The graph computing flow unit 1012 monitors whether the input parameters required by each of the N computing nodes are ready; for a target computing node whose input parameters are ready, it inputs the input parameters of the target computing node into the calculation method corresponding to that node to obtain a calculation result; and, according to the source nodes and destination nodes corresponding to the L edges, it inputs the calculation result of the source node of each edge into the corresponding destination node as an input parameter.
  • the graph includes multiple nodes and edges connecting each node, and an edge includes the source node, the destination node and the relationship between the source node and the destination node that constitute the edge.
  • the graph computing architecture in this application abstracts the data flow and the control flow program into a graph composed of N nodes, and the connection between the nodes represents a data flow (Dataflow) or a control flow (ControlFlow).
  • Each node acts as a graph instruction. Once the input required by a certain graph instruction is ready, the current instruction can perform operations and pass the result to the corresponding input of the next instruction.
  • Figure 7 is a schematic diagram of an abstract model of computing nodes in a composition block provided by an embodiment of the present invention.
  • once a node's inputs (which may include a left input, a right input, and a conditional input p) are ready, the operation can be performed and the result of the operation is passed to the input of the corresponding node below.
  • the add operation a+b in instruction 1 can be passed to the left input of instruction 4.
  • in this application this can be expressed as "1 add 4l", which means that for instruction 1, once its inputs are ready, the addition is performed and the result is passed to the left input of instruction 4.
  • this application only needs to provide the output addresses of an instruction, not its input information; it is only necessary to ensure that each instruction's inputs are produced by one or more other instructions. Therefore, the graph-architecture instruction set encoding in the present application makes the graph calculation process simpler and faster.
  • the ideal hardware required to execute this graph would give each node its own computing unit (Process Engine, PE) and, through an ideal N-to-N shared bus (Crossbar), transmit each result to the corresponding computing node of the next level in the next beat.
  • This kind of N-to-N Crossbar is difficult to realize. Therefore, in a real hardware design, in a possible implementation manner, the embodiment of the present invention defines that P instructions share X computing nodes; that is, in each beat, at most X instructions (whose inputs must be ready) are selected from the P instructions to operate simultaneously.
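  • One way to read the "P instructions share X computing nodes" constraint is as a per-beat scheduler that issues at most X ready instructions; the oldest-first selection policy in this Python sketch is an assumption, as the patent does not specify how ties are broken:

      # Per-beat scheduling sketch: from P resident instructions, issue at
      # most X whose inputs are ready. Oldest-first policy is an assumption.
      def schedule_beat(instructions, ready, X):
          """Return ids of up to X ready instructions to issue this beat."""
          issuable = [i for i in instructions if ready(i)]
          return issuable[:X]   # X computing nodes shared among P instructions

      P = list(range(8))               # 8 resident graph instructions
      inputs_ready = {0, 2, 3, 5, 7}   # instructions whose inputs have arrived
      print(schedule_beat(P, lambda i: i in inputs_ready, X=4))  # -> [0, 2, 3, 5]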
  • in other words, a computing node can start graph calculation once its calculation method and input parameters are prepared. While some computing nodes (such as the source nodes of edges) obtain initial input parameters from outside the graph computing flow unit, other computing nodes (such as the destination nodes of edges) may need to wait for the computing nodes associated with them (such as their source nodes) to finish calculating, after which those calculation results serve as their input parameters. Therefore, the start times of the computing nodes may differ; but for each computing node, calculation can start once its calculation method and input parameters (which may include left input parameters, right input parameters, or conditional parameters) are ready.
  • the processor core 101 further includes a result write-back unit; the graph calculation flow unit 1012 and the at least one general arithmetic unit are respectively connected to the result write-back unit 1014. The graph calculation control instruction includes a return parameter instruction, which carries the identifiers of K computing nodes and the result registers corresponding to the identifiers of the K computing nodes; the graph computing flow unit 1012 is specifically configured to send the calculation results of the K computing nodes to the result write-back unit 1014 respectively.
  • that is, in the process of executing the graph calculation task, some computing nodes may need to output their final calculation results to the result write-back unit after calculation is completed. The graph calculation flow unit may therefore, according to the identifiers of the K calculation nodes carried in the received return parameter instruction, control the final calculation results of the K calculation nodes to be output, as the calculation results of the entire composition block, to the result write-back unit, so that subsequent execution units can perform further calculations.
  • the result write-back unit 1014 specifically includes a reorder buffer (Reorder Buffer), which is used to record the original program order of instructions that are executed out of order.
  • the result write-back unit 1014 also includes register groups, such as general-purpose registers and special-purpose registers.
  • the general register group is used to save the operands participating in operations and the intermediate results; the special-purpose registers are usually status registers, which cannot be changed by the program and are controlled by the processor itself to indicate certain states.
  • the general arithmetic unit 1013 in the processor 10 of the present application may include multiple types of hardware execution units to execute or accelerate different types of computing tasks, mainly including one or more of a memory read/write unit 1013A (LSU), a floating point operation unit 1013B (FPU), a vector operation unit 1013C (SIMD), and an arithmetic logic unit 1013D (ALU).
  • the general operation instruction includes a general operation logic instruction or a memory read/write instruction;
  • the at least one general operation unit includes: an arithmetic logic unit 1013D (ALU), configured to receive the general operation logic instructions sent by the instruction scheduling unit 1011 and perform logical operations; or the memory read/write unit 1013A (LSU), configured to receive the memory read/write instructions sent by the instruction scheduling unit 1011 and perform memory read/write operations.
  • the Arithmetic and Logic Unit (ALU) mainly completes fixed-point arithmetic operations (addition, subtraction, multiplication, and division), logical operations (AND, OR, NOT, XOR), and shift operations on binary data.
  • Mathematical operations such as addition, subtraction, multiplication, and division, and logical-operation instructions such as OR, AND, ASL, and ROL are executed in the arithmetic logic unit.
  • the arithmetic logic unit affects the processor's performance in operations such as compression and decompression, computer process scheduling, compiler syntax analysis, computer-aided circuit design, and game AI processing.
  • the Load Store Unit (LSU) is used to calculate addresses. Instructions that access memory (generally referred to as load/store instructions) usually carry the memory address they want to use, and the LSU is responsible for processing these instructions and calculating the addresses they carry. Using a dedicated LSU to calculate the addresses of memory-type instructions allows the LSU to execute in parallel with the other execution units, which improves the execution efficiency of memory-access instructions and the performance of the processor.
  • that is, the at least one general arithmetic unit may include an arithmetic logic unit 1013D and a memory read/write unit, where the arithmetic logic unit is mainly used for logical operations on the inputs and the memory read/write unit is used for data read/write operations. These units are in the execution pipeline stage together with the graph calculation flow unit and jointly complete the various types of calculation tasks decoded in the CPU; they can be executed in parallel, serially, or partially in parallel and partially serially, in order to complete the computing tasks of the processor more efficiently.
  • the embodiment of the present invention embeds the directed graph flow computing architecture (Graphflow) as a module in the superscalar processor, and reuses the existing arithmetic units in the superscalar processor core to achieve better performance and lower energy consumption.
  • the graph calculation control instruction includes a data read/write instruction, and the data read/write instruction carries a memory read/write address; the graph calculation flow unit 1012 is further configured to read data from the memory read/write unit 1013A (LSU) or write data to the memory read/write unit 1013A (LSU) according to the memory read/write address in the data read/write instruction.
  • the graph calculation flow unit 1012 can read the instructions, parameters, and other data required for graph calculation from the memory read/write unit 1013A (LSU) through related read (Load) or write (Store) instructions, or write the execution results of the graph calculation into the memory read/write unit 1013A (LSU); different operations can be performed according to the specific instruction content in the target program.
  • the data read from the memory read/write unit 1013A is actually data that the memory read/write unit 1013A has read from the memory unit 1017, and the data written to the memory read/write unit 1013A (LSU) is actually written onward from the memory read/write unit 1013A into the memory unit 1017.
  • the graph calculation flow unit 1012 can also, according to the graph calculation control instruction, directly read data from the memory unit 1017 or write the execution result directly into the memory unit 1017, depending on the specific instructions in the target program to be executed. That is, the graph calculation flow unit 1012 can, according to the graph calculation control instructions, obtain data either from the memory read/write unit 1013A or from the memory unit 1017, and can likewise write data either to the memory read/write unit 1013A or to the memory unit 1017.
  • that is, the graph calculation flow unit in the processor core 101 can multiplex the functions of the memory read/write unit in the processor core 101, reading or writing data from or to the memory read/write unit LSU according to the read/write addresses in the relevant data read/write instructions.
  • the at least one general arithmetic unit further includes a floating-point arithmetic unit 1013B (FPU) or a vector arithmetic unit 1013C (SIMD), and the graph calculation task includes floating-point arithmetic or vector arithmetic. The graph calculation flow unit 1012 is further configured to send the data of the floating-point operation to the floating-point operation unit FPU for calculation and receive the calculation result fed back by the FPU, or to send the data of the vector operation to the vector operation unit 1013C (SIMD) for calculation and receive the calculation result fed back by the SIMD.
  • the floating point operation unit 1013B (Floating Point Unit, FPU) is mainly responsible for floating point operations and high-precision integer operations.
  • Floating-point computing power is an important indicator for the CPU's multimedia applications, audio and video encoding and decoding, and image processing/3D graphics processing, and it also affects the CPU's scientific computing performance, for example in fluid mechanics and quantum mechanics.
  • Single instruction, multiple data (Single Instruction Multiple Data, SIMD), here also called the vector operation unit 1013C, is a technology that realizes data-level parallelism.
  • the vector operation unit 1013C performs multiple operations simultaneously in a single instruction: a vector instruction starts a group of data operations in which data loading, storage, and calculation proceed in pipelined form. It is suitable for large numbers of fine-grained, homogeneous, independent data operations, such as multimedia, big data, and artificial intelligence application scenarios.
  • the memory reads and writes of the graph computing flow unit 1012 can reuse the memory read/write unit 1013A (LSU) in the processor 10, and floating-point and complex vector operations multiplex the operation logic of the FPU and SIMD units.
  • that is, the general arithmetic unit in the embodiment of the present invention may also include a floating-point arithmetic unit FPU and/or a vector arithmetic unit 1013C (SIMD), where the floating-point arithmetic unit is used for floating-point operations that require higher data precision and the vector computing unit 1013C is used for single-instruction, multiple-data-stream computations. Because the above general computing units (including some dedicated computing units) and the graph computing flow unit are all in the execution pipeline stage, there are data transmission channels between them and the graph computing flow unit. When the graph computing flow unit is processing a graph computing task that involves floating-point operations or single-instruction multiple-data operations, the task can be sent through the corresponding data transmission channel to the corresponding general arithmetic unit for calculation, so there is no need to duplicate the corresponding processing unit inside the graph calculation flow unit, which greatly saves hardware area and overhead.
  • this application also defines the basic format of a flow instruction in the Graphflow instruction-set architecture (Graphflow ISA), that is, the calculation method of each of the N calculation nodes contained in the composition block information of this application, together with the connection and sequence information between the N calculation nodes. The format of an execution instruction executed by a single calculation node can be expressed as: [ID + opcode + dest0ID + dest1ID].
  • FIG. 8 is an abstract model of graph computing flow instructions provided by an embodiment of the present invention.
  • ID-based flow instructions will be placed on the computing node whose sequence number is ID.
  • the range of ID is [0, N-1], and N is the total number of nodes in Graphflow.
  • a flow instruction can express one or two dependencies, indicating that the result data is passed to dest0ID and dest1ID.
  • each computing node of Graphflow can place one instruction and up to two outputs.
  • Each computing node has its own left input (l) and right input (r) buffers, an operation code (opcode), and two target pointers (dest0T, dest1T, where T represents the left or right input of the target instruction). Since the N nodes are assumed to be fully connected, the range of dest is [0, N-1]; the output of any node can point to the left input (l) or right input (r) buffer of any node.
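  • A hedged sketch of packing and unpacking the [ID + opcode + dest0ID + dest1ID] format follows; the field widths, the opcode numbering, and the omission of the left/right slot bit from the destination fields are invented for illustration, since the application fixes no binary layout:

      # Pack/unpack a flow instruction [ID + opcode + dest0ID + dest1ID].
      # Field widths (6-bit ids for N <= 64, 8-bit opcode) are assumptions;
      # for brevity the l/r slot bit of each destination is omitted.
      OPCODES = {"add": 0x01, "sll": 0x02, "xor": 0x03, "ld": 0x04, "st": 0x05}

      def pack(node_id, opcode, dest0, dest1):
          return (node_id << 20) | (OPCODES[opcode] << 12) | (dest0 << 6) | dest1

      def unpack(word):
          return ((word >> 20) & 0x3F, (word >> 12) & 0xFF,
                  (word >> 6) & 0x3F, word & 0x3F)

      w = pack(1, "add", 4, 4)   # roughly '1 add 4l' with both dests on node 4
      print(hex(w), unpack(w))   # -> 0x101104 (1, 1, 4, 4)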
  • Figure 9 is a schematic diagram of abstracting the code into a data flow graph provided by an embodiment of the present invention:
  • the instructions 0, 1, 2, 5, 6, 9 are placed on the corresponding computing units according to their own IDs.
  • instructions 0 and 5 calculate the address of A[i], and instructions 1, 2, and 6 calculate the data (a+b)*(c+d).
  • Each instruction represents the direction of data flow.
  • the corresponding inputs and connections are configured during the composition phase.
  • "2 add 6r" means: once all inputs 2l and 2r of the addition operation of instruction 2 arrive, perform the addition and pass the result of the operation to the right input of instruction 6 (6r).
  • "9 st" means: once all inputs 9l and 9r of the store operation of instruction 9 arrive, perform the store. The store does not need to pass data to other instructions, so no destination needs to be declared in instruction 9.
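  • Putting the Figure 9 example together as data (only "2 add 6r" and "9 st" are spelled out above; the rest of the wiring is a reading of the surrounding description, not quoted from it), the six flow instructions might be written as:

      # The code A[i] = (a+b)*(c+d) as flow instructions, following the
      # textual form '<id> <op> <dest><slot>'; wiring partly assumed.
      flow_instructions = [
          (0, "add", [(5, "l")]),   # address computation feeding instruction 5
          (5, "add", [(9, "l")]),   # address of A[i] -> store address input
          (1, "add", [(6, "l")]),   # a + b -> left input of the multiply
          (2, "add", [(6, "r")]),   # c + d -> right input ('2 add 6r' above)
          (6, "mul", [(9, "r")]),   # (a+b)*(c+d) -> store data input
          (9, "st",  []),           # store declares no destination
      ]
      for nid, op, dests in flow_instructions:
          targets = " ".join(f"{t}{slot}" for t, slot in dests) or "(none)"
          print(f"{nid} {op} {targets}")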
  • the degree of parallelism between instructions is obvious (for example, instructions 0, 1, 2 and 5, 6).
  • the only thing the hardware needs to do is check in parallel whether the inputs required by each node have arrived. This is why the Graphflow architecture does not require a lot of hardware logic for dependency analysis.
  • in the execution phase, for each node, as long as its inputs arrive, its operation can be performed. Therefore, there is no need to put the source information of the instruction in the encoding of the stream instruction.
  • the input of each flow instruction may be dynamically passed in by different nodes, or may be passed in by other hardware modules. An instruction does not care where its data is read from; as long as other instructions send the data it needs, the operation can be performed.
  • FIG. 10 is a schematic flowchart of a processing method provided by an embodiment of the present invention.
  • the processing method is applied to a processor.
  • the processor includes a processor core.
  • the processor core includes an instruction scheduling unit, and a graph calculation flow unit and at least one general arithmetic unit connected to the instruction scheduling unit; the processing method is applicable to any one of the processors in Figures 1 to 3 and to devices containing such processors (such as mobile phones, computers, servers, etc.).
  • the method may include the following steps S201-S203, where:
  • Step S201: through the instruction scheduling unit, allocate the general calculation instructions among the decoded instructions to be executed to the at least one general calculation unit, and allocate the graph calculation control instructions among the decoded instructions to be executed to the graph calculation flow unit, where the general calculation instructions are used to instruct the execution of general calculation tasks and the graph calculation control instructions are used to instruct the execution of graph calculation tasks.
  • Step S202: execute the general calculation instruction through the at least one general arithmetic unit.
  • Step S203: execute the graph calculation control instruction through the graph calculation flow unit.
  • the processor core further includes an instruction fetching unit and an instruction decoding unit, and the above method further includes:
  • the target program to be executed is obtained from the memory unit through the instruction fetching unit, and the target program is decoded through the instruction decoding unit to obtain the decoded instructions to be executed.
  • the processor core further includes a result write-back unit; the graph calculation flow unit and the at least one general arithmetic unit are respectively connected to the result write-back unit; the method further includes :
  • Part or all of the first execution result and the second execution result are written back to the instruction scheduling unit through the result write-back unit.
  • the processor further includes a memory unit;
  • the graph calculation flow unit includes N computing nodes;
  • the graph calculation control instruction includes a start composition instruction, and the start composition instruction carries the target address in the memory unit;
  • the execution of the graph calculation control instruction by the graph calculation flow unit includes:
  • the graph calculation flow unit receives the start composition instruction and reads the composition block information from the memory unit according to the target address, where the composition block information includes the operation method of each of the N computing nodes, and the connection and sequence information between the N computing nodes.
  • the graph computing control instruction includes a parameter transfer instruction, and the parameter transfer instruction carries the identifiers of M computing nodes and the input parameters corresponding to the identifiers of the M computing nodes.
  • the M computing nodes are some or all of the N nodes; the execution of the graph computing control instruction by the graph computing flow unit includes:
  • the graph computing flow unit receives the parameter transfer instruction, and inputs the input parameters corresponding to the identifiers of the M computing nodes into the M computing nodes, respectively.
  • connection and sequence information between the N computing nodes includes a source node and a destination node respectively corresponding to the L edges;
  • the graph computing flow unit executes the graph computing control Instructions, including:
  • through the graph calculation flow unit, it is monitored whether the input parameters required by each of the N calculation nodes are ready; for a target calculation node whose input parameters are ready, the input parameters of the target calculation node are input into the calculation method corresponding to the target calculation node to obtain a calculation result; according to the source node and the destination node corresponding to the L edges, the calculation result of the source node in each edge is input as an input parameter to the corresponding destination node.
  • the graph calculation control instruction includes a start graph calculation instruction; the execution of the graph calculation control instruction by the graph calculation flow unit to obtain the execution result of the graph calculation task includes: after the start graph calculation instruction is received by the graph calculation flow unit, determining whether the current composition has been completed, and if it is completed, starting to execute the graph calculation task.
  • the method further includes:
  • the instruction scheduling unit controls the processor core to enter a blocking state after the graph calculation flow unit receives the start graph calculation instruction and before the graph calculation task is completed.
  • the method further includes:
  • a synchronize-execution-result instruction is sent to the graph calculation flow unit through the instruction scheduling unit, and after the graph calculation flow unit receives the synchronize-execution-result instruction and before the graph calculation task is completed, the processor core is controlled to enter the blocking state.
  • the method further includes:
  • after the graph calculation flow unit completes the graph calculation task, the processor core is controlled to exit the blocking state.
  • the processor core further includes a result write-back unit, and the write-back unit includes a plurality of registers; the graph calculation flow unit and the at least one general arithmetic unit are respectively connected to the The result write-back unit is connected; the graph calculation control instruction includes a parameter return instruction, and the parameter return instruction carries the identities of K computing nodes and the registers corresponding to the identities of the K computing nodes; The graph calculation flow unit executes the graph calculation control instruction to obtain the execution result of the graph calculation task, including:
  • the calculation results of the K calculation nodes are respectively sent to the corresponding registers in the result write-back unit through the control of the graph calculation flow unit.
  • the general arithmetic instruction includes a general arithmetic logic instruction; the at least one general arithmetic unit includes an arithmetic logic unit ALU; executing the general arithmetic instruction through the at least one general arithmetic unit includes: receiving, through the arithmetic logic unit ALU, the general arithmetic logic instruction sent by the instruction scheduling unit and performing a logical operation.
  • the general operation instruction includes a memory read/write instruction; the at least one general operation unit includes a memory read/write unit LSU; executing the general calculation instruction through the at least one general operation unit to obtain the execution result of the general computing task includes: receiving, through the memory read/write unit LSU, the memory read/write instruction sent by the instruction scheduling unit and performing a memory read/write operation.
  • the graph calculation control instruction includes a data read and write instruction, and the data read and write instruction carries a memory read and write address; the method further includes:
  • the graph calculation flow unit reads data from the memory read/write unit LSU or writes data to the memory read/write unit LSU according to the memory read/write address in the data read/write instruction.
  • the at least one general arithmetic unit further includes a floating-point arithmetic unit FPU; the graph calculation task includes floating-point arithmetic; the method further includes: sending, through the graph calculation flow unit, the floating-point operation data to the floating-point arithmetic unit FPU for calculation, and receiving the calculation result fed back by the FPU.
  • the at least one general operation unit further includes a vector operation unit SIMD; the graph calculation task includes vector operation; and the method further includes:
  • the graph calculation flow unit sends the vector operation data to the vector operation unit SIMD for calculation, and receives the calculation result fed back by the SIMD.
  • the embodiment of the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor can execute any of the methods described in the above method embodiments. A part or all of the steps.
  • the embodiment of the present invention also provides a computer program, the computer program includes instructions, when the computer program is executed by a multi-core processor, the processor can execute part or all of the steps of any one of the above method embodiments .
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the aforementioned integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc., specifically a processor in a computer device) to execute all or part of the steps of the foregoing methods of the various embodiments of the present application.
  • the aforementioned storage media may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, abbreviation: ROM), a random access memory (Random Access Memory, abbreviation: RAM), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)
PCT/CN2020/093627 2020-05-30 2020-05-30 一种处理器、处理方法及相关设备 WO2021243490A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
JP2022573476A JP7495030B2 (ja) 2020-05-30 2020-05-30 プロセッサ、処理方法、および関連デバイス
EP20939004.6A EP4152150B1 (de) 2020-05-30 2020-05-30 Prozessor, verarbeitungsverfahren und zugehörige vorrichtung
BR112022024535A BR112022024535A2 (pt) 2020-05-30 2020-05-30 Processador, método de processamento, e dispositivo relacionado
PCT/CN2020/093627 WO2021243490A1 (zh) 2020-05-30 2020-05-30 一种处理器、处理方法及相关设备
CN202080101335.6A CN115668142A (zh) 2020-05-30 2020-05-30 一种处理器、处理方法及相关设备
US18/070,781 US20230093393A1 (en) 2020-05-30 2022-11-29 Processor, processing method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/093627 WO2021243490A1 (zh) 2020-05-30 2020-05-30 一种处理器、处理方法及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/070,781 Continuation US20230093393A1 (en) 2020-05-30 2022-11-29 Processor, processing method, and related device

Publications (1)

Publication Number Publication Date
WO2021243490A1 true WO2021243490A1 (zh) 2021-12-09

Family

ID=78831452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093627 WO2021243490A1 (zh) 2020-05-30 2020-05-30 一种处理器、处理方法及相关设备

Country Status (6)

Country Link
US (1) US20230093393A1 (de)
EP (1) EP4152150B1 (de)
JP (1) JP7495030B2 (de)
CN (1) CN115668142A (de)
BR (1) BR112022024535A2 (de)
WO (1) WO2021243490A1 (de)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116414541B (zh) * 2023-05-26 2023-09-05 摩尔线程智能科技(北京)有限责任公司 兼容多种任务工作模式的任务执行方法和装置
CN117707625B (zh) * 2024-02-05 2024-05-10 上海登临科技有限公司 支持指令多发的计算单元、方法及相应图形处理器

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073543A (zh) * 2011-01-14 2011-05-25 上海交通大学 通用处理器与图形处理器融合系统及其融合方法
CN208580395U (zh) * 2018-03-14 2019-03-05 武汉市聚芯微电子有限责任公司 一种处理器流水线结构
US20200057613A1 (en) * 2014-02-06 2020-02-20 Oxide Interactive, LLC Method and system of a command buffer between a cpu and gpu
CN110826708A (zh) * 2019-09-24 2020-02-21 上海寒武纪信息科技有限公司 一种用多核处理器实现神经网络模型拆分方法及相关产品

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4560705B2 (ja) 1999-08-30 2010-10-13 Fuji Xerox Co., Ltd. Control method for a data processing device
JP2005108086A (ja) 2003-10-01 2005-04-21 Handotai Rikougaku Kenkyu Center:Kk Data processing device
EP1622009A1 (de) * 2004-07-27 2006-02-01 Texas Instruments Incorporated JSM architecture and systems
US10242493B2 (en) * 2014-06-30 2019-03-26 Intel Corporation Method and apparatus for filtered coarse pixel shading
US9984037B1 (en) * 2015-04-27 2018-05-29 Synaptic Engines, Llc Scheduler for a fine grained graph processor
US20170286122A1 (en) * 2016-04-01 2017-10-05 Intel Corporation Instruction, Circuits, and Logic for Graph Analytics Acceleration
US20220214861A1 (en) * 2021-01-06 2022-07-07 Samsung Electronics Co., Ltd. System and method to accelerate graph feature extraction
US11921784B2 (en) * 2021-05-13 2024-03-05 Advanced Micro Devices, Inc. Flexible, scalable graph-processing accelerator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4152150A4 *

Also Published As

Publication number Publication date
CN115668142A (zh) 2023-01-31
EP4152150A4 (de) 2023-06-28
JP2023527227A (ja) 2023-06-27
EP4152150B1 (de) 2024-06-19
US20230093393A1 (en) 2023-03-23
EP4152150A1 (de) 2023-03-22
JP7495030B2 (ja) 2024-06-04
BR112022024535A2 (pt) 2023-01-31

Similar Documents

Publication Publication Date Title
CN107810480B (zh) Instruction block allocation based on performance metrics
EP3350686B1 (de) Unterstützung zur fehlerbeseitigung für blockbasierten prozessor
CN108027770B (zh) Dense read encoding for a dataflow ISA
US11681531B2 (en) Generation and use of memory access instruction order encodings
US20080046689A1 (en) Method and apparatus for cooperative multithreading
US20230093393A1 (en) Processor, processing method, and related device
US20160378491A1 (en) Determination of target location for transfer of processor control
KR20180021812A (ko) Block-based architecture with parallel execution of successive blocks
JP2018507474A (ja) Method and system for accelerating task control flow
US10175988B2 (en) Explicit instruction scheduler state information for a processor
JP2001306324A (ja) Method and apparatus for identifying splittable packets in a multithreaded VLIW processor
EP3491514A1 (de) Transactional register file for a block-based processor
US7779240B2 (en) System and method for reducing power consumption in a data processor having a clustered architecture
US20230195526A1 (en) Graph computing apparatus, processing method, and related device
CN108027735B (zh) Apparatus, method, and computer-readable storage medium for operating a processor
Hou et al. FuMicro: A Fused Microarchitecture Design Integrating In‐Order Superscalar and VLIW
WO2021253359A1 (zh) Graph instruction processing method and apparatus
US20150074378A1 (en) System and Method for an Asynchronous Processor with Heterogeneous Processors
US20230205530A1 (en) Graph Instruction Processing Method and Apparatus
WO2024087039A1 (zh) Block instruction processing method and block instruction processor
WO2023123453A1 (zh) Processing method for operation acceleration, method for using an operation accelerator, and operation accelerator
May XMOS XS1 Architecture
CN118349283A (zh) Execution method and apparatus for a non-blocking macro-instruction multi-stage pipelined processor for distributed cluster systems
Nishikawa et al. CUE-v3: Data-Driven Chip Multi-Processor for Ad hoc and Ubiquitous Networking Environment.
CN118295712A (zh) Data processing method, apparatus, device, and medium

Legal Events

  • 121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20939004; Country of ref document: EP; Kind code of ref document: A1)
  • ENP Entry into the national phase (Ref document number: 2022573476; Country of ref document: JP; Kind code of ref document: A)
  • WWE Wipo information: entry into national phase (Ref document number: 202237069970; Country of ref document: IN)
  • REG Reference to national code (Ref country code: BR; Ref legal event code: B01A; Ref document number: 112022024535; Country of ref document: BR)
  • ENP Entry into the national phase (Ref document number: 2020939004; Country of ref document: EP; Effective date: 20221213)
  • NENP Non-entry into the national phase (Ref country code: DE)
  • ENP Entry into the national phase (Ref document number: 112022024535; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20221130)