WO2024119869A1 - Method for executing inter-chip communication task, and related product - Google Patents

Publication number
WO2024119869A1
WO2024119869A1 PCT/CN2023/112579 CN2023112579W WO2024119869A1 WO 2024119869 A1 WO2024119869 A1 WO 2024119869A1 CN 2023112579 W CN2023112579 W CN 2023112579W WO 2024119869 A1 WO2024119869 A1 WO 2024119869A1
Authority
WO
WIPO (PCT)
Prior art keywords
communication
concurrent
primitive
communication primitive
executed
Prior art date
Application number
PCT/CN2023/112579
Other languages
French (fr)
Chinese (zh)
Inventor
朝鲁
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2024119869A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • The present disclosure relates to the field of chips, and more specifically, to inter-chip communication.
  • One purpose of the present disclosure is to use a single computing core of an artificial intelligence chip to avoid communication deadlock through coroutine programming.
  • A further purpose of the present disclosure is to use a single computing core of an artificial intelligence chip to perform time-division-multiplexed inter-chip communication through coroutine programming, so as to support concurrent communication tasks.
  • a method for executing an inter-chip communication task wherein the inter-chip communication task is described by a communication primitive queue, and the communication primitive queue includes a plurality of communication primitives, the plurality of communication primitives including a serial communication primitive connected in series, the method comprising: performing a search for the communication primitive queue to determine the status of the serial communication primitives in the communication primitive queue; in response to searching for an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
  • an electronic device comprising: one or more processors; and a memory, wherein the memory stores computer executable instructions, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method described above.
  • a computer-readable storage medium comprising computer-executable instructions, wherein, when the computer-executable instructions are executed by one or more processors, the method described above is executed.
  • The technical solution provided by the present disclosure can bring at least one of the following beneficial effects: without introducing a hardware multi-threading mechanism, the time-sharing reuse capability of the computing core can be realized by a software coroutine method, so that the computing core can be fully utilized and task deadlock can be avoided.
  • The coroutine execution process requires relatively small changes to the hardware, and generally supports various SIMD (Single Instruction, Multiple Data) processing architectures in realizing software time-sharing reuse.
  • The asynchronous confirmation of asynchronous communication primitives is supported by the primitive jump mechanism, and automatic software communication retransmission can be realized without modifying the OP (communication primitive) logic.
  • The solution disclosed in the present disclosure is sufficient to solve the deadlock problem caused by communication congestion.
  • FIG. 1 is a schematic diagram showing the structure of a board 10 according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram showing the combined processing device 101 of this embodiment.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201.
  • FIG. 4 shows the internal architecture of the processing core.
  • FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure.
  • FIG. 6 shows an example of coroutine execution according to one embodiment of the present disclosure.
  • FIG. 7 shows possible changes in the working state of a communication primitive (OP).
  • FIG. 8 is a schematic diagram showing the execution of a serial primitive queue according to an embodiment of the present disclosure.
  • FIG. 9 shows a schematic diagram of communication primitives involving in-situ operations.
  • FIG. 10 shows an exemplary application scenario in which there are multiple concurrent communication primitives.
  • FIGS. 11a to 11f are schematic diagrams showing a method of setting concurrent communication primitives between serial communication primitives according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context.
  • the phrase “if it is determined” or “if [described condition or event] is detected” may be interpreted as meaning “upon determination” or “in response to determining” or “upon detection of [described condition or event]” or “in response to detecting [described condition or event],” depending on the context.
  • Wafers are circular sheets made of pure silicon, generally divided into 6-inch, 8-inch, 12-inch and other specifications. Wafers are cut into small pieces, which are called dies. Each die is mounted with a chip and wired to achieve specific electrical functions. Then the die is packaged into a particle. The purpose of packaging is to place, fix, seal, protect the chip and enhance the electrical and thermal performance. At the same time, the contacts of the chip are connected to the pins of the package shell with wires, and a chip package structure is completed.
  • the memory is used to temporarily store the computing data required by the system on chip and the data exchanged with the external memory.
  • the memory can be a high-bandwidth memory (HBM), which is a high-performance DRAM (Dynamic Random Access Memory) made based on a 3D stacking process and is suitable for applications with high memory bandwidth requirements, such as graphics processors, online switching and forwarding equipment (such as routers, switches), etc.
  • the board 10 includes a combined processing device 101, which is an artificial intelligence computing unit to support various deep learning and machine learning algorithms to meet the intelligent processing requirements in complex scenarios in the fields of computer vision, speech, natural language processing, data mining, etc.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications and has huge off-chip storage, on-chip storage and a large amount of computing power.
  • the combined processing device 101 is connected to the external device 103 through the external interface device 102.
  • The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.
  • the data to be processed can be transmitted from the external device 103 to the combined processing device 101 through the external interface device 102.
  • the calculation result of the combined processing device 101 can be transmitted back to the external device 103 via the external interface device 102.
  • The external interface device 102 can have different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
  • the board 10 also includes an external memory 104 for storing data, which includes one or more storage units 105.
  • the external memory 104 is connected to the control device 106 and the combined processing device 101 through a bus and transmits data.
  • the control device 106 in the board 10 is configured to control the state of the combined processing device 101.
  • The control device 106 may include a single-chip microcomputer, also known as a microcontroller unit (MCU).
  • FIG. 2 is a schematic diagram showing the combined processing device 101 of this embodiment.
  • the combined processing device 101 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • the computing device 201, the interface device 202, and the processing device 203 are integrated into the aforementioned system on chip.
  • the computing device 201 itself is the aforementioned system on chip.
  • the computing device 201 is configured to execute user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203.
  • the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into the storage device on the computing device 201 chip.
  • the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the computing device 201 chip.
  • the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203.
  • the processing device 203 performs basic controls including but not limited to data handling, starting and/or stopping the computing device 201, etc.
  • The processing device 203 can be one or more types of processors among a central processing unit, a graphics processing unit, and other general-purpose and/or special-purpose processors, which include but are not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • The computing device 201 can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • When the computing device 201 and the processing device 203 are integrated and considered together, the two are regarded as forming a heterogeneous multi-core structure.
  • DRAM 204 is the aforementioned high-bandwidth memory, which is used to store data to be processed. Its size is usually 16G or larger and is used to save data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of a computing device 201.
  • the computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 201 in the figure adopts a multi-core hierarchical structure design, which includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and multiple clusters 305.
  • the peripheral communication module 302 is used to receive the control signal from the processing device 203 through the interface device 202, and start the computing device 201 to perform the task.
  • the on-chip interconnect module 303 connects the external storage controller 301, the peripheral communication module 302 and the multiple clusters 305 to transmit data and control signals between the modules.
  • The synchronization module 304 is a global barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • Clusters 305 are the computing cores of the computing device 201. Four are shown as examples in the figure. With the development of hardware, the computing device 201 disclosed in the present invention can also include 8, 16, 64, or even more clusters 305. Clusters 305 are used to efficiently execute deep learning algorithms.
  • Each cluster 305 includes multiple processor cores (IPU Core) 306 and a storage core (MEM Core) 307.
  • Each processor core 306 includes three modules: a control module 41, an operation module 42, and a storage module 43.
  • the control module 41 is used to coordinate and control the operation of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412.
  • the instruction fetch unit 411 is used to obtain instructions from the processing device 203, and the instruction decode unit 412 decodes the obtained instructions and sends the decoding results to the operation module 42 and the storage module 43 as control information.
  • the operation module 42 includes a vector operation unit 421 and a matrix operation unit 422.
  • the vector operation unit 421 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 43 is used to store or transfer related data, including a neuron RAM (NRAM) 431, a weight RAM (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a transfer direct memory access module (MVDMA) 434.
  • NRAM 431 is used to store input and output data and intermediate results for calculation by the processor core 306;
  • WRAM 432 is used to store the weights of the deep learning network;
  • IODMA 433 controls the memory access between NRAM 431/WRAM 432 and DRAM 204 through the broadcast bus 309;
  • MVDMA 434 is used to control the memory access between NRAM 431/WRAM 432 and SRAM 308.
  • the storage core 307 is mainly used for storage and communication, that is, to store shared data or intermediate results between the processor cores 306, and to perform communication between the cluster 305 and the DRAM 204, between the clusters 305, and between the processor cores 306.
  • The storage core 307 is also capable of performing scalar operations.
  • The storage core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311.
  • the SRAM 308 plays the role of a high-performance data transfer station.
  • the data reused between different processor cores 306 in the same cluster 305 does not need to be obtained from the DRAM 204 by each processor core 306, but is transferred between the processor cores 306 through the SRAM 308.
  • the storage core 307 only needs to quickly distribute the reused data from the SRAM 308 to multiple processor cores 306, so as to improve the efficiency of inter-core communication and greatly reduce on-chip and off-chip input/output access.
  • Broadcast bus 309, CDMA 310 and GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305 and data transmission between clusters 305 and DRAM 204, respectively. They will be described below.
  • the broadcast bus 309 is used to complete high-speed communication between the processor cores 306 in the cluster 305.
  • the broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission; multicast is a communication mode of transmitting a copy of data from SRAM 308 to specific processor cores 306; and broadcast is a communication mode of transmitting a copy of data from SRAM 308 to all processor cores 306, which is a special case of multicast.
  • CDMA 310 is used to control the access of SRAM 308 between different clusters 305 in the same computing device 201.
  • GDMA 311 cooperates with the external storage controller 301 to control memory access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308.
  • The term "inter-chip" as used in the present disclosure has multiple meanings.
  • “machine” usually refers to a server computing node host, and “inter-machine communication” can refer to the communication between multiple computing node hosts.
  • "Card" usually refers to a dedicated AI (Artificial Intelligence) computing device installed on a server computing node, and a "card" has one or more chips, such as an MLU (Machine Learning Unit) or a GPU (Graphics Processing Unit).
  • There are inter-chip high-speed interconnection communication facilities between multiple machines and multiple cards, such as an inter-chip communication network built on SerDes (serializer/deserializer) and a host-level network based on InfiniBand.
  • Inter-chip communication includes communication between different chips on multiple hosts, communication between different chips on the same "card", and communication between different chips on multiple cards in the same host.
  • RDMA (Remote Direct Memory Access): card A can asynchronously write/read data to/from the memory of card B without card B performing any operation.
  • Allreduce operator: In multi-machine multi-card neural network training, in order to ensure that the data-parallel training results of the multiple machines and cards converge, each device participating in the distributed training needs to pass the gradient information ΔW_i obtained from back propagation (BP) on the current device to the other devices, so that each device finally obtains the reduced result of all gradient information, that is, ΣΔW_i.
  • The method of propagating and accumulating gradient information is called the Allreduce operator.
  • the Allreduce operator can be implemented on different network topologies.
  • The Allreduce operator optimized for the ring topology (Ring) uses the Ring Allreduce algorithm. From the perspective of a single card, the core processes that Allreduce needs to implement are: Receive (abbreviated as R), Compute (abbreviated as C), and Send (abbreviated as S).
  • The R part corresponds to receiving the gradient information ΔW_(i-1) sent by the upstream device.
  • The S part corresponds to sending the updated gradient information ΔW_(i) to the downstream device.
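  • As a minimal sketch of the Receive/Compute/Send loop described above, the following Python simulation runs Ring Allreduce over a list of in-memory "devices". The function name `ring_allreduce`, the chunk layout (one scalar chunk per device) and the two-phase reduce-scatter/allgather schedule are assumptions chosen for illustration; the disclosure itself does not give an implementation.

```python
def ring_allreduce(grads):
    """Simulate Ring Allreduce over n devices, each holding n gradient chunks."""
    n = len(grads)
    data = [list(g) for g in grads]  # data[i][k]: chunk k as seen by device i

    # Phase 1 (reduce-scatter): after n-1 steps, device i holds the
    # complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # S: every device sends one chunk to its downstream neighbour.
        sent = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i in range(n):
            k, value = sent[(i - 1) % n]  # R: receive from upstream device i-1
            data[i][k] += value           # C: accumulate into the local chunk

    # Phase 2 (allgather): circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sent = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i in range(n):
            k, value = sent[(i - 1) % n]
            data[i][k] = value            # overwrite with the reduced value
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

  • In this sketch every device ends with the same reduced gradient vector, and each device only ever exchanges data with its immediate ring neighbours.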
  • Synchronization problem: In RDMA mode, the computing core of card A writes a data payload to the memory area of card B. At this point, the computing core of card B cannot sense whether the data payload has been written. If the subsequent execution steps of the computing core of card B depend on the arrival of the data payload, the computing core of card B needs to sense that arrival. The process of sensing the arrival of the data payload is called communication synchronization.
  • Communication deadlock problem: Suppose card A and card B each have one computing core, and there are two dual-card communication tasks X and Y, which are sent to the two cards as X_A, X_B and Y_A, Y_B, respectively.
  • The communication tasks require that both ends of the communication run the same task in order to communicate normally.
  • If card A is executing X_A while card B is executing Y_B, then X_A and Y_B are both trapped in communication synchronization, polling and waiting for data to arrive. Because the tasks do not match, communication tasks X and Y will wait forever, causing a communication deadlock.
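  • The deadlock condition can be illustrated with a small simulation. This is a sketch under stated assumptions: each card has one core that runs its tasks strictly in order without ever yielding, a task completes only when the peer card is running the same task, and polling is bounded by a hypothetical `max_polls` so the example terminates.

```python
def run_blocking(order_a, order_b, max_polls=1000):
    """Return True if both cards finish all tasks, False if they stay stuck.

    order_a / order_b: task names in the order each single-core card runs them.
    """
    ia = ib = 0  # index of the task currently occupying each computing core
    for _ in range(max_polls):
        if ia == len(order_a) and ib == len(order_b):
            return True
        both_busy = ia < len(order_a) and ib < len(order_b)
        if both_busy and order_a[ia] == order_b[ib]:
            # Both ends run the same task, so the communication completes.
            ia += 1
            ib += 1
        # Otherwise both cores keep polling and nothing changes.
    return ia == len(order_a) and ib == len(order_b)


matched = run_blocking(["X", "Y"], ["X", "Y"])
mismatched = run_blocking(["X", "Y"], ["Y", "X"])
```

  • With matching orders both tasks complete; with mismatched orders neither core ever leaves its polling loop, which is the deadlock described above.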
  • The first solution is polling, using a hardware multi-threaded programming method (Single Instruction, Multiple Threads, SIMT).
  • Step a: Device A writes data (Data) and a tag (Flag), in that order, to the specified memory area of device B through RDMA.
  • Step b: In the communication receiving task, the computing core of device B polls whether the Flag has changed. If the Flag has not changed, go to step c; if the Flag has changed, go to step d.
  • Step c: If the Flag has not changed, the Data has not yet completed transmission. At this point, because it would otherwise wait to no purpose, the thread can be switched out, and the computing core is released to other computing/communication tasks.
  • The specific hardware multi-threading method is to save the context (including the current program execution pointer, stack information, and register information) when the thread is switched out, restore the context when the thread is switched back, and resume execution from the thread breakpoint.
  • When the thread resumes execution, if the Flag is read as changed, it proceeds to step d; otherwise, it continues to poll in step c.
  • Step d: If the Flag has changed, the Data has been transmitted. At this point, the computing core of device B can safely read the Data, for example, to perform Reduce calculations.
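  • Steps a to d can be sketched as follows. This is an illustrative model only: `DeviceMemory` and `rdma_write` are hypothetical stand-ins for device B's memory region and the RDMA transfer, and the thread switch-out of step c is modelled with a Python generator `yield` rather than real hardware context save/restore.

```python
class DeviceMemory:
    """Hypothetical memory region on device B targeted by device A."""
    def __init__(self):
        self.data = None
        self.flag = 0


def rdma_write(mem, payload):
    # Step a: Data is written first, then Flag, so a changed Flag
    # implies that Data is already complete.
    mem.data = list(payload)
    mem.flag = 1


def receive_and_reduce(mem, local):
    last_flag = mem.flag
    while mem.flag == last_flag:  # step b: poll the Flag
        yield                     # step c: switch out, release the core
    # Step d: Flag changed, Data is safe to read; perform the Reduce.
    return [a + b for a, b in zip(local, mem.data)]


mem = DeviceMemory()
task = receive_and_reduce(mem, [1, 2, 3])
polls = 0
result = None
while result is None:
    try:
        next(task)                # resume the receiving task
        polls += 1
        if polls == 3:            # the data arrives after three failed polls
            rdma_write(mem, [10, 20, 30])
    except StopIteration as done:
        result = done.value
```

  • Writing Data before Flag matters: if the order were reversed, the receiver could observe the changed Flag and read incomplete Data.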
  • The first solution is mainly based on a SIMT (Single Instruction, Multiple Threads) implementation and requires hardware multi-threading support for execution.
  • The second solution, which is interrupt-based, can be found in the Chinese patent application with publication number CN114691312A. The communication synchronization steps of this interrupt-based approach are as follows:
  • Step a: Device A writes the hardware descriptor and data to device B via RDMA.
  • Step b: Device B receives the hardware descriptor, which means that the data has been received.
  • the data reception completion interrupt causes the hardware to parse the hardware descriptor, and then triggers the computing core to perform computing tasks according to the hardware descriptor content, such as Reduce computing.
  • The second solution is mainly based on a SIMD (Single Instruction, Multiple Data) implementation, which supports computing core reuse through software and requires minimal hardware changes, thus avoiding a large number of hardware replacements or changes.
  • FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes multiple communication primitives, and the multiple communication primitives include serial communication primitives connected in series.
  • the method includes: in operation S510, performing a search for the communication primitive queue to determine the status of the serial communication primitives in the communication primitive queue; in operation S520, in response to searching for an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
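  • Operations S510 and S520 can be sketched as a search followed by a re-execution loop. The status names (`DONE`, `INTERRUPTED`, `PENDING`), the dictionary layout of a queue entry and the `execute` callback are all hypothetical and serve only to illustrate the control flow.

```python
from enum import Enum


class Status(Enum):
    DONE = "done"
    INTERRUPTED = "interrupted"
    PENDING = "pending"


def find_interrupted(queue):
    """S510: search the queue and return the index of the first interrupted OP."""
    for index, op in enumerate(queue):
        if op["status"] is Status.INTERRUPTED:
            return index
    return None


def resume(queue, execute):
    """S520: re-execute the queue starting from the interrupted OP."""
    start = find_interrupted(queue)
    if start is None:
        return []
    ran = []
    for op in queue[start:]:
        execute(op)
        op["status"] = Status.DONE
        ran.append(op["name"])
    return ran


queue = [
    {"name": "OP0", "status": Status.DONE},
    {"name": "OP1", "status": Status.DONE},
    {"name": "OP2", "status": Status.INTERRUPTED},
    {"name": "OP3", "status": Status.PENDING},
]
ran = resume(queue, execute=lambda op: None)
```

  • In this sketch the primitives before the interrupted one are skipped rather than executed again, so already completed communication is not repeated.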
  • A coroutine is a non-preemptive scheduling mechanism in which the software itself actively sleeps and wakes up.
  • In contrast, threads use a preemptive scheduling mechanism in which the software is passively put to sleep and awakened by operating system and hardware scheduling.
  • In multi-threaded mode, the code can be:
  • In multi-threaded mode, the thread will be automatically switched out by the operating system to execute other functions when its time slice is exhausted during the read(X) process, and will resume execution after a period of time.
  • The thread is unaware of the switching-out and switching-in process.
  • In coroutine mode, the code can be:
  • In coroutine mode, the coroutine actively sleeps and wakes up other functions that are in the Sleep state.
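  • The cooperative behaviour described above can be sketched with Python generators. This is not the code of the disclosure; `comm_task`, `peer`, `run` and the `mailbox` set are hypothetical names. Each communication task yields (goes to sleep) when its data has not arrived, which lets a single core make progress on other tasks instead of blocking.

```python
from collections import deque


def comm_task(name, mailbox, log):
    # Poll for this task's data; sleep voluntarily instead of blocking the core.
    while name not in mailbox:
        log.append(f"{name} sleeps")
        yield
    log.append(f"{name} done")


def peer(mailbox, arrivals):
    # Models the remote side: data for one task arrives per scheduling turn.
    for name in arrivals:
        mailbox.add(name)
        yield


def run(tasks):
    """Round-robin scheduler: wake each sleeping coroutine in turn."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
            ready.append(task)  # the task went back to sleep
        except StopIteration:
            pass                # the task finished


mailbox, log = set(), []
# Data for Y arrives before data for X, the reverse of the local task order.
run([comm_task("X", mailbox, log), comm_task("Y", mailbox, log),
     peer(mailbox, ["Y", "X"])])
```

  • Even though task X is started first and its data arrives last, task Y still completes, because X gives up the core each time it finds nothing to read.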
  • The communication primitives disclosed herein may be asynchronous communication primitives.
  • Asynchronous communication primitives may have the following characteristics:
  • Asynchronicity: The execution of the communication primitive is asynchronous; that is, when the communication primitive function returns, the communication operation represented by the communication primitive may still be executing on the hardware, and the communication primitive cannot be confirmed as fully completed until the asynchronous response is received.
  • Non-idempotent: Repeated execution of a communication primitive may lead to incorrect execution results, so a communication primitive usually cannot be executed twice unless its previous execution was not fully completed due to a communication error.
  • Unreliable: The communication primitive may suffer packet loss due to problems such as link quality, so the communication primitive needs to re-execute the transmission of the lost part.
  • Communication primitive queues may be used to describe inter-chip communication tasks, which may be executed serially or in parallel.
  • Serially connected communication primitives are referred to as serial communication primitives, and communication primitives connected in parallel are referred to as concurrent communication primitives.
  • A communication primitive queue may include multiple serial communication primitives, or may include a mix of serial communication primitives and concurrent communication primitives, which will be described in more detail later.
  • FIG. 6 shows an example of coroutine execution according to one embodiment of the present disclosure.
  • the communication process can be abstracted as a serial execution process of a series of communication primitives OP, the initial state of which is START (start), and the terminal state is FINISH (end). Under normal circumstances, if no communication blocking is encountered, the coroutine operation will execute communication primitives OP0 to OP5 one by one from START until FINISH is executed. It should be pointed out that, for the convenience of identification, in FIG6 , the communication primitives OP that are skipped and not executed are marked with a dotted frame, and the communication primitives OP that are actually executed are represented by a solid frame.
  • Context recovery includes two parts:
  • Restoration of context data: restoring context data only requires reloading the content in a specific storage space area.
  • the method of the present disclosure further includes: defining a state machine, wherein the state machine is used to describe the working state of the communication primitive; determining whether the communication primitive is interrupted according to the working state of the communication primitive described by the state machine; wherein the working state includes: waiting state; working state; executed state; and confirmed state.
  • FIG. 7 shows possible changes in the working state of the communication primitive OP.
  • When a communication primitive OP has not been executed, it is in the waiting state, which indicates that the communication primitive has not been executed. All communication primitives OP may initially be in the waiting state, and their state changes only after they are executed.
  • The working state follows the waiting state and indicates that the communication primitive is being executed: the communication request of the communication primitive has not been completely issued, and the response signal has not been completely received.
  • It refers to the state of the communication primitive OP while the computing core is executing it.
  • For a single computing core, at most one communication primitive is in the working state. In the working state, the communication primitive OP has not completely issued its communication request to other communication primitives and has not completely received the response signal. Under normal circumstances, the state of every communication primitive OP changes from the waiting state to the working state, unless the execution of the OP is skipped.
  • the Executed state after the "Working" state, is used to indicate that the communication primitive has been executed.
  • the communication request of the communication primitive has been completely issued, and the response signal has not been completely received.
  • the communication primitive OP will wait for the response signal for the communication request.
  • the communication primitive OP may not receive the response signal, or the reception of the response signal may be greatly delayed.
  • These situations include but are not limited to: the request signal issued does not reach the downstream communication primitive OP (for example, due to a communication line failure); the request signal issued reaches the downstream OP, but the signal of the downstream OP does not issue a response signal (for example, the downstream OP fails), or although the downstream OP issues a response signal, the OP that issued the communication request does not receive the response signal (for example, the communication line fails).
  • the OP does not receive a response signal.
  • Many other types of failures may also cause the OP to not receive a response signal, which will not be listed here.
  • the Confirmed state is after the Executed state and is used to indicate that the communication request of the communication primitive has been completely sent out and the response signal has been completely received.
  • the Confirmed state is the last state of the four states in the present disclosure.
  • the state of a communication primitive is converted according to the execution status of the communication primitive, and the conversion of the state is unidirectional.
  • the communication primitive is converted in the order of waiting state, working state, executed state and confirmed state, that is, the state of the communication primitive can only be converted from the waiting state to the working state, from the working state to the executed state, and from the executed state to the confirmed state, but cannot be converted in the opposite direction.
  • the "state" described in this application has two meanings, namely the state of the communication primitive and the state of the state machine; a change in the state of the communication primitive can cause the state of the state machine to change accordingly, but in practice, the change in the state of the state machine is not necessarily synchronized with the state of the communication primitive.
  • the state machine will periodically scan the queue of communication primitives, thereby updating its own state according to the state of the scanned communication primitive. In this case, the state of the communication primitive may have changed, but the state of the state machine may not have changed.
  • the state change in the state machine can jump, for example, it can jump directly from the "waiting" state to the "executed" state, or it can jump to the "confirmed" state, while the state of the communication primitive itself cannot jump.
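The four states and their one-way ordering can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: the names `OpState`, `advance` and `StateMachineView` are assumptions introduced here, and the periodic scan is reduced to a single `scan` call.

```python
from enum import IntEnum


class OpState(IntEnum):
    """The four primitive states, ordered so that a larger value is later."""
    WAITING = 0    # not yet executed
    WORKING = 1    # executing; request not completely issued
    EXECUTED = 2   # request completely issued; response not completely received
    CONFIRMED = 3  # request completely issued and response completely received


def advance(state: OpState) -> OpState:
    """Move a primitive exactly one step forward; it never skips or goes back."""
    if state == OpState.CONFIRMED:
        raise ValueError("CONFIRMED is the last state")
    return OpState(state + 1)


class StateMachineView:
    """Hypothetical tracker that samples a primitive's state only when it scans."""

    def __init__(self) -> None:
        self.seen = OpState.WAITING

    def scan(self, actual: OpState) -> OpState:
        # The sampled view may jump several steps at once, but never backwards.
        if actual < self.seen:
            raise ValueError("primitive state went backwards")
        self.seen = actual
        return self.seen
```

Because the tracker only samples, it can observe a jump from "waiting" straight to "executed", even though the primitive itself passed through "working" in between.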
  • the state of a subsequent serial communication primitive is prohibited from being later than the state of a preceding serial communication primitive; in other words, the state of a preceding serial communication primitive is prohibited from being earlier than the state of a subsequent serial communication primitive.
  • communication primitives OP1 to OP5 are communication primitives that are executed serially and in sequence
  • the communication primitives OP1 and OP2 before the communication primitive OP3 can be in the executed state or the confirmed state, but are prohibited from being in the waiting state or the working state, that is, the states of the communication primitives OP1 and OP2 before the communication primitive OP3 cannot be before the state of the communication primitive OP3; and the communication primitives OP4 and OP5 after the communication primitive OP3 can be in the waiting state, the working state or the executed state, but cannot be in the confirmed state, that is, the states of the communication primitives OP4 and OP5 after the communication primitive OP3 cannot be after the state of the communication primitive OP3.
  • because the state of a communication primitive is never reversed, it is convenient to put the execution of the communication primitive queue to sleep when blocking occurs.
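The ordering constraint on a serial queue can be checked mechanically. The sketch below assumes the states are encoded as increasing integers; `OpState` and `serial_order_ok` are hypothetical names used only for illustration.

```python
from enum import IntEnum
from typing import Sequence


class OpState(IntEnum):
    WAITING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3


def serial_order_ok(states: Sequence[OpState]) -> bool:
    """Check a serial queue, listed from first primitive to last.

    No later primitive may be ahead of an earlier one, and a single
    computing core may have at most one primitive in the WORKING state.
    """
    non_increasing = all(a >= b for a, b in zip(states, states[1:]))
    one_worker = sum(1 for s in states if s == OpState.WORKING) <= 1
    return non_increasing and one_worker
```

For example, a queue OP1 to OP5 in the states confirmed, executed, executed, working, waiting passes the check, whereas a queue whose first primitive is still waiting while the second is already executed does not.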
  • FIG8 is a schematic diagram showing the execution of a serial primitive queue according to an embodiment of the present disclosure.
  • different backgrounds are used for distinction, for example, the "waiting" state is represented by a blank background, the "working" state is represented by a horizontal line background, the "executed" state is represented by a vertical line background, and the "confirmed" state is represented by a gray background.
  • a dotted frame represents a skipped OP
  • a solid frame represents an executed OP
  • a bold solid frame represents the currently selected and executed OP.
  • the method of the present disclosure further comprises: in response to communication congestion occurring in a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive where communication congestion occurs.
  • In step a, as shown in the "Start" row of FIG8, the queue is searched from front to back for an OP in the "waiting" state, and that OP is converted to the "working" state. Once the OP has been executed, it is converted to the "executed" state. Step a is repeated in this way until a communication blockage occurs. As shown in FIG8, assuming that a communication blockage occurs at OP3, a sleep operation is performed at OP3. Then, after the hardware wakes up the communication primitive queue, step b is executed. As shown in FIG8, after a period of execution, OP0-OP4 may all be in the "executed" state, while OP5-OP7 are still in the "waiting" state because they have not been selected for execution.
  • the method further includes: in response to communication congestion occurring in a serial communication primitive, keeping the state of the corresponding serial communication primitive in the executed state.
  • In step b, when the queue is reawakened, that is, during recovery, it can be determined that some OPs have been converted to the "confirmed" state by asynchronous confirmation (OP0-OP2 in the "recovery" row). The OPs in the "executed" state are searched from front to back; because these OPs encountered communication errors, they need to be re-executed (OP3-OP4 in the "recovery" row). When all OPs in the "executed" state have been searched (OP5 in the "recovery" row), step a is repeated. If all OPs are in the "confirmed" state, go to step c.
  • In step c, all OPs have been executed and the asynchronous operations have been confirmed as returned, so the end state can be entered (the last row).
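Steps a to c can be approximated in a few lines. The sketch below is a simplification under stated assumptions and does not reproduce FIG8 exactly: `run_queue` and the `blocked` hook are hypothetical, asynchronous confirmation is simulated by the caller overwriting entries with `CONFIRMED`, and a re-executed primitive simply stays in the executed state because states never move backwards.

```python
from enum import IntEnum
from typing import Callable, List


class OpState(IntEnum):
    WAITING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3


def run_queue(states: List[OpState], blocked: Callable[[int], bool]) -> str:
    """Run one wake-up of a serial queue and return 'sleep' or 'end'.

    `blocked(i)` is a hypothetical hook that reports congestion at primitive i.
    """
    for i, state in enumerate(states):
        if state == OpState.CONFIRMED:
            continue                        # confirmed asynchronously: skip it
        if state == OpState.WAITING:        # step a: select and execute
            states[i] = OpState.WORKING
            states[i] = OpState.EXECUTED
        # step b: an EXECUTED primitive is re-issued and stays EXECUTED
        if blocked(i):
            return "sleep"                  # exit at the blocked primitive
    # step c is reached only once every response has been confirmed
    if all(s == OpState.CONFIRMED for s in states):
        return "end"
    return "sleep"


states = [OpState.WAITING] * 6
first = run_queue(states, lambda i: i == 3)      # congestion at OP3
after_first = list(states)
states[0:3] = [OpState.CONFIRMED] * 3            # simulated asynchronous confirmation
second = run_queue(states, lambda i: False)      # recovery: OP3 onwards
after_second = list(states)
states[:] = [OpState.CONFIRMED] * 6
third = run_queue(states, lambda i: False)
```

In this run the first wake-up stops at OP3, the second re-issues OP3 and executes the remaining primitives, and the third finds everything confirmed and ends.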
  • re-executing the communication primitive queue includes: for a partially executed serial communication primitive, only re-executing the portion of the serial communication primitive that has not been executed.
  • In the communication primitive queue, the communication primitives that have been executed and are in the "confirmed" state will no longer be executed when the communication primitive queue is restored, and only those communication primitives that have not been executed or are incompletely executed will be executed.
  • If a communication primitive is blocked after part of it has been executed, its execution is incomplete; for such an incompletely executed communication primitive, the part that has been executed will not be executed again, and only the part that has not been executed will be executed.
  • This has several beneficial effects. For example, executing only the part of a communication primitive that has not been executed reduces repeated execution and improves efficiency during recovery.
  • For in-place operations, executing only the part of the communication primitive that has not been executed can avoid data errors.
  • An in-place operation is one that takes place at the storage location where the data involved in the operation originally resides.
  • Figure 9 shows a schematic diagram of the communication primitive involving in-place operation.
  • the original data of an in-place operation includes a 4*4 data matrix
  • the 0th row contains data {0,0,0,0}
  • the 1st row contains data {1,1,1,1}
  • the 2nd row contains data {2,2,2,2}
  • the 3rd row contains data {3,3,3,3}.
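The reason partial re-execution matters for in-place operations can be shown on the 4*4 matrix above. The operation itself is not specified in this excerpt, so the sketch assumes a hypothetical in-place update that adds 10 to every element; `make_matrix` and `add_in_place` are illustrative names only.

```python
from typing import List

Matrix = List[List[int]]


def make_matrix() -> Matrix:
    """The 4*4 source data: row r holds four copies of r."""
    return [[r] * 4 for r in range(4)]


def add_in_place(data: Matrix, start_row: int, stop_row: int, delta: int = 10) -> int:
    """Hypothetical in-place op: add `delta` to rows [start_row, stop_row).

    Returns the next row to process, which serves as the resume point.
    """
    for r in range(start_row, stop_row):
        for c in range(len(data[r])):
            data[r][c] += delta          # overwrites the source location
    return stop_row


expected = [[r + 10] * 4 for r in range(4)]

# Blocked after two rows, then resumed from the recorded row.
resumed = make_matrix()
done = add_in_place(resumed, 0, 2)
add_in_place(resumed, done, 4)

# Blocked after two rows, then restarted from row 0: rows 0 and 1 are updated twice.
restarted = make_matrix()
add_in_place(restarted, 0, 2)
add_in_place(restarted, 0, 4)
```

Resuming from the recorded row yields the expected result, while restarting from the beginning applies the update twice to the rows that had already been overwritten.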
  • FIG. 10 shows an exemplary application scenario in which there are multiple concurrent communication primitives.
  • concurrent execution means that different communication primitives can be executed alternately, thereby realizing an execution method similar to multi-threading in a single core. Through this mechanism, a single computing core can be used to complete the concurrent execution of multiple different communication primitives, and multiple different computing cores are not required.
  • the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further includes: executing the concurrent communication primitives in a time-sharing manner.
  • concurrency does not mean execution at the same time, but rather concurrent execution of communication primitives in the form of time division or time division multiplexing.
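Time-shared execution on a single core resembles cooperative scheduling of coroutines. The sketch below uses Python generators as stand-ins for communication primitives; `primitive` and `run_time_shared` are hypothetical helpers, and each `yield` marks a point where a primitive hands the core back to the scheduler.

```python
from collections import deque
from typing import Deque, Generator, List

Task = Generator[None, None, None]


def primitive(name: str, steps: int, log: List[str]) -> Task:
    """Hypothetical primitive that needs `steps` time slices to finish."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield                        # give the single core back to the scheduler


def run_time_shared(tasks: List[Task]) -> None:
    """Round-robin over the tasks on one core until all have finished."""
    ready: Deque[Task] = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
            ready.append(task)       # not finished: queue it for another slice
        except StopIteration:
            pass                     # finished: drop it


log: List[str] = []
run_time_shared([primitive("OP21", 2, log), primitive("OP22", 1, log)])
```

Only one primitive runs at any instant, yet the slices of OP21 and OP22 interleave, which is the sense of "concurrent" used here.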
  • Figures 11a to 11f show schematic diagrams of setting concurrent communication primitives between serial communication primitives according to an embodiment of the present disclosure.
  • different backgrounds are used for distinction, for example, the "waiting" state is represented by a blank background, the "working" state is represented by a horizontal line background, the "executed" state is represented by a vertical line background, and the "confirmed" state is represented by a gray background.
  • the skipped OP is represented by a dotted box, and the executed OP is represented by a solid box.
  • the boxes “Start”, “End”, “FB” and “FE” are all represented by a blank background, but they only represent some key points of the OP execution, and do not mean that they must also participate in the execution of the OP.
  • the method of the present disclosure further includes: inserting a concurrent start identifier between the concurrent communication primitive and the previous serial communication primitive, so that when the communication primitive queue executes to the concurrent start identifier, the concurrent communication primitive is executed in time-sharing; and inserting a concurrent end identifier between the concurrent communication primitive and the subsequent serial communication primitive, so that when the communication primitive queue executes to the concurrent end identifier, the concurrent communication primitive is re-executed according to the state of the concurrent communication primitive, or the execution of the concurrent communication primitive is exited.
  • OP0, OP1, OP3, OP4 and OP5 are the serial communication primitives described above
  • the difference between the concurrent communication primitives OP21, OP22 and OP23 and the serial communication primitives OP0, OP1, OP3, OP4 and OP5 is that when executing the serial communication primitives, if communication blocking occurs, the sleep operation is performed from the serial communication primitive where the blocking occurs; while when executing concurrent communication primitives, the blocking of one concurrent communication primitive does not affect the execution of other concurrent communication primitives.
  • a flag can be inserted between the two to distinguish them, so as to adopt different execution modes.
  • a concurrent start flag FB (Flag Begin)
  • a concurrent end flag FE (Flag End)
  • ending the execution of the concurrent communication primitives here does not necessarily mean exiting the execution of all concurrent primitives: execution can either be repeated, that is, the blocked concurrent communication primitives are re-executed, or the execution of the concurrent communication primitives can be exited.
  • Figure 11a shows the states of all communication primitives and their relationships when not executed, wherein all communication primitives OP are in a waiting state, represented by a dotted frame.
  • Figure 11b shows a schematic diagram of executing concurrent communication primitives multiple times.
  • executing the concurrent communication primitives in time-sharing manner includes: making the concurrent communication primitives alternately enter a working state from a waiting state.
  • OP21 can be in the "working” state
  • OP22 can be in the "executed” state
  • OP23 can be in the "working” state.
  • the communication primitive can be re-executed according to the following methods and rules.
  • re-executing the concurrent communication primitive includes: in response to not all of the multiple concurrent communication primitives experiencing communication congestion, re-executing the concurrent communication primitive.
  • the communication primitive OP22 is not blocked
  • the concurrent communication primitives OP21, OP22 and OP23 are not all blocked
  • the blocked communication primitives OP21 and OP23 can be executed again. Specifically, after the concurrent communication primitive OP23 is executed, execution reaches the concurrent end identifier FE and then returns to the concurrent communication primitive OP21 to re-execute the concurrent communication primitives that were blocked last time.
  • re-executing the concurrent communication primitive includes: for a concurrent communication primitive that has been partially executed, re-executing only the portion of the concurrent communication primitive that has not been executed.
  • This has several beneficial effects. For example, executing only the part of a communication primitive that has not been executed reduces repeated execution and improves efficiency during recovery.
  • re-executing the concurrent communication primitive according to the state of the concurrent communication primitive further includes: skipping the concurrent communication primitive in the confirmed state without re-executing it.
  • the "confirmed" state of OP22 in the above embodiment is only one implementation, and a primitive does not have to be in the "confirmed" state to be skipped. In essence, as long as the execution of a concurrent communication primitive did not cause communication congestion, its execution is skipped in the second round even if it is not in the "confirmed" state.
  • exiting the execution of the concurrent communication primitive includes: in response to communication congestion occurring in all of the multiple concurrent communication primitives, exiting the execution of the concurrent communication primitive and exiting the execution of the communication primitive queue.
  • OP21, OP22 and OP23 can be regarded as a serial communication primitive as a whole, which forms a serial relationship with the upstream serial communication primitive OP1 and the downstream OP3. Therefore, according to the above description, when a serial communication primitive is blocked in communication, the execution can be exited from the currently blocked serial communication primitive.
  • exiting the execution of the concurrent communication primitive includes: in response to the number of times the concurrent communication primitive is re-executed reaching a predetermined number of times and communication congestion occurs in at least one of the multiple concurrent communication primitives, exiting the execution of the concurrent communication primitive and exiting the execution of the communication primitive queue.
  • a counter may be added at the concurrent end identifier FE.
  • the system may exit and enter sleep mode.
  • a resume identifier is added at the concurrent start identifier to facilitate easy search for the exit location when resuming the execution of the communication primitive.
  • FIG. 11e shows a schematic diagram of coroutine recovery.
  • the concurrent communication primitive in which the communication congestion occurred is re-executed.
  • If at least one of OP21 and OP23 still has a communication blockage, it can be re-executed multiple times from the concurrent end identifier FE, or the execution of the concurrent communication primitives can be exited from the concurrent end identifier FE, and the execution of the entire communication primitive queue can be exited.
  • exiting the execution of the concurrent communication primitive includes: in response to all the concurrent communication primitives being in a confirmed state, exiting the execution of the concurrent communication primitive, and executing the serial communication primitive after the concurrent end identifier.
  • the execution of the concurrent communication primitives can be exited from the concurrent end identifier FE, and the serial communication primitive OP3 can be executed next.
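The decisions taken at the concurrent end identifier FE can be summarized in one routine. This is a sketch under stated assumptions rather than the claimed method: `run_concurrent_block`, the `try_execute` hook and the returned dictionary are hypothetical, and "completed without congestion" stands in for the state check described above.

```python
from typing import Callable, Dict, List


def run_concurrent_block(
    names: List[str],
    try_execute: Callable[[str, int], bool],
    max_rounds: int = 3,
) -> Dict[str, object]:
    """Run the primitives between FB and FE and report how the block was left.

    `try_execute(name, round_no)` is a hypothetical hook that returns True when
    the primitive completes without congestion in that round.
    """
    pending = list(names)                   # primitives that still need executing
    blocked = list(names)
    for round_no in range(max_rounds):      # counter kept at the end identifier FE
        blocked = [n for n in pending if not try_execute(n, round_no)]
        if not blocked:
            # Nothing is blocked: leave the block and run the next serial primitive.
            return {"action": "continue", "resume_at_fb": False, "blocked": []}
        if len(blocked) == len(names):
            break                           # every concurrent primitive is blocked
        pending = blocked                   # next round skips the completed ones
    # Leave the whole queue (sleep) and mark FB so that recovery can find this spot.
    return {"action": "sleep", "resume_at_fb": True, "blocked": blocked}
```

For instance, if OP22 never blocks and OP21 and OP23 clear on the second round, the routine returns `continue`; if OP21 stays blocked until the round limit is reached, it returns `sleep` with the resume flag set.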
  • the execution of the serial communication primitives OP3, OP4 and OP5 has been described above in conjunction with FIG8, and will not be repeated here.
  • the execution of the concurrent communication primitives has been described above in conjunction with FIG11a to FIG11f, and these concurrent communication primitives can be separate or combined with the serial communication primitive as shown in FIG11a to FIG11f.
  • an electronic device comprising: one or more processors; and a memory, wherein the memory stores computer executable instructions, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions.
  • when the computer-executable instructions are executed by one or more processors, the method described above is executed.
  • Table 1 below shows the differences between the technical solution of the present disclosure and the first and second solutions described above.
  • the technical solution disclosed in the present invention uses a software coroutine method to realize the time-sharing reuse capability of the computing core without introducing a hardware multi-threading mechanism, thereby being able to fully utilize the computing core and avoid task deadlock.
  • the coroutine execution process requires relatively small changes to the hardware, and generally supports various SIMD processing architectures in realizing software time-sharing reuse.
  • through the primitive jump mechanism, the asynchronous confirmation of asynchronous communication primitives is supported, and automatic software communication retransmission can be realized without modifying the OP logic.
  • through the alternating execution mechanism, the concurrent execution of multiple communication primitives can be supported, which is similar to the effect of single-core multi-threading and saves the use of computing cores.
  • the solution disclosed in the present invention is sufficient to solve the deadlock problem caused by communication congestion.
  • the electronic equipment or device disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, transportation, household appliances, and/or medical equipment.
  • the transportation includes airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes magnetic resonance imaging, ultrasound machines and/or electrocardiographs.
  • the electronic equipment or device disclosed herein may also be applied to the Internet, IoT, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical fields. Further, the electronic equipment or device disclosed herein may also be used in cloud, edge, and terminal applications related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or devices with high computing power according to the disclosed solution can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smart phones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, so as to complete the unified management, scheduling and collaborative work of end-to-end or cloud-edge-to-end.
  • Clause 1 A method for performing an inter-chip communication task, wherein the inter-chip communication task is described by a communication primitive queue, and the communication primitive queue includes a plurality of communication primitives, the plurality of communication primitives including a serial communication primitive of a serial connection, the method comprising:
  • the communication primitive queue is re-executed starting from the interrupted serial communication primitive.
  • the working status includes:
  • the working state after the waiting state, is used to indicate that the communication primitive is being executed, the communication request of the communication primitive has not been completely sent, and the response signal has not been completely received;
  • the executed state after the working state, is used to indicate that the communication primitive has been executed, the communication request of the communication primitive has been completely sent, and the response signal has not been completely received;
  • the confirmed state, after the executed state, is used to indicate that the communication request of the communication primitive has been completely issued and the response signal has been completely received.
  • the working state is converted according to the execution status of the serial communication primitive, wherein the conversion of the working state is unidirectional.
  • Clause 4 The method of clause 3, wherein, for a single computing core, at most one serial communication primitive is in operation.
  • Clause 5 The method according to clause 3, wherein, in the communication primitive queue, the working state of a subsequent serial communication primitive is prohibited from being behind the working state of a preceding serial communication primitive.
  • Clause 6 The method according to any one of clauses 1-5 further comprises: in response to a serial communication primitive being blocked in communication, exiting execution of the communication primitive queue at the serial communication primitive where the communication is blocked.
  • Clause 7 The method according to any one of clauses 1-6 further comprises: in response to communication congestion of a serial communication primitive, maintaining the working state of the corresponding serial communication primitive in an executed state.
  • Clause 8 The method according to any one of clauses 1 to 7, wherein re-executing the communication primitive queue starting from the interrupted serial communication primitive comprises:
  • Clause 9 The method according to any one of clauses 1-8, wherein the multiple communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further includes: executing the concurrent communication primitives in a time-sharing manner.
  • a concurrent end marker is inserted between the concurrent communication primitive and the next serial communication primitive, so that when the communication primitive queue executes to the concurrent end marker, the concurrent communication primitive is re-executed or the concurrent communication primitive is exited according to the state of the concurrent communication primitive.
  • the concurrent communication primitives are re-executed.
  • re-executing the concurrent communication primitive comprises: for a concurrent communication primitive that has been partially executed, re-executing only the portion of the concurrent communication primitive that has not been executed.
  • Clause 14 The method of clause 12, wherein re-executing the concurrent communication primitives based on the state of the concurrent communication primitive further includes: skipping concurrent communication primitives in a confirmed state without re-executing them.
  • Clause 15 The method of any one of clauses 10-14, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:
  • Clause 16 The method of any one of clauses 10-15, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:
  • the execution of the concurrent communication primitive is exited, and the execution of the communication primitive queue is exited.
  • Clause 17 The method according to clause 15 or 16, wherein, in response to exiting the execution of the concurrent communication primitive, a resume identifier is added at the concurrent start identifier.
  • Clause 18 The method according to Clause 17 further comprises: in response to searching for the recovery identifier, re-executing the concurrent communication primitive where communication congestion occurs.
  • Clause 19 The method of any one of clauses 10-18, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:
  • An electronic device comprising:
  • a memory wherein computer executable instructions are stored in the memory, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method as described in any one of clauses 1-19.
  • Clause 21 A computer-readable storage medium comprising computer-executable instructions, which, when executed by one or more processors, perform the method as described in any one of Clauses 1-19.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer And Data Communications (AREA)

Abstract

A method for executing an inter-chip communication task, a corresponding electronic device and a readable storage medium. An inter-chip communication task is described by means of a communication primitive queue, the communication primitive queue comprising a plurality of communication primitives, and the plurality of communication primitives comprising serial communication primitives which are serially connected. The method comprises: executing a search for a communication primitive queue to determine states of serial communication primitives in the communication primitive queue; and in response to having found an interrupted serial communication primitive, re-executing the communication primitive queue from the interrupted serial communication primitive.

Description

A method for performing inter-chip communication tasks and related products
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 202211589123.4, filed on December 9, 2022, and entitled "A method for performing inter-chip communication tasks and related products".
Technical Field
The present disclosure relates to the field of chips, and more specifically, to the field of inter-chip communication.
Background
How to program software on top of a chip's dedicated, efficient inter-chip communication device is a key issue in achieving highly scalable artificial intelligence network training. Inter-chip communication poses two core problems: one is how to write data to the remote chip, and the other is how the remote chip can detect whether the data it is waiting for has arrived. The latter, communication synchronization, is the main concern of this disclosure. However, using computing cores for polling makes computing resources extremely scarce and eventually leads to communication deadlock. How to avoid communication deadlock with a single computing core while improving the utilization of computing resources is a problem that remains to be solved.
Summary of the Invention
One purpose of the present disclosure is to solve the problem of how to use a single computing core of an artificial intelligence chip to avoid communication deadlock through coroutine programming. A further purpose of the present disclosure is to use a single computing core of an artificial intelligence chip to complete time-division-multiplexed inter-chip communication through coroutine programming and to support concurrent communication tasks.
According to a first aspect of the present disclosure, a method for executing an inter-chip communication task is provided, wherein the inter-chip communication task is described by a communication primitive queue, and the communication primitive queue includes a plurality of communication primitives, the plurality of communication primitives including serial communication primitives connected in series, the method comprising: performing a search of the communication primitive queue to determine the states of the serial communication primitives in the communication primitive queue; and in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
According to a second aspect of the present disclosure, an electronic device is provided, comprising: one or more processors; and a memory, wherein the memory stores computer-executable instructions, and when the computer-executable instructions are executed by the one or more processors, the electronic device executes the method described above.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, comprising computer-executable instructions. When the computer-executable instructions are executed by one or more processors, the method described above is executed.
The technical solution provided by the present disclosure can bring at least one of the following beneficial effects: without introducing a hardware multi-threading mechanism, the time-sharing reuse capability of the computing core is realized by a software coroutine method, so that the computing core can be fully utilized and task deadlock can be avoided. The coroutine execution process requires relatively small changes to the hardware, and generally supports various SIMD (Single Instruction Multiple Data) processing architectures in realizing software time-sharing reuse. In addition, the primitive jump mechanism supports the asynchronous confirmation of asynchronous communication primitives, so automatic software communication retransmission can be realized without modifying the OP (communication primitive) logic. Through the alternating execution mechanism, the concurrent execution of multiple communication primitives can be supported, which is similar to the effect of single-core multi-threading and saves the use of computing cores. The solution of the present disclosure is sufficient to solve the deadlock problem caused by communication congestion.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of the exemplary embodiments of the present disclosure will become readily understood by reading the detailed description below with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals denote the same or corresponding parts, wherein:
FIG. 1 is a schematic structural diagram of a board 10 according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the combined processing device 101 of this embodiment;
FIG. 3 is a schematic diagram of the internal structure of the computing device 201;
FIG. 4 shows the internal architecture of a processor core;
FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure;
FIG. 6 shows an example of coroutine execution according to an embodiment of the present disclosure;
FIG. 7 shows possible changes in the working state of a communication primitive (OP);
FIG. 8 is a schematic diagram of the execution of a serial primitive queue according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a communication primitive involving an in-place operation;
FIG. 10 shows an exemplary application scenario in which there are multiple concurrent communication primitives; and
FIGS. 11a to 11f are schematic diagrams of setting concurrent communication primitives between serial communication primitives according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art on the basis of the embodiments in the present disclosure without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third" and "fourth" in the claims, specification and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. "First", "second", "third" and "fourth" also do not denote only one item; they may denote several. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Current semiconductor manufacturing starts from a complete wafer. A wafer is a thin circular slice of pure silicon, generally available in sizes such as 6 inches, 8 inches and 12 inches. The wafer is cut into small pieces, each of which is called a die. Each die carries a chip and is wired to achieve a specific electrical function. Each die is then packaged as an individual unit. The purpose of packaging is to mount, fix, seal and protect the chip and to enhance its electrical and thermal performance. The contacts of the chip are connected by wires to the pins of the package shell, which completes a chip package structure.
The memory is used to temporarily store the computing data required by the system on chip, as well as the data exchanged with the external memory. In this embodiment, the memory may be high-bandwidth memory (HBM), a high-performance DRAM (Dynamic Random Access Memory) manufactured with a 3D stacking process. It is suitable for applications with high memory bandwidth requirements, such as graphics processors and network switching and forwarding devices (for example, routers and switches).
A system on chip (SoC) refers to a technology that integrates a complete system on a single chip by packaging all or part of the necessary electronic circuits. In this embodiment, the system on chip is mounted on a board. FIG. 1 is a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board 10 includes a combined processing device 101, which is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning in particular is widely applied in cloud intelligence. A notable feature of cloud intelligence applications is the large volume of input data, which places high demands on the storage capacity and computing power of the platform. The board 10 of this embodiment is suited to cloud intelligence applications and provides large off-chip storage, on-chip storage and substantial computing power.
The combined processing device 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. Data to be processed can be passed from the external device 103 to the combined processing device 101 through the external interface device 102, and the computation results of the combined processing device 101 can be sent back to the external device 103 through the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
The board 10 also includes an external memory 104 for storing data, which includes one or more storage units 105. The external memory 104 is connected to, and exchanges data with, a control device 106 and the combined processing device 101 through a bus. The control device 106 on the board 10 is configured to regulate the state of the combined processing device 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
FIG. 2 is a schematic diagram of the combined processing device 101 of this embodiment. As shown in FIG. 2, the combined processing device 101 includes a computing device 201, an interface device 202, a processing device 203 and a DRAM 204. In one application scenario, the computing device 201, the interface device 202 and the processing device 203 are integrated into the aforementioned system on chip. In another application scenario, the computing device 201 itself is the aforementioned system on chip.
The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation, and it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201. Further, the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 can also read data from the storage device of the computing device 201 and transfer it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including but not limited to data movement and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a central processing unit, a graphics processing unit or other general-purpose and/or special-purpose processors. These processors include but are not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. When the computing device 201 and the processing device 203 are considered together, however, they are regarded as forming a heterogeneous multi-core structure.
The DRAM 204 is the aforementioned high-bandwidth memory. It stores the data to be processed, typically has a size of 16G or larger, and holds data of the computing device 201 and/or the processing device 203.
FIG. 3 is a schematic diagram of the internal structure of the computing device 201. The computing device 201 is used to process input data for computer vision, speech, natural language, data mining and the like. The computing device 201 in the figure adopts a multi-core hierarchical design and includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304 and multiple clusters 305.
There may be multiple external storage controllers 301; two are shown in the figure by way of example. They respond to access requests issued by the processor cores and access an external storage device, such as the DRAM 204 in FIG. 2, in order to read data from off-chip or write data to it. The peripheral communication module 302 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to perform tasks. The on-chip interconnect module 303 connects the external storage controller 301, the peripheral communication module 302 and the multiple clusters 305, and transfers data and control signals among these modules. The synchronization module 304 is a global barrier controller (GBC) that coordinates the work progress of the clusters and ensures that information is synchronized. The multiple clusters 305 are the computing cores of the computing device 201; four are shown in the figure by way of example. As hardware develops, the computing device 201 of the present disclosure may also include 8, 16, 64 or even more clusters 305. The clusters 305 are used to execute deep learning algorithms efficiently.
Each cluster 305 includes multiple processor cores (IPU cores) 306 and one storage core (MEM core) 307.
Four processor cores 306 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 306. Their internal architecture is shown in FIG. 4. Each processor core 306 includes three main modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 coordinates and controls the work of the operation module 42 and the storage module 43 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 obtains instructions from the processing device 203, and the instruction decode unit 412 decodes the obtained instructions and sends the decoding results to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 performs vector operations and supports complex operations such as vector multiplication, addition and nonlinear transformation. The matrix operation unit 422 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 43 stores or moves the relevant data and includes a neuron storage unit (Neuron RAM, NRAM) 431, a weight storage unit (Weight RAM, WRAM) 432, an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 433 and a move direct memory access module (MoVe Direct Memory Access, MVDMA) 434. The NRAM 431 stores the input data, output data and intermediate results used in computation by the processor core 306. The WRAM 432 stores the weights of the deep learning network. The IODMA 433 controls memory access between the NRAM 431/WRAM 432 and the DRAM 204 through the broadcast bus 309. The MVDMA 434 controls memory access between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the storage core 307 is mainly used for storage and communication: it stores shared data or intermediate results of the processor cores 306, and carries out communication between a cluster 305 and the DRAM 204, among the clusters 305, and among the processor cores 306. In other embodiments, the storage core 307 has scalar operation capability and performs scalar operations.
The storage core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 310 and a global direct memory access module (Global Direct Memory Access, GDMA) 311. The SRAM 308 acts as a high-performance data relay. Data reused by different processor cores 306 within the same cluster 305 does not have to be fetched from the DRAM 204 by each processor core 306 separately; instead it is relayed among the processor cores 306 through the SRAM 308. The storage core 307 only needs to distribute the reused data quickly from the SRAM 308 to the multiple processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310 and the GDMA 311 are used, respectively, for communication among the processor cores 306, communication among the clusters 305, and data transfer between a cluster 305 and the DRAM 204. Each is described below.
The broadcast bus 309 provides high-speed communication among the processor cores 306 within a cluster 305. The broadcast bus 309 of this embodiment supports unicast, multicast and broadcast as inter-core communication modes. Unicast is point-to-point data transfer (from a single processor core to a single processor core). Multicast transfers one copy of data from the SRAM 308 to several specific processor cores 306. Broadcast transfers one copy of data from the SRAM 308 to all processor cores 306 and is a special case of multicast.
The CDMA 310 controls memory access to the SRAM 308 between different clusters 305 within the same computing device 201. The GDMA 311 cooperates with the external storage controller 301 to control memory access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308.
The term "inter-chip" as used in the present disclosure covers several meanings. A "machine" usually refers to a server computing node host, and "inter-machine communication" may refer to communication between multiple computing node hosts. A "card" usually refers to a dedicated AI (Artificial Intelligence) computing device installed in a server computing node; a card carries one or more chips, such as an MLU (Machine Learning Unit) or a GPU (Graphics Processing Unit). A machine usually has multiple cards, and a single distributed training run may involve multiple machines and multiple cards and chips. Multiple machines and cards are linked by high-speed inter-chip interconnects, such as an inter-chip communication network built on SerDes (serializer/deserializer) or a host-level network based on InfiniBand. In the present disclosure, inter-chip communication includes communication between different chips on different hosts, communication between different chips on the same card, and communication between different chips on multiple cards of the same host.
RDMA (Remote Direct Memory Access) refers to remote DMA: card A can asynchronously write data to, or read data from, the memory of card B without card B performing any operation.
AllReduce operator: in multi-machine, multi-card neural network training, to ensure that data-parallel training across machines and cards converges, each device participating in distributed training needs to pass the gradient information ΔW_i obtained from its own backpropagation (BP) to the other devices, so that every device finally obtains the reduction of all gradient information, that is, ∑ΔW_i. The method by which gradient information is propagated and accumulated is called the AllReduce operator.
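The result that the AllReduce operator must produce can be illustrated with a small reference function. This is only a sketch of the semantics, assuming gradients are plain Python lists and the reduction is a sum; the function name allreduce_sum is hypothetical and no real communication takes place.

```python
def allreduce_sum(local_grads):
    """Reference semantics only: every device ends up with the sum of all gradients.

    local_grads[i] is the gradient vector (a list of numbers) held by device i.
    """
    # Element-wise sum across devices, i.e. the reduction of all gradient vectors.
    total = [sum(values) for values in zip(*local_grads)]
    # Each device receives its own copy of the reduced result.
    return [list(total) for _ in local_grads]


if __name__ == "__main__":
    grads = [[1, 2], [3, 4], [5, 6]]  # three devices, two parameters each
    print(allreduce_sum(grads))
```

Any concrete AllReduce implementation, whatever its topology, should agree with this reference on the final per-device values.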
Ring AllReduce algorithm: the AllReduce operator can be implemented on different network topologies; the implementation optimized for a ring topology uses the Ring AllReduce algorithm. From the perspective of a single card, the core steps AllReduce has to perform are: receive (R), compute (C) and send (S). In the Ring AllReduce algorithm, the R step receives the gradient information ΔW_(i-1) sent by the upstream device, the C step computes ΔW_i = Add(ΔW_(i-1), ΔW_i), and the S step sends the updated gradient information ΔW_i to the downstream device.
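The receive, compute and send steps can be simulated in a single process. The sketch below assumes the commonly used chunked schedule (a reduce-scatter pass followed by an all-gather pass); the text above only describes the per-card R, C and S roles, so the chunking, the function name ring_allreduce and the list-based data layout are assumptions made for illustration.

```python
def ring_allreduce(chunks):
    """Simulate Ring AllReduce for n devices that each hold n chunks.

    chunks[i][c] is device i's local value for chunk c. Device i always
    sends to its downstream neighbour (i + 1) % n.
    """
    n = len(chunks)
    data = [list(row) for row in chunks]

    # Pass 1 (reduce-scatter): each step is S on the sender, then R and C on the receiver.
    for step in range(n - 1):
        # Collect all sends first so that every device sends "simultaneously".
        sent = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, c, value in sent:
            dst = (src + 1) % n
            data[dst][c] += value  # receive the upstream value and accumulate it

    # Pass 2 (all-gather): circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sent = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, c, value in sent:
            data[(src + 1) % n][c] = value  # receive and overwrite

    return data


if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

After the first pass, device i holds the complete sum for chunk (i + 1) mod n; the second pass only copies those sums, so no further computation is needed.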
Synchronization problem: in RDMA mode, when the computing core of card A writes a data payload into a memory region of card B, the computing core of card B cannot tell whether the payload has been completely written. If the subsequent steps of card B's computing core depend on the arrival of that payload, the computing core of card B needs to detect its arrival. The process of detecting the arrival of the data payload is called communication synchronization.
Communication deadlock problem: suppose card A and card B each have one computing core, and there are two dual-card communication tasks X and Y, which are dispatched to the two cards as X_A and X_B, and Y_A and Y_B, respectively. A communication task can proceed normally only when both ends of the communication are running the same task. There can then be a moment at which the computing core on card A is occupied by X_A and the computing core on card B is occupied by Y_B, with both X_A and Y_B in communication synchronization, polling for data to arrive. Because the tasks do not match, communication tasks X and Y will wait forever, which results in a communication deadlock.
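The deadlock scenario can be reproduced with a toy model. The sketch assumes each card has a single core that runs the task at the head of its queue and never yields; the function name run_blocking and the max_polls bound (used only so the simulation terminates) are hypothetical.

```python
def run_blocking(queue_a, queue_b, max_polls=1000):
    """Model two single-core cards whose tasks never yield while polling.

    queue_a and queue_b list the tasks in the order each card starts them.
    A task completes only when both cards are running the same task.
    """
    qa, qb = list(queue_a), list(queue_b)
    polls = 0
    while qa and qb:
        if qa[0] == qb[0]:
            # Both ends run the same task, so the communication completes.
            qa.pop(0)
            qb.pop(0)
        else:
            # Mismatched tasks: both cores keep polling and nothing changes.
            polls += 1
            if polls >= max_polls:
                return "deadlock"
    return "done"


if __name__ == "__main__":
    print(run_blocking(["X", "Y"], ["X", "Y"]))  # matching order
    print(run_blocking(["X", "Y"], ["Y", "X"]))  # X_A faces Y_B
```

With mismatched heads the state never changes between polls, which is exactly why a mechanism that lets a waiting task give up the core is needed.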
At present, communication synchronization usually involves the following approaches.
The first approach is polling, based on the hardware multi-threaded programming model (Single Instruction, Multiple Threads, SIMT). The communication synchronization steps are as follows:
Step a. Device A writes the data (Data) and then a flag (Flag), in that order, into a designated memory region of device B via RDMA.
Step b. Within its communication receive task, the computing core of device B polls whether the Flag has changed. If the Flag remains unchanged, go to step c; if the Flag has changed, go to step d.
Step c. If the Flag remains unchanged, the Data has not finished transferring. The thread can then be switched out because it is waiting to no effect, releasing the computing core to other computing or communication tasks. With hardware multi-threading, the context (including the current program execution pointer, stack information and register information) is saved when the thread is switched out and restored when the thread is switched back in, and execution resumes from the point at which the thread was suspended. When the thread resumes, if it reads that the Flag has changed, go to step d; otherwise continue polling in step c.
Step d. If the Flag has changed, the Data has finished transferring, and the computing core of device B can now safely read the Data, for example to perform a Reduce computation.
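Steps a to d can be sketched with two Python threads standing in for the two devices. This is a minimal illustration, not the hardware mechanism: the class DeviceBMemory and the functions rdma_write and receive_and_reduce are hypothetical names, time.sleep stands in for the thread being switched out, and real hardware would additionally need a guarantee that the Data write becomes visible before the Flag write.

```python
import threading
import time


class DeviceBMemory:
    """Stand-in for the designated memory region on device B."""

    def __init__(self):
        self.data = None
        self.flag = 0


def rdma_write(mem, payload):
    """Step a: device A writes Data first and the Flag last."""
    mem.data = list(payload)
    mem.flag = 1


def receive_and_reduce(mem, local):
    """Steps b to d on device B's computing core."""
    while mem.flag == 0:   # step b: poll the Flag
        time.sleep(0.001)  # step c: give up the core while waiting
    # Step d: the Flag changed, so Data is complete and can be reduced.
    return [x + y for x, y in zip(mem.data, local)]


if __name__ == "__main__":
    mem = DeviceBMemory()
    writer = threading.Thread(target=rdma_write, args=(mem, [1, 2, 3]))
    writer.start()
    print(receive_and_reduce(mem, [10, 20, 30]))
    writer.join()
```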
The first approach is mainly based on SIMT (Single Instruction, Multiple Threads) and requires hardware multi-threading support. Its advantage is that software developers write multi-threaded kernels from the warp perspective and are unaware of the switching.
The second approach is interrupt-based; see the Chinese patent application published as CN114691312A. The communication synchronization steps of the interrupt-based approach are as follows:
Step a. Device A writes a hardware descriptor and the Data to device B via RDMA.
Step b. Device B receives the hardware descriptor, which indicates that the Data has been completely received. A data-reception-complete interrupt causes the hardware to parse the hardware descriptor, and the computing core is then triggered, according to the content of the descriptor, to execute a computing task such as a Reduce computation.
The second approach is mainly based on SIMD (Single Instruction, Multiple Data). It supports reuse of the computing core through software and requires only minimal hardware changes, which avoids large-scale hardware replacement or modification.
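The interrupt-based flow can be pictured as a handler that is invoked once reception completes. The sketch below is purely illustrative: the descriptor fields "task" and "data_addr", the HANDLERS table and the function names are all hypothetical, since the layout of the hardware descriptor is not given here.

```python
def reduce_handler(received, local):
    """Example computing task: element-wise Reduce (sum) with local data."""
    return [r + l for r, l in zip(received, local)]


# Hypothetical mapping from a descriptor's task field to a computing task.
HANDLERS = {"reduce": reduce_handler}


def on_receive_complete(descriptor, device_memory, local):
    """Step b: parse the descriptor, then trigger the task it names."""
    payload = device_memory[descriptor["data_addr"]]  # Data written in step a
    return HANDLERS[descriptor["task"]](payload, local)


if __name__ == "__main__":
    memory_b = {0x100: [1, 2]}  # Data placed by device A
    desc = {"task": "reduce", "data_addr": 0x100}
    print(on_receive_complete(desc, memory_b, [10, 20]))
```

In contrast to polling, the computing core does no work until the handler runs, so it is never occupied by a task that is merely waiting.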
Specific embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure. The inter-chip communication task is described by a communication primitive queue that includes multiple communication primitives, and the multiple communication primitives include serial communication primitives connected in series. The method includes: in operation S510, searching the communication primitive queue to determine the states of the serial communication primitives in the queue; and in operation S520, in response to finding a serial communication primitive at which an interruption occurred, re-executing the communication primitive queue starting from that interrupted serial communication primitive.
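Operations S510 and S520 can be sketched as a scan followed by a resume. This is a minimal illustration under stated assumptions: the State names, the representation of a primitive as a callable that returns True on completion, and the function names are hypothetical and are not taken from the disclosure.

```python
from enum import Enum


class State(Enum):
    PENDING = 0
    INTERRUPTED = 1
    DONE = 2


def find_interrupted(states):
    """S510: search the queue and return the index of the first interrupted primitive."""
    for index, state in enumerate(states):
        if state is State.INTERRUPTED:
            return index
    return None


def resume_queue(primitives, states):
    """S520: re-execute the queue starting from the interrupted primitive.

    Each primitive is a callable returning True when it completes and
    False when it has to be interrupted again (for example, data not ready).
    """
    start = find_interrupted(states)
    if start is None:
        return []
    executed = []
    for index in range(start, len(primitives)):
        if not primitives[index]():
            states[index] = State.INTERRUPTED  # stop here; a later search resumes from it
            break
        states[index] = State.DONE
        executed.append(index)
    return executed


if __name__ == "__main__":
    ops = [lambda: True, lambda: True, lambda: True]
    st = [State.DONE, State.INTERRUPTED, State.PENDING]
    print(resume_queue(ops, st), st)
```

Primitives before the interrupted one are not re-run, which matches the idea of restarting from the point of interruption rather than from the head of the queue.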
First, the above technical features of the present disclosure are executed in a "coroutine" mode. A coroutine is a non-preemptive scheduling mechanism in which the software itself actively wakes up and goes to sleep. A thread, by contrast, is a preemptive scheduling mechanism in which the software is passively woken up and put to sleep by the operating system or by hardware scheduling.
For example, in multi-threaded mode, the code can be:
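The listing itself does not survive in this text. A minimal sketch of what such multi-threaded code might look like is given here, assuming Python's threading module as a stand-in for hardware threads; read(X) is the only name taken from the surrounding description, and task and run_threads are hypothetical.

```python
import threading


def read(x):
    """Stand-in for a potentially long-running read of X."""
    return sum(x)


def task(x, results):
    # The thread contains no explicit yield: the operating system may
    # preempt it anywhere, including in the middle of read(x).
    results.append(read(x))


def run_threads(inputs):
    results = []
    threads = [threading.Thread(target=task, args=(x, results)) for x in inputs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(run_threads([[1, 2], [3, 4]])))
```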
Under hardware multithreading, the thread running the above code may be switched out automatically by the operating system during a call such as read(X) because its time slice has expired, so that other functions can execute, and it resumes execution some time later. The thread is unaware of being switched out and switched back in.
In coroutine mode, the code can be:
In coroutine mode, the coroutine actively puts itself to sleep and wakes up other functions that are in the Sleep state.
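The cooperative behaviour described above can be illustrated with a minimal sketch, assuming Python generators stand in for coroutines. The names comm_task and run_round_robin are hypothetical and do not come from the disclosure; the `yield` statement plays the role of the voluntary Sleep.

```python
from collections import deque

def comm_task(name, steps, log):
    """A cooperative task: it decides for itself when to give up the core."""
    for i in range(steps):
        log.append(f"{name}:step{i}")
        # Voluntary sleep: control returns to the scheduler only at this point.
        yield

def run_round_robin(tasks):
    """Wake sleeping tasks in turn until every task has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # wake the task; it runs until its next yield
            ready.append(task)  # it went back to sleep, so queue it again
        except StopIteration:
            pass                # the task ran to completion

log = []
run_round_robin([comm_task("A", 2, log), comm_task("B", 1, log)])
print(log)
```

Unlike a preemptive thread, each task here is switched out only where it yields, so the switch points are fully visible in the task's own code.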
For the wake-up/sleep coroutine mode, see the Chinese patent application with publication number CN114691312A.
Taking Ring AllReduce as an example, communication primitives such as SEND, RecvReduceSEND, RecvReduceCopySEND, RecvCopySEND and Recv are executed during communication; in this application they may be abbreviated as OP. It should be understood that the terms "communication primitive" and "OP" are used interchangeably herein, and the choice between them merely follows the surrounding context.
The communication primitives of the present disclosure may be asynchronous communication primitives. Asynchronous communication primitives may have the following characteristics:
Asynchronous: a communication primitive executes asynchronously. That is, when the communication primitive function returns, the communication operation it represents may still be executing on the hardware, and the primitive can be confirmed as fully completed only after an asynchronous response is received.
Non-idempotent: executing a communication primitive repeatedly may produce an incorrect result, so a communication primitive generally cannot be executed twice, unless its previous execution was left incomplete because of a communication error.
Unreliable: a communication primitive may lose packets because of link quality or similar problems, so the primitive needs to retransmit the lost portion.
In the present disclosure, a communication primitive queue may be used to describe inter-chip communication tasks, which may be executed serially or in parallel. For ease of description, communication primitives connected in series are referred to herein as serial communication primitives, and communication primitives connected in parallel are referred to as concurrent communication primitives. A communication queue may include a plurality of serial communication primitives, or a mixture of serial and concurrent communication primitives, as described in more detail below.
FIG. 6 shows an example of coroutine execution according to one embodiment of the present disclosure.
In FIG. 6, the communication process can be abstracted as the serial execution of a sequence of communication primitives OP, with START as its initial state and FINISH as its final state. Under normal circumstances, if no communication blocking is encountered, the coroutine executes communication primitives OP0 to OP5 one by one from START until FINISH is reached. For ease of identification, communication primitives OP that are skipped and not executed are drawn with dashed boxes in FIG. 6, and communication primitives OP that are actually executed are drawn with solid boxes.
In real communication, however, "communication blocking" is commonly encountered. In that case the computing core actively calls Sleep and exits, and the variables of the execution context remain resident in a specific storage area. As shown under "Coroutine Sleep" in FIG. 6, assume that communication primitives OP0 and OP1 have been executed successfully but blocking occurs at communication primitive OP2; the subsequent communication primitives OP3, OP4 and OP5 are then not executed. The computing core can turn to other communication tasks in the meantime.
As shown under "Coroutine Resume" in FIG. 6, when the blocking is released, that is, when communication synchronization completes, execution can continue from communication primitive OP2, where the last Sleep occurred. If no further blocking occurs, execution proceeds directly to FINISH and the task of the computing core ends.
To continue from the previous communication primitive OP when the blocking is released, the software must restore the context itself. Context restoration consists of two parts:
1) Restoration of context data: this only requires reloading the contents of the specific storage area.
2) Restoration of the program execution position: to keep the hardware design simple, the hardware does not support resuming directly at the position of the communication primitive OP at which Sleep exited. This solution therefore provides a communication primitive jump mechanism.
Accordingly, in one embodiment of the present disclosure, to implement the jump mechanism described above, the method further includes: defining a state machine that describes the working state of a communication primitive; and determining, according to the working state described by the state machine, whether the communication primitive has been interrupted; wherein the working states include: a waiting state; a working state; an executed state; and a confirmed state.
FIG. 7 shows the possible changes in the working state of a communication primitive OP.
As shown in FIG. 7, a communication primitive OP that has not been executed is in the waiting (Pending) state, which indicates that the communication primitive has not been executed. All communication primitives OP may initially be in the waiting state, and their state changes only once they are executed.
The working (Working) state follows the waiting state and indicates that the communication primitive is being executed: its communication request has not been completely issued, and the response signal has not been completely received. In other words, it is the state of a communication primitive OP while the computing core is executing it. For a single computing core, at most one communication primitive is in the "working" state. Under normal circumstances, every communication primitive OP moves from the waiting state to the working state, unless execution of that OP is skipped.
The executed (Executed) state follows the "working" state and indicates that the communication primitive has been executed: its communication request has been completely issued, but the response signal has not yet been completely received. Once a communication primitive OP has finished executing and its communication request has been completely issued, the OP waits for the response signal to that request. In a number of situations, however, the OP may fail to receive the response signal, or the response signal may be significantly delayed. These situations include, but are not limited to: the request signal does not reach the downstream communication primitive OP (for example, because of a communication line failure); the request signal reaches the downstream OP but the downstream OP does not issue a response signal (for example, because the downstream OP has failed); or the downstream OP issues a response signal but the OP that issued the request does not receive it (for example, because of a communication line failure). These are only a few examples; many other types of failure may also prevent an OP from receiving a response signal and are not listed exhaustively here.
The confirmed (Confirmed) state follows the executed state and indicates that the communication request of the communication primitive has been completely issued and the response signal has been completely received. The confirmed state is the last of the four states in the present disclosure.
In the present disclosure, assume that all communication primitives are initially in the waiting state. After the communication primitive queue is started, if no interruption occurs, each communication primitive in the queue moves from the waiting state into the working state one by one. The term "one by one" here denotes serial execution of the communication primitives: for a single computing core, at most one communication primitive is in the working state. In other words, while the computing core is executing one communication primitive, it does not execute any other communication primitive at the same time; the next communication primitive is executed only after execution of the current one has exited.
According to an embodiment of the present disclosure, the working state is changed according to the execution status of the communication primitive, wherein the change of working state is unidirectional.
In this embodiment, a communication primitive changes state in the order waiting, working, executed, confirmed. That is, the state of a communication primitive can only change from waiting to working, from working to executed, and from executed to confirmed, and cannot change in the reverse direction.
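The four states and their one-way ordering can be sketched as follows. This is an illustrative encoding only: the names OpState and advance are assumptions, not identifiers from the disclosure.

```python
from enum import IntEnum

class OpState(IntEnum):
    """Working states of a communication primitive, in their only legal order."""
    PENDING = 0    # waiting: not yet executed
    WORKING = 1    # request not fully issued, response not fully received
    EXECUTED = 2   # request fully issued, response not fully received
    CONFIRMED = 3  # request fully issued and response fully received

def advance(state):
    """Return the next state; a primitive's own state only moves forward, one step at a time."""
    if state is OpState.CONFIRMED:
        raise ValueError("CONFIRMED is the final state")
    return OpState(state + 1)

state = OpState.PENDING
history = [state]
while state is not OpState.CONFIRMED:
    state = advance(state)
    history.append(state)
print([s.name for s in history])
```

Encoding the states as ordered integers makes "earlier" and "later" states directly comparable, which is convenient for the queue-level rules that follow.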
It should also be understood that "state" in this application has two meanings: the state of a communication primitive and the state of the state machine. A change in the state of a communication primitive can cause a corresponding change in the state of the state machine, but in practice the state machine does not necessarily change in step with the communication primitive. For example, the state machine may scan the communication primitive queue periodically and update its own state according to the states it finds; in that case the state of a communication primitive may already have changed while the state of the state machine has not. Moreover, because the scans may be separated by a time interval, the state recorded in the state machine can skip states, for example moving directly from "waiting" to "executed" or to "confirmed", whereas the state of the communication primitive itself cannot skip.
In addition, according to an embodiment of the present disclosure, within the communication primitive queue the working state of a later serial communication primitive is prohibited from being after the working state of an earlier serial communication primitive; in other words, the working state of an earlier serial communication primitive is prohibited from being before the working state of a later serial communication primitive.
Still taking FIG. 6 and FIG. 7 as an example, assume that communication primitives OP1 to OP5 are serial communication primitives executed in sequence. In this embodiment, taking communication primitive OP3 in the "executed" state as the reference, the communication primitives OP1 and OP2 before OP3 may be in the executed state or the confirmed state but are prohibited from being in the waiting state or the working state; that is, the states of OP1 and OP2 cannot be before the state of OP3. The communication primitives OP4 and OP5 after OP3 may be in the waiting state, the working state or the executed state but cannot be in the confirmed state; that is, the states of OP4 and OP5 cannot be after the state of OP3.
Note that the preceding paragraph uses communication primitive OP3 only as a reference point; the rule applies throughout the communication primitive queue. Note also that, as stated above, only one communication primitive can be in the working state. In a serial communication primitive queue, therefore, the communication primitives before the one in the working state must be in the executed state or the confirmed state and cannot themselves be in the working state.
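The two queue-level rules above (no later primitive ahead of an earlier one, and at most one primitive in the working state) can be checked mechanically. The following is a minimal sketch; OpState and is_valid_serial_queue are hypothetical names introduced for illustration.

```python
from enum import IntEnum

class OpState(IntEnum):
    PENDING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3

def is_valid_serial_queue(states):
    """Check the ordering rules for a serial communication primitive queue.

    states: list of OpState values in queue order (front of the queue first).
    """
    # Rule 1: a later primitive's state must not be after an earlier primitive's state.
    ordered = all(later <= earlier for earlier, later in zip(states, states[1:]))
    # Rule 2: for a single computing core, at most one primitive is WORKING.
    one_working = sum(s is OpState.WORKING for s in states) <= 1
    return ordered and one_working

C, E, W, P = OpState.CONFIRMED, OpState.EXECUTED, OpState.WORKING, OpState.PENDING
print(is_valid_serial_queue([C, C, E, E, P]))  # OP3 executed, OP4 executed, OP5 waiting
print(is_valid_serial_queue([E, C]))           # later primitive is ahead of an earlier one
print(is_valid_serial_queue([W, W]))           # two primitives working at once
```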
With the above embodiments, the states of the communication primitives never become inverted, which makes it convenient to put execution of the communication primitive queue to sleep when a communication primitive is blocked.
FIG. 8 is a schematic diagram of the execution of a serial primitive queue according to one embodiment of the present disclosure. In FIG. 8, different backgrounds distinguish the states: the "waiting" state has a blank background, the "working" state a horizontal-line background, the "executed" state a vertical-line background, and the "confirmed" state a gray background. In addition, a dashed box denotes a skipped OP, a solid box denotes an executed OP, and a bold solid box denotes the OP currently selected and being executed.
According to one embodiment of the present disclosure, the method further includes: in response to communication blocking occurring at a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive where the blocking occurs.
In step a, as shown in the "Start" row of FIG. 8, the queue is first searched from front to back for an OP in the "waiting" state, and that OP is changed to the "working" state; when the OP finishes executing, it is changed to the "executed" state. Step a is repeated in this way until communication blocking occurs. As shown in FIG. 8, assuming that communication blocking occurs at OP3, a sleep operation is performed at OP3. Then, after the hardware wakes up the communication primitive queue, step b is executed. In the situation shown in FIG. 8, after a period of execution OP0 to OP4 may all be in the "executed" state, while OP5 to OP7, not yet having been selected for execution, remain in the "waiting" state.
According to one embodiment of the present disclosure, the method further includes: in response to communication blocking occurring at a serial communication primitive, keeping the working state of that serial communication primitive in the executed state.
After running for some time, OP0 to OP2 have all successfully received their response signals and therefore enter the "confirmed" state, whereas communication blocking has occurred at OP3, so a sleep operation is performed at OP3. As stated above, the working state of a later serial communication primitive is prohibited from being after that of an earlier one, so OP4 also remains in the "executed" state and cannot proceed to the "confirmed" state.
In step b, on wake-up, that is, on resumption, it can be determined that some OPs have already changed to the "confirmed" state through asynchronous confirmation (OP0 to OP2 in the "Resume" row). The queue is then searched from front to back for OPs in the "executed" state; because these OPs encountered a communication error, they need to be re-executed (OP3 and OP4 in the "Resume" row). Once all OPs in the "executed" state have been searched (reaching OP5 in the "Resume" row), step a is repeated. If all OPs are in the "confirmed" state, step c is entered.
It can be understood that, after resumption, if all communication primitives execute normally, the subsequent OPs (for example OP5 to OP7) are executed one by one until every OP has been executed.
In step c, all OPs have been executed and the asynchronous operations have been confirmed as returned, so the end state can be entered (last row).
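Steps a to c can be summarized as a scan that classifies each primitive on wake-up. The sketch below is illustrative only: plan_resume and the string state names are assumptions, and treating a primitive found in the "working" state as needing resumption is likewise an assumption rather than something stated for FIG. 8.

```python
def plan_resume(states):
    """Classify queue positions on wake-up, scanning from front to back.

    states: list of 'pending', 'working', 'executed' or 'confirmed', in queue order.
    Returns (skip, redo, fresh, finished).
    """
    skip, redo, fresh = [], [], []
    for index, state in enumerate(states):
        if state == "confirmed":
            skip.append(index)    # jump over: asynchronously confirmed already
        elif state in ("executed", "working"):
            redo.append(index)    # step b: no response was confirmed, run again
        elif state == "pending":
            fresh.append(index)   # step a: not executed yet
        else:
            raise ValueError(f"unknown state: {state!r}")
    finished = len(skip) == len(states)  # step c: everything is confirmed
    return skip, redo, fresh, finished

# The "Resume" row of FIG. 8: OP0-OP2 confirmed, OP3-OP4 executed, OP5-OP7 waiting.
resume_row = ["confirmed"] * 3 + ["executed"] * 2 + ["pending"] * 3
print(plan_resume(resume_row))
```

Because the scan always restarts from the front of the queue, no saved program counter is needed; the per-primitive states alone determine where execution continues.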
The state changes of the different OPs as execution proceeds have been described above with reference to FIG. 8. According to one embodiment of the present disclosure, when execution of the OPs is resumed, for example from OP3 as shown in FIG. 8, re-executing the communication primitive queue starting from the interrupted serial communication primitive includes: for a serial communication primitive that has already been partially executed, re-executing only the portion of that serial communication primitive that has not yet been executed.
It can be understood that, as described above, communication primitives in the queue that have been executed and are in the "confirmed" state are not executed again when the queue is resumed; only those communication primitives that have not been executed, or were executed incompletely, are executed. According to a preferred embodiment of the present disclosure, if a communication primitive is blocked after part of it has been executed, its execution is incomplete; for such an incompletely executed communication primitive, the part already executed is not executed again, and only the part not yet executed is executed.
This has several benefits. For example, executing only the portion of a communication primitive that has not yet been executed reduces repeated execution of communication primitives and improves efficiency on resumption.
In addition, for in-place operations, executing only the portion of a communication primitive that has not yet been executed avoids data errors. An in-place operation is one performed at the storage location where the operand data originally resides. FIG. 9 is a schematic diagram of a communication primitive involving an in-place operation.
As shown in FIG. 9, assume that the original data of an in-place operation is a 4*4 data matrix whose row 0 is {0,0,0,0}, row 1 is {1,1,1,1}, row 2 is {2,2,2,2} and row 3 is {3,3,3,3}. When the in-place operation is performed, this original data is overwritten. Assume that the data has undergone only part of the operation: the original data in rows 0 and 1 has been overwritten by new data, for example {0,1,2,3} and {3,2,1,0} respectively, while rows 2 and 3 were not operated on because of communication blocking and still hold the original data. In this case, if the operation were applied to all of the data again on resumption, an error could occur because some of the data has already been updated; if only the portion not yet executed is executed, this error does not occur.
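The FIG. 9 scenario can be reproduced numerically with a minimal sketch. It assumes, purely for illustration, that the in-place operation is an element-wise accumulation of a received buffer; the function accumulate_rows and the contents of incoming are hypothetical and were chosen so that rows 0 and 1 become {0,1,2,3} and {3,2,1,0} as in the text.

```python
import copy

def accumulate_rows(buf, incoming, start, stop):
    """Add incoming[r] into buf[r] in place for rows start..stop-1."""
    for r in range(start, stop):
        for c in range(len(buf[r])):
            buf[r][c] += incoming[r][c]

original = [[0] * 4, [1] * 4, [2] * 4, [3] * 4]
incoming = [[0, 1, 2, 3], [2, 1, 0, -1], [1, 1, 1, 1], [1, 1, 1, 1]]

# Reference result: the operation runs once over all rows without blocking.
expected = copy.deepcopy(original)
accumulate_rows(expected, incoming, 0, 4)

# Blocking occurs after rows 0 and 1 have been overwritten.
partial = copy.deepcopy(original)
accumulate_rows(partial, incoming, 0, 2)

# Correct resumption: only the rows not yet processed.
resumed = copy.deepcopy(partial)
accumulate_rows(resumed, incoming, 2, 4)

# Incorrect resumption: all rows again, so rows 0 and 1 are accumulated twice.
redone = copy.deepcopy(partial)
accumulate_rows(redone, incoming, 0, 4)

print(resumed == expected, redone == expected)
```

The comparison shows why the progress within a partially executed primitive has to be remembered: the overwritten rows no longer hold the inputs the operation expects.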
The scenarios above involve serial execution of multiple OPs, but multiple OPs may also be executed concurrently.
FIG. 10 shows an exemplary application scenario with multiple concurrent communication primitives.
As shown in FIG. 10, an upstream communication node A is connected to two downstream communication nodes B and C. Communication primitive OP1 is executed between nodes A and B, and communication primitive OP2 is executed between nodes A and C. These two communication primitives can be concurrent because OP1 and OP2 use different communication paths and do not interfere with each other. In this case, concurrent execution of the communication primitives can be achieved through the coroutine and primitive jump mechanism described above. According to one embodiment of the present disclosure, concurrent execution means that different communication primitives can be executed alternately, giving an execution style similar to multithreading within a single core. With this mechanism, a single computing core can carry out the concurrent execution of multiple different communication primitives, and multiple computing cores are not required.
According to one embodiment of the present disclosure, the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further includes: executing the concurrent communication primitives in a time-sharing manner.
It should be understood that "concurrent" here does not mean executed at the same instant; rather, the communication primitives may be executed concurrently by time sharing or time-division multiplexing.
FIGS. 11a to 11f are schematic diagrams of placing concurrent communication primitives between serial communication primitives according to one embodiment of the present disclosure. As in FIG. 7 and FIG. 8, different backgrounds distinguish the states: the "waiting" state has a blank background, the "working" state a horizontal-line background, the "executed" state a vertical-line background, and the "confirmed" state a gray background. In addition, a dashed box denotes a skipped OP and a solid box denotes an executed OP. It should be understood that the boxes "Start", "End", "FB" and "FE" are also drawn with a blank background, but they merely mark key points in the execution of the OPs and do not necessarily take part in that execution.
According to one embodiment of the present disclosure, the method further includes: inserting a concurrency start flag between the concurrent communication primitives and the preceding serial communication primitive, so that when execution of the communication primitive queue reaches the concurrency start flag, the concurrent communication primitives are executed in a time-sharing manner; and inserting a concurrency end flag between the concurrent communication primitives and the following serial communication primitive, so that when execution of the communication primitive queue reaches the concurrency end flag, the concurrent communication primitives are either re-executed or exited, depending on their states.
As shown in FIG. 11a, assume that OP0, OP1, OP3, OP4 and OP5 are serial communication primitives as described above, and that three concurrent communication primitives OP21, OP22 and OP23 lie between OP1 and OP3. If OP21, OP22 and OP23 are regarded as a whole (denoted OP2), then OP0, OP1, OP2, OP3, OP4 and OP5 still form a serial communication primitive queue that obeys the rules described above. If OP21, OP22 and OP23 are treated individually, however, they can be executed in a time-sharing manner.
The concurrent communication primitives OP21, OP22 and OP23 differ from the serial communication primitives OP0, OP1, OP3, OP4 and OP5 as follows. When serial communication primitives are executed and communication blocking occurs, a sleep operation is performed at the serial communication primitive where the blocking occurs. When concurrent communication primitives are executed, blocking of one concurrent communication primitive does not affect execution of the other concurrent communication primitives.
Still referring to FIG. 11a, to distinguish serial communication primitives from concurrent communication primitives, flags can be inserted between them so that different execution modes are adopted. Specifically, a concurrency start flag FB (Flag Begin) can be inserted between the serial communication primitive OP1 and the concurrent communication primitives OP21, OP22 and OP23; when execution reaches the concurrency start flag FB, the communication primitives after FB are automatically executed concurrently. Similarly, a concurrency end flag FE (Flag End) can be inserted between the concurrent communication primitives OP21, OP22 and OP23 and the serial communication primitive OP3; when execution reaches the concurrency end flag FE, the current round of concurrent communication primitive execution ends.
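One way to picture the role of the FB and FE flags is to split a flat queue into serial primitives and concurrent groups. The sketch below does exactly that; split_segments, the string markers "FB" and "FE", and the rejection of nested groups are illustrative assumptions rather than details taken from the disclosure.

```python
def split_segments(queue):
    """Split a flat primitive queue into ('serial', [op]) and ('concurrent', [ops]) segments."""
    segments = []
    group = None  # None while outside an FB ... FE region
    for item in queue:
        if item == "FB":
            if group is not None:
                raise ValueError("FB found inside an open concurrent group")
            group = []
        elif item == "FE":
            if group is None:
                raise ValueError("FE found without a matching FB")
            segments.append(("concurrent", group))
            group = None
        elif group is not None:
            group.append(item)  # primitive between FB and FE
        else:
            segments.append(("serial", [item]))
    if group is not None:
        raise ValueError("FB found without a matching FE")
    return segments

# The queue of FIG. 11a, written as a flat list.
fig_11a = ["OP0", "OP1", "FB", "OP21", "OP22", "OP23", "FE", "OP3", "OP4", "OP5"]
for kind, ops in split_segments(fig_11a):
    print(kind, ops)
```

Viewed this way, the concurrent group occupies a single position in the serial queue, matching the description of OP21 to OP23 being regarded as one OP2.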
It should be understood that "the current round of concurrent communication primitive execution ends" does not necessarily mean that execution of all the concurrent primitives is exited. Depending on their states, the concurrent communication primitives may be executed again, that is, the blocked concurrent communication primitives are re-executed, or execution of the concurrent communication primitives may be exited.
图11a示出了未执行时所有通信原语的状态和相互之间的关系,其中所有通信原语OP均处于等待状态,以虚线框表示。图11b示出了多次执行并发通信原语的示意图。Figure 11a shows the states of all communication primitives and their relationships when not executed, wherein all communication primitives OP are in a waiting state, represented by a dotted frame. Figure 11b shows a schematic diagram of executing concurrent communication primitives multiple times.
根据本公开的一个实施方式,分时地执行所述并发通信原语包括:使得并发通信原语从等待状态交替进入工作中状态。According to one embodiment of the present disclosure, executing the concurrent communication primitives in time-sharing manner includes: making the concurrent communication primitives alternately enter a working state from a waiting state.
如图11b所示,假设串行通信原语OP0和OP1已经执行完毕并且处于“确认”状态,之后经过并发开始标识FB后,进入到并发通信原语OP21,OP22和OP23的执行,可以使得OP21至OP23交替地(或者分时地)进入“工作中”状态,但与串行通信原语的执行不同的是,并发通信原语OP21至OP23由于是并发的,因此他们之间的状态彼此独立。例如,如图11b所示,在首次执行并发通信原语OP21至OP23后,OP21可以处于“工作中”状态,OP22可以处于“已执行”状态,而OP23可以处于“工作中”状态。假设此时并发通信原语OP2处发生了阻塞,那么可以根据如下方式和规则来重新执行通信原语。As shown in Figure 11b, assuming that the serial communication primitives OP0 and OP1 have been executed and are in the "confirmed" state, after passing through the concurrent start mark FB, the execution of the concurrent communication primitives OP21, OP22 and OP23 can be entered, so that OP21 to OP23 can alternately (or time-sharingly) enter the "working" state, but unlike the execution of the serial communication primitives, the concurrent communication primitives OP21 to OP23 are concurrent, so their states are independent of each other. For example, as shown in Figure 11b, after the first execution of the concurrent communication primitives OP21 to OP23, OP21 can be in the "working" state, OP22 can be in the "executed" state, and OP23 can be in the "working" state. Assuming that the concurrent communication primitive OP2 is blocked at this time, the communication primitive can be re-executed according to the following methods and rules.
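The independent per-primitive states of FIG. 11b can be sketched as follows. The four states and the one-way direction of transitions come from the disclosure; the names State and advance, and the use of an integer enumeration to order the states, are assumptions made only for this illustration.

```python
from enum import IntEnum

class State(IntEnum):
    WAITING = 0    # not yet executed
    WORKING = 1    # request not completely issued
    EXECUTED = 2   # request completely issued, response not completely received
    CONFIRMED = 3  # request issued and response completely received

def advance(current, target):
    """Return the new state, rejecting any backward transition."""
    if target < current:
        raise ValueError(f"cannot go back from {current.name} to {target.name}")
    return target

# Each concurrent primitive keeps its own state, independent of the others.
states = {name: State.WAITING for name in ("OP21", "OP22", "OP23")}
for name in states:
    states[name] = advance(states[name], State.WORKING)
# Only OP22 gets as far as EXECUTED in the first round of FIG. 11b.
states["OP22"] = advance(states["OP22"], State.EXECUTED)
```

Keeping one state per primitive, rather than one state for the whole concurrent section, is what allows OP22 to progress while OP21 and OP23 remain in the working state.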
According to one embodiment of the present disclosure, re-executing the concurrent communication primitives according to their state includes: in response to communication blocking not having occurred in all of the multiple concurrent communication primitives, re-executing the concurrent communication primitives.
Still taking FIG. 11b as an example, since the communication primitive OP22 is not blocked, according to one rule of the present disclosure, when the concurrent communication primitives OP21, OP22 and OP23 are not all blocked, the blocked communication primitives OP21 and OP23 can be executed again. Specifically, after the concurrent communication primitive OP23 has been executed, execution reaches the concurrent end marker FE and then returns to the concurrent communication primitive OP21, so that the concurrent communication primitives that were blocked in the previous round are re-executed.
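The rule above can be sketched as a single return from FE. This is a simplified illustration, not the disclosed implementation: each primitive is modelled by a hypothetical callback attempt(name) that returns True when the primitive completes and False when it blocks, and the scripted results are invented for the example.

```python
def run_round(names, attempt):
    """Give each primitive one turn; return the names that reported blocking."""
    return [name for name in names if not attempt(name)]

def run_with_one_return(names, attempt):
    blocked = run_round(names, attempt)
    if blocked and len(blocked) < len(names):
        # Not all primitives blocked: return from FE and retry only the blocked ones.
        blocked = run_round(blocked, attempt)
    return blocked

# Scripted outcomes per attempt (illustrative only).
script = {"OP21": [False, True], "OP22": [True], "OP23": [False, False]}
log = []

def attempt(name):
    log.append(name)
    return script[name].pop(0)

remaining = run_with_one_return(["OP21", "OP22", "OP23"], attempt)
```

In this run the execution order recorded in log is OP21, OP22, OP23, OP21, OP23: OP22 is not visited again, and only OP23 remains blocked after the second round.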
According to one embodiment of the present disclosure, re-executing the concurrent communication primitives includes: for a concurrent communication primitive that has been partially executed, re-executing only the portion of that primitive that has not yet been executed.
As described above, if a communication primitive is blocked after part of it has been executed, its execution is incomplete. For such an incompletely executed communication primitive, the portion that has already been executed is not executed again; only the portion that has not been executed is executed.
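Partial re-execution can be sketched with a primitive that records how far it has progressed. The sketch assumes, for illustration only, that a primitive sends a list of chunks and that blocking is modelled by a per-call budget; the class name ChunkedSend and the budget parameter are hypothetical.

```python
class ChunkedSend:
    """A send primitive that remembers which chunks have already gone out."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.next_index = 0  # first chunk that has not been sent yet
        self.sent = []       # stands in for the link; records what was sent

    def run(self, budget):
        """Send up to `budget` chunks; return True once every chunk has been sent."""
        end = min(self.next_index + budget, len(self.chunks))
        while self.next_index < end:
            self.sent.append(self.chunks[self.next_index])
            self.next_index += 1
        return self.next_index == len(self.chunks)

op = ChunkedSend(["a", "b", "c", "d"])
first = op.run(budget=3)   # blocks after three chunks
second = op.run(budget=3)  # resumes at "d"; "a" to "c" are not sent again
```

Because the progress index survives the blocking, the second call transmits only the remaining chunk, which is the reduction in repeated execution that the embodiment describes.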
This has several beneficial effects. For example, executing only the portion of a communication primitive that has not yet been executed reduces repeated execution of communication primitives and improves efficiency during recovery.
In addition, the beneficial effects of executing only the unexecuted portion in certain specific situations were described above in conjunction with FIG. 9, and these beneficial effects apply equally to concurrent communication primitives.
According to one embodiment of the present disclosure, re-executing the concurrent communication primitives according to their state further includes: skipping concurrent communication primitives that are in the confirmed state, without re-executing them.
As shown in FIG. 11c, assume that when the concurrent communication primitives OP21, OP22 and OP23 are re-executed, OP22 is already in the "confirmed" state, while OP21 and OP23 have been blocked and are in the "executed" state. In this case the execution of OP22 is skipped, and only OP21 and OP23 are again executed alternately.
It should be understood that, when the concurrent communication primitives OP21, OP22 and OP23 are re-executed, OP22 being in the "confirmed" state in the above embodiment is only one implementation; a primitive does not have to be in the "confirmed" state to be skipped. In essence, as long as no communication blocking occurred in the execution of a concurrent communication primitive, its execution is skipped even if it has not yet reached the "confirmed" state in the second round.
On the other hand, according to one embodiment of the present disclosure, exiting execution of the concurrent communication primitives according to their state includes: in response to communication blocking occurring in all of the multiple concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
In this case, if all three concurrent communication primitives OP21, OP22 and OP23 are blocked while being executed alternately, so that none of them can execute normally, execution of these communication primitives can be exited and the sleep state entered. OP21, OP22 and OP23 as a whole can be regarded as one serial communication primitive, which is in a serial relationship with the upstream serial communication primitive OP1 and the downstream OP3. Therefore, as described above, when a serial communication primitive is blocked, execution can be exited at the serial communication primitive that is currently blocked.
Further, according to one embodiment of the present disclosure, exiting execution of the concurrent communication primitives according to their state includes: in response to the number of re-executions of the concurrent communication primitives reaching a predetermined number and communication blocking occurring in at least one of the multiple concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
According to the above embodiment, after execution returns from the concurrent end marker FE to the concurrent communication primitives and they are executed again, if the number of returns reaches the predetermined number and communication blocking still exists among the concurrent communication primitives, execution need not return to these concurrent communication primitives again, and the communication primitive queue can be exited. For example, as shown in FIG. 11d, after execution has returned from the concurrent end marker FE to the concurrent communication primitives twice, if at least one of OP21, OP22 and OP23 (OP21 and OP23 in FIG. 11d) is still blocked, execution need not return to these concurrent communication primitives again, and the communication primitive queue can be exited. The benefit is that execution of the concurrent communication primitives does not loop endlessly: when several attempts have failed to resolve the blocking, execution exits the concurrent communication primitives normally and exits the entire communication primitive queue, which avoids deadlock.
A counter can be added at the concurrent end marker FE. When the number of returns from the concurrent end marker FE to the concurrent communication primitives reaches the value specified by the counter, execution can exit and enter sleep.
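The counter at FE can be sketched as a bounded retry loop. This is an illustrative sketch under stated assumptions: the class EndMarker, the function run_concurrent and the callback attempt(name), which returns True on completion and False on blocking, are hypothetical names, and the sketch covers only the counter rule, not the other exit rules.

```python
class EndMarker:
    """Concurrent end marker FE carrying a counter of returns."""

    def __init__(self, max_returns):
        self.max_returns = max_returns
        self.returns = 0

    def may_return(self):
        """Count one more return to the concurrent section if the limit allows it."""
        if self.returns >= self.max_returns:
            return False
        self.returns += 1
        return True

def run_concurrent(names, attempt, marker):
    pending = list(names)
    while True:
        # One time-shared pass: keep only the primitives that blocked.
        pending = [name for name in pending if not attempt(name)]
        if not pending:
            return "done", pending
        if not marker.may_return():
            return "sleep", pending  # give up instead of looping forever

calls = []

def attempt(name):
    calls.append(name)
    return name == "OP22"  # scripted: only OP22 ever completes

outcome, pending = run_concurrent(["OP21", "OP22", "OP23"], attempt, EndMarker(max_returns=2))
```

With a limit of two returns, as in FIG. 11d, the section is executed three times in total (the first pass plus two returns) before the sketch reports "sleep" with OP21 and OP23 still pending.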
According to one embodiment of the present disclosure, in response to exiting execution of the concurrent communication primitives, a resume marker is added at the concurrent start marker, so that the exit position can be found easily when execution of the communication primitives is resumed.
Still referring to FIG. 11d, when some of the concurrent communication primitives OP21, OP22 and OP23 are still blocked after they have been executed multiple times, execution of these concurrent communication primitives is exited. On exit, a resume marker can be added at the concurrent start marker so that the exit point can be found readily when execution is resumed.
FIG. 11e shows a schematic diagram of coroutine recovery.
As shown in FIG. 11e, according to one embodiment of the present disclosure, in response to finding the resume marker, the concurrent communication primitives in which communication blocking occurred are re-executed.
As shown in FIG. 11e, a search is first made for a resume marker. If a resume marker is found at the concurrent start marker, at least one of the concurrent communication primitives following the concurrent start marker FB was blocked and needs to be re-executed. In this case, execution proceeds from the start directly to the concurrent start marker FB, skipping the serial communication primitives OP0 and OP1, and then enters the concurrent communication primitives OP21, OP22 and OP23. Since OP22 is already in the "confirmed" state, its execution is skipped, and only the blocked concurrent communication primitives OP21 and OP23 are executed. If at least one of OP21 and OP23 is still blocked, they can be re-executed multiple times from the concurrent end marker FE, or execution of the concurrent communication primitives can be exited at the concurrent end marker FE, together with execution of the entire communication primitive queue.
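The recovery step can be sketched as a lookup that starts at FB when a resume marker is present. The sketch assumes, for illustration only, that the queue is a flat list containing the strings "FB" and "FE", that states are plain strings, and that the resume marker is a boolean attached to FB; the function name primitives_to_rerun is hypothetical.

```python
def primitives_to_rerun(queue, states, resume_marked):
    """Return the concurrent primitives to execute again on recovery."""
    if not resume_marked:
        return []
    # Jump straight to the concurrent section, skipping the serial primitives before FB.
    start = queue.index("FB") + 1
    end = queue.index("FE")
    # Primitives already confirmed are skipped.
    return [op for op in queue[start:end] if states[op] != "confirmed"]

queue = ["OP0", "OP1", "FB", "OP21", "OP22", "OP23", "FE", "OP3", "OP4", "OP5"]
states = {
    "OP0": "confirmed", "OP1": "confirmed",
    "OP21": "executed", "OP22": "confirmed", "OP23": "executed",
    "OP3": "waiting", "OP4": "waiting", "OP5": "waiting",
}
rerun = primitives_to_rerun(queue, states, resume_marked=True)
```

For the situation of FIG. 11e, the sketch selects OP21 and OP23 only, which matches the description that OP22 is skipped because it is already confirmed.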
Alternatively, according to one embodiment of the present disclosure, exiting execution of the concurrent communication primitives according to their state includes: in response to all of the concurrent communication primitives being in the confirmed state, exiting execution of the concurrent communication primitives and executing the serial communication primitive that follows the concurrent end marker.
As shown in FIG. 11f, if neither OP21 nor OP23 is blocked, execution of the concurrent communication primitives can be exited at the concurrent end marker FE, and the serial communication primitive OP3 is executed next. The execution of the serial communication primitives OP3, OP4 and OP5 has been described above in conjunction with FIG. 8 and is not repeated here. The execution of concurrent communication primitives has been described above in conjunction with FIG. 11a to FIG. 11f; these concurrent communication primitives can stand alone or, as in FIG. 11a to FIG. 11f, be combined with serial communication primitives.
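The rules applied at the concurrent end marker FE in the embodiments above can be summarized in one decision function. This is a sketch under stated assumptions, not the disclosed implementation: the function name decide_at_fe, the status strings "confirmed", "blocked" and "pending", and the choice to retry when nothing is blocked but not everything is confirmed are all illustrative.

```python
def decide_at_fe(status, returns, max_returns):
    """Decide what to do at FE.

    status maps each concurrent primitive to "confirmed", "blocked" or "pending";
    returns is how many times execution has already returned from FE.
    """
    values = list(status.values())
    if all(v == "confirmed" for v in values):
        return "continue"  # go on to the serial primitive after FE
    if all(v == "blocked" for v in values):
        return "exit"      # every primitive blocked: leave the queue and sleep
    if returns >= max_returns and any(v == "blocked" for v in values):
        return "exit"      # retry limit reached with blocking still present
    return "retry"         # return to the concurrent section

fig_11c = {"OP21": "blocked", "OP22": "confirmed", "OP23": "blocked"}
first_pass = decide_at_fe(fig_11c, returns=0, max_returns=2)
after_two_returns = decide_at_fe(fig_11c, returns=2, max_returns=2)
```

For the state of FIG. 11c the sketch returns "retry" on the first pass and "exit" once two returns have been used, corresponding to FIG. 11d; when every primitive is confirmed it returns "continue", corresponding to FIG. 11f.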
According to one embodiment of the present disclosure, an electronic device is further provided, comprising: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
According to one embodiment of the present disclosure, a computer-readable storage medium is further provided, comprising computer-executable instructions which, when run by one or more processors, perform the method described above.
Table 1 below shows the differences between the technical solution of the present disclosure and the first and second solutions described above.
Table 1
Without introducing a hardware multi-threading mechanism, the technical solution of the present disclosure uses a software coroutine method to give the computing core a time-division multiplexing capability, so that the computing core is fully utilized and task deadlock is avoided. The coroutine execution flow requires only small hardware changes and broadly supports software time-division multiplexing on various SIMD processing architectures. In addition, the primitive jump mechanism supports asynchronous confirmation of asynchronous communication primitives, so that automatic software communication retransmission can be achieved without modifying the OP logic. The alternating execution mechanism supports concurrent execution of multiple communication primitives, with an effect similar to single-core multi-threading, which saves computing cores. The solution of the present disclosure is sufficient to solve the deadlock problem caused by communication blocking.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, cloud server, server cluster, data processing apparatus, robot, computer, printer, scanner, tablet computer, smart terminal, PC device, Internet of Things terminal, mobile terminal, mobile phone, dashboard camera, navigator, sensor, webcam, camera, video camera, projector, watch, earphones, mobile storage, wearable device, visual terminal, autonomous driving terminal, vehicle, household appliance and/or medical device. The vehicles include aircraft, ships and/or automobiles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present disclosure can also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and medical care. Further, the electronic device or apparatus of the present disclosure can be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure can be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption can be applied to a terminal device and/or an edge device (for example, a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work for device-cloud integration or cloud-edge-device integration.
The technical solution of the present disclosure can be better understood through the following clauses.
Clause 1. A method for executing an inter-chip communication task, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes a plurality of communication primitives, and the plurality of communication primitives include serially connected serial communication primitives, the method comprising:
performing a search of the communication primitive queue to determine the states of the serial communication primitives in the communication primitive queue;
in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
Clause 2. The method according to clause 1, further comprising:
defining a state machine, the state machine being used to describe the working state of a communication primitive;
determining whether a communication primitive has been interrupted according to the working state of the communication primitive described by the state machine;
wherein the working state includes:
a waiting state, indicating that the communication primitive has not been executed;
a working state, following the waiting state, indicating that the communication primitive is being executed, that its communication request has not been completely issued, and that the response signal has not been completely received;
an executed state, following the working state, indicating that the communication primitive has been executed, that its communication request has been completely issued, and that the response signal has not been completely received; and
a confirmed state, following the executed state, indicating that the communication request of the communication primitive has been completely issued and the response signal has been completely received.
Clause 3. The method according to clause 2, further comprising:
starting execution of the communication primitive queue, so that the serial communication primitives enter the working state from the waiting state one by one;
transitioning the working state according to the execution of the serial communication primitives, wherein the transition of the working state is one-way.
Clause 4. The method according to clause 3, wherein, in the case of a single computing core, at most one serial communication primitive is in the working state.
Clause 5. The method according to clause 3, wherein, in the communication primitive queue, the working state of a later serial communication primitive is prohibited from being further along in the state sequence than the working state of an earlier serial communication primitive.
Clause 6. The method according to any one of clauses 1-5, further comprising: in response to communication blocking occurring in a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive in which the communication blocking occurred.
Clause 7. The method according to any one of clauses 1-6, further comprising: in response to communication blocking occurring in a serial communication primitive, keeping the working state of the corresponding serial communication primitive in the executed state.
Clause 8. The method according to any one of clauses 1-7, wherein re-executing the communication primitive queue starting from the interrupted serial communication primitive comprises:
for a serial communication primitive that has been partially executed, re-executing only the portion of that serial communication primitive that has not yet been executed.
Clause 9. The method according to any one of clauses 1-8, wherein the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, the method further comprising: executing the concurrent communication primitives in a time-sharing manner.
Clause 10. The method according to clause 9, further comprising:
inserting a concurrent start marker between the concurrent communication primitives and the preceding serial communication primitive, so that when execution of the communication primitive queue reaches the concurrent start marker, the concurrent communication primitives are executed in a time-sharing manner; and
inserting a concurrent end marker between the concurrent communication primitives and the following serial communication primitive, so that when execution of the communication primitive queue reaches the concurrent end marker, the concurrent communication primitives are re-executed, or execution of the concurrent communication primitives is exited, according to the state of the concurrent communication primitives.
Clause 11. The method according to clause 10, wherein executing the concurrent communication primitives in a time-sharing manner comprises: causing the concurrent communication primitives to alternately enter the working state from the waiting state.
Clause 12. The method according to clause 10 or 11, wherein re-executing the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
in response to communication blocking not having occurred in all of the multiple concurrent communication primitives, re-executing the concurrent communication primitives.
Clause 13. The method according to clause 12, wherein re-executing the concurrent communication primitives comprises: for a concurrent communication primitive that has been partially executed, re-executing only the portion of that concurrent communication primitive that has not yet been executed.
Clause 14. The method according to clause 12, wherein re-executing the concurrent communication primitives according to the state of the concurrent communication primitives further comprises: skipping concurrent communication primitives that are in the confirmed state, without re-executing them.
Clause 15. The method according to any one of clauses 10-14, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
in response to communication blocking occurring in all of the multiple concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
Clause 16. The method according to any one of clauses 10-15, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
in response to the number of re-executions of the concurrent communication primitives reaching a predetermined number and communication blocking occurring in at least one of the multiple concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
Clause 17. The method according to clause 15 or 16, wherein, in response to exiting execution of the concurrent communication primitives, a resume marker is added at the concurrent start marker.
Clause 18. The method according to clause 17, further comprising: in response to finding the resume marker, re-executing the concurrent communication primitives in which communication blocking occurred.
Clause 19. The method according to any one of clauses 10-18, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
in response to all of the concurrent communication primitives being in the confirmed state, exiting execution of the concurrent communication primitives and executing the serial communication primitive that follows the concurrent end marker.
Clause 20. An electronic device, comprising:
one or more processors; and
a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method according to any one of clauses 1-19.
Clause 21. A computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method according to any one of clauses 1-19.
The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core idea. Changes or variations made by those skilled in the art to the specific implementations and scope of application, based on the ideas of the present disclosure, all fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (21)

  1. A method for executing an inter-chip communication task, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes a plurality of communication primitives, and the plurality of communication primitives include serially connected serial communication primitives, the method comprising:
    performing a search of the communication primitive queue to determine the state of the serial communication primitives in the communication primitive queue; and
    in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
  2. The method according to claim 1, further comprising:
    defining a state machine, the state machine being used to describe the working state of a communication primitive; and
    determining whether a communication primitive has been interrupted according to the working state of the communication primitive described by the state machine;
    wherein the working state includes:
    a waiting state, indicating that the communication primitive has not been executed;
    an in-progress state, following the waiting state, indicating that the communication primitive is being executed, that its communication request has not been completely issued, and that the response signal has not been completely received;
    an executed state, following the in-progress state, indicating that the communication primitive has been executed, that its communication request has been completely issued, and that the response signal has not been completely received; and
    a confirmed state, following the executed state, indicating that the communication request of the communication primitive has been completely issued and that the response signal has been completely received.
  3. The method according to claim 2, further comprising:
    starting execution of the communication primitive queue so that the serial communication primitives enter the in-progress state from the waiting state one by one; and
    transitioning the working state according to the execution of the serial communication primitives, wherein transitions of the working state are unidirectional.
  4. The method according to claim 3, wherein, in the case of a single computing core, at most one serial communication primitive is in the in-progress state.
  5. The method according to claim 3, wherein, in the communication primitive queue, the working state of a subsequent serial communication primitive is prohibited from being later in the state sequence than the working state of a preceding serial communication primitive.
  6. The method according to any one of claims 1 to 5, further comprising: in response to communication blocking occurring at a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive at which the communication blocking occurs.
  7. The method according to any one of claims 1 to 6, further comprising: in response to communication blocking occurring at a serial communication primitive, keeping the working state of the corresponding serial communication primitive in the executed state.
  8. The method according to any one of claims 1 to 7, wherein re-executing the communication primitive queue starting from the interrupted serial communication primitive comprises:
    for a serial communication primitive that has already been partially executed, re-executing only the portion of that serial communication primitive that has not yet been executed.
  9. The method according to any one of claims 1 to 8, wherein the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, the method further comprising: executing the concurrent communication primitives in a time-sharing manner.
  10. The method according to claim 9, further comprising:
    inserting a concurrency start marker between the concurrent communication primitives and the preceding serial communication primitive, so that when execution of the communication primitive queue reaches the concurrency start marker, the concurrent communication primitives are executed in a time-sharing manner; and
    inserting a concurrency end marker between the concurrent communication primitives and the following serial communication primitive, so that when execution of the communication primitive queue reaches the concurrency end marker, the concurrent communication primitives are either re-executed or exited, depending on the state of the concurrent communication primitives.
  11. The method according to claim 10, wherein executing the concurrent communication primitives in a time-sharing manner comprises: causing the concurrent communication primitives to alternately enter the in-progress state from the waiting state.
  12. The method according to claim 10 or 11, wherein re-executing the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
    in response to communication blocking not having occurred at all of the plurality of concurrent communication primitives, re-executing the concurrent communication primitives.
  13. The method according to claim 12, wherein re-executing the concurrent communication primitives comprises: for a concurrent communication primitive that has already been partially executed, re-executing only the portion of that concurrent communication primitive that has not yet been executed.
  14. The method according to claim 12, wherein re-executing the concurrent communication primitives according to the state of the concurrent communication primitives further comprises: skipping concurrent communication primitives that are in the confirmed state, without re-executing them.
  15. The method according to any one of claims 10 to 14, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
    in response to communication blocking occurring at all of the plurality of concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
  16. The method according to any one of claims 10 to 15, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
    in response to the number of re-executions of the concurrent communication primitives reaching a predetermined number and communication blocking occurring at at least one of the plurality of concurrent communication primitives, exiting execution of the concurrent communication primitives and exiting execution of the communication primitive queue.
  17. The method according to claim 15 or 16, wherein, in response to exiting execution of the concurrent communication primitives, a recovery marker is added at the concurrency start marker.
  18. The method according to claim 17, further comprising: in response to finding the recovery marker during the search, re-executing the concurrent communication primitives at which communication blocking occurred.
  19. The method according to any one of claims 10 to 18, wherein exiting execution of the concurrent communication primitives according to the state of the concurrent communication primitives comprises:
    in response to all of the concurrent communication primitives being in the confirmed state, exiting execution of the concurrent communication primitives and executing the serial communication primitive that follows the concurrency end marker.
  20. An electronic device, comprising:
    one or more processors; and
    a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method according to any one of claims 1 to 19.
  21. A computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method according to any one of claims 1 to 19.
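
As an informal illustration of the serial part of the claims (claims 1 to 8), the following Python sketch models the four states of claim 2, the one-way transitions of claim 3, and resumption from the point of interruption. It is a minimal sketch under stated assumptions, not the implementation disclosed in this application. The names `State`, `Primitive`, `find_resume_point`, and `run_queue` are hypothetical, as is the idea of splitting a request into chunks, modelling blocking with a send `budget`, and modelling response signals with a set of acknowledged names. It also assumes that the interrupted primitive is the first one in the queue that is not yet in the confirmed state, which follows from the ordering constraint of claim 5 but is not stated in these terms in the claims.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Set


class State(IntEnum):
    WAITING = 0      # primitive has not been executed
    IN_PROGRESS = 1  # request not completely issued
    EXECUTED = 2     # request completely issued, response not yet complete
    CONFIRMED = 3    # request completely issued and response received


@dataclass
class Primitive:
    name: str
    total_chunks: int            # hypothetical: a request consists of chunks
    sent_chunks: int = 0
    state: State = State.WAITING

    def advance(self, new_state: State) -> None:
        # Transitions are one-way (claim 3): a state may never move backwards.
        if new_state < self.state:
            raise ValueError("state transitions are unidirectional")
        self.state = new_state


def find_resume_point(queue: List[Primitive]) -> Optional[int]:
    """Search the queue; return the index of the first unconfirmed primitive."""
    for index, primitive in enumerate(queue):
        if primitive.state != State.CONFIRMED:
            return index
    return None


def run_queue(queue: List[Primitive], budget: int, acked: Set[str]) -> int:
    """Resume the queue from the interrupted primitive; return chunks sent.

    budget: hypothetical number of chunks the link accepts before blocking.
    acked:  hypothetical set of primitive names whose response has arrived.
    """
    start = find_resume_point(queue)
    if start is None:
        return 0
    sent = 0
    for primitive in queue[start:]:
        if primitive.state == State.CONFIRMED:
            continue
        if primitive.state == State.WAITING:
            primitive.advance(State.IN_PROGRESS)
        if primitive.state == State.IN_PROGRESS:
            # Send only the part that has not been sent before (claim 8).
            remaining = primitive.total_chunks - primitive.sent_chunks
            n = min(budget - sent, remaining)
            primitive.sent_chunks += n
            sent += n
            if primitive.sent_chunks < primitive.total_chunks:
                return sent  # blocked mid-request: exit the queue (claim 6)
            primitive.advance(State.EXECUTED)
        if primitive.name not in acked:
            return sent      # no response yet: stay EXECUTED (claim 7)
        primitive.advance(State.CONFIRMED)
    return sent


queue = [Primitive("send_a", 2), Primitive("send_b", 3), Primitive("send_c", 1)]
run_queue(queue, budget=3, acked={"send_a"})               # interrupted in send_b
run_queue(queue, budget=10, acked={"send_a", "send_b", "send_c"})
print([(p.name, p.state.name, p.sent_chunks) for p in queue])
```

In this sketch the second call does not restart `send_a` or resend the chunk of `send_b` that was already issued; it continues from the stored `sent_chunks` counter. The concurrent primitives and the start, end, and recovery markers of claims 9 to 19 are not modelled here.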
PCT/CN2023/112579 2022-12-09 2023-08-11 Method for executing inter-chip communication task, and related product WO2024119869A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211589123.4 2022-12-09
CN202211589123.4A CN118170553A (en) 2022-12-09 2022-12-09 Method for executing inter-chip communication task and related product

Publications (1)

Publication Number Publication Date
WO2024119869A1 (en) 2024-06-13

Family

ID=91347479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112579 WO2024119869A1 (en) 2022-12-09 2023-08-11 Method for executing inter-chip communication task, and related product

Country Status (2)

Country Link
CN (1) CN118170553A (en)
WO (1) WO2024119869A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200409709A1 (en) * 2019-06-29 2020-12-31 Intel Corporation Apparatuses, methods, and systems for time-multiplexing in a configurable spatial accelerator
CN112463710A (en) * 2020-12-10 2021-03-09 西安交通大学 Inter-core communication method and system based on embedded platform
CN112527729A (en) * 2020-12-15 2021-03-19 杭州慧芯达科技有限公司 Tightly-coupled heterogeneous multi-core processor architecture and processing method thereof
CN114691312A (en) * 2020-12-31 2022-07-01 中科寒武纪科技股份有限公司 Circuit, method and system for inter-chip communication

Also Published As

Publication number Publication date
CN118170553A (en) 2024-06-11
