WO2024119869A1 - Method for executing inter-chip communication task, and related product

Info

Publication number: WO2024119869A1
Application number: PCT/CN2023/112579
Authority: WO
Inventors: 朝鲁
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2022-12-09
Filing date: 2023-08-11
Publication date: 2024-06-13
Also published as: CN118170553A

Abstract

A method for executing an inter-chip communication task, a corresponding electronic device and a readable storage medium. An inter-chip communication task is described by means of a communication primitive queue, the communication primitive queue comprising a plurality of communication primitives, and the plurality of communication primitives comprising serial communication primitives which are serially connected. The method comprises: executing a search for a communication primitive queue to determine states of serial communication primitives in the communication primitive queue; and in response to having found an interrupted serial communication primitive, re-executing the communication primitive queue from the interrupted serial communication primitive.

Description

A method for performing inter-chip communication tasks and related products

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to a Chinese patent application filed on December 9, 2022, with application number 202211589123.4 and titled “A method for performing inter-chip communication tasks and related products”.

Technical Field

The present disclosure relates to the field of chips, and more specifically, to the field of inter-chip communication of chips.

Background

How to program software around a chip's dedicated, efficient inter-chip communication device is a key issue in achieving highly scalable artificial intelligence network training. Inter-chip communication has two core problems: how to write data to the remote chip, and how to sense whether the awaited data has arrived at the remote chip. The latter, communication synchronization, is the main subject of the present disclosure. However, using computing cores for polling puts computing resources in extremely short supply and eventually causes communication deadlock. How to avoid communication deadlock with a single computing core while improving computing resource utilization is a problem that needs to be solved.

Summary of the invention

One purpose of the present disclosure is to solve how to use a single computing core of an artificial intelligence chip to avoid communication deadlock through coroutine programming. A further purpose of the present disclosure is to use a single computing core of an artificial intelligence chip to complete time-division multiplexing inter-chip communication through coroutine programming to support concurrent communication tasks.

According to a first aspect of the present disclosure, a method for executing an inter-chip communication task is provided, wherein the inter-chip communication task is described by a communication primitive queue, and the communication primitive queue includes a plurality of communication primitives, the plurality of communication primitives including a serial communication primitive connected in series, the method comprising: performing a search for the communication primitive queue to determine the status of the serial communication primitives in the communication primitive queue; in response to searching for an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.

According to a second aspect of the present disclosure, an electronic device is provided, comprising: one or more processors; and a memory, wherein the memory stores computer executable instructions, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method described above.

According to a third aspect of the present disclosure, a computer-readable storage medium is provided, comprising computer-executable instructions. When the computer-executable instructions are executed by one or more processors, the method described above is executed.

The technical solution provided by the present disclosure can bring at least one of the following beneficial effects: without introducing a hardware multi-threading mechanism, the time-sharing reuse capability of the computing core can be realized by using a software coroutine method, so that the computing core can be fully utilized and task deadlock can be avoided. The coroutine execution process has relatively small changes to the hardware, and generally supports various SIMD (Single Instruction Multiple Data) processing architectures to realize software time-sharing reuse. In addition, the asynchronous confirmation method of asynchronous communication primitives is supported by the primitive jump mechanism, and automatic software communication retransmission can be realized without modifying the OP (communication primitive) logic. Through the alternating execution mechanism, the concurrent execution of multiple communication primitives can be supported, which is similar to the effect of single-core multi-threading and saves the use of computing cores. The solution disclosed in the present disclosure is sufficient to solve the deadlock problem caused by communication congestion.

BRIEF DESCRIPTION OF THE DRAWINGS

By reading the detailed description below with reference to the accompanying drawings, the above and other purposes, features and advantages of the exemplary embodiments of the present disclosure will become readily understood. In the accompanying drawings, several embodiments of the present disclosure are shown in an exemplary and non-limiting manner, and the same or corresponding reference numerals represent the same or corresponding parts, wherein:

FIG. 1 is a schematic diagram showing the structure of a board 10 according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram showing the combined processing device 101 of this embodiment;

FIG. 3 is a schematic diagram showing the internal structure of the computing device 201;

FIG. 4 shows the internal architecture of the processor core;

FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure;

FIG. 6 shows an example of coroutine execution according to one embodiment of the present disclosure;

FIG. 7 shows possible changes in the working state of a communication primitive (OP);

FIG. 8 is a schematic diagram showing the execution of a serial primitive queue according to an embodiment of the present disclosure;

FIG. 9 shows a schematic diagram of communication primitives involving in-situ operations;

FIG. 10 shows an exemplary application scenario in which there are multiple concurrent communication primitives; and

FIGS. 11a to 11f are schematic diagrams showing a method of setting concurrent communication primitives between serial communication primitives according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work are within the scope of protection of the present disclosure.

It should be understood that the terms "first", "second", "third", "fourth", etc. in the claims, specification and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. "First", "second", "third", "fourth", etc. do not just mean one, but may also mean multiple. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or their collections.

It should also be understood that the terms used in this disclosure are only for the purpose of describing specific embodiments and are not intended to limit the disclosure. As used in this disclosure and claims, the singular forms of "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should also be further understood that the term "and/or" used in this disclosure and claims refers to any combination of one or more of the associated listed items and all possible combinations, including these combinations.

As used in this specification and claims, the term "if" may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting," depending on the context. Similarly, the phrase "if it is determined" or "if [described condition or event] is detected" may be interpreted as meaning "upon determination" or "in response to determining" or "upon detection of [described condition or event]" or "in response to detecting [described condition or event]," depending on the context.

The specific implementation of the present disclosure is described in detail below with reference to the accompanying drawings.

Today's semiconductor manufacturing process starts with a complete wafer. Wafers are circular sheets made of pure silicon, generally divided into 6-inch, 8-inch, 12-inch and other specifications. Wafers are cut into small pieces, which are called dies. Each die is mounted with a chip and wired to achieve specific electrical functions. Then the die is packaged into a particle. The purpose of packaging is to place, fix, seal, protect the chip and enhance the electrical and thermal performance. At the same time, the contacts of the chip are connected to the pins of the package shell with wires, and a chip package structure is completed.

The memory is used to temporarily store the computing data required by the system on chip and the data exchanged with the external memory. In this embodiment, the memory can be a high-bandwidth memory (HBM), which is a high-performance DRAM (Dynamic Random Access Memory) made based on a 3D stacking process and is suitable for applications with high memory bandwidth requirements, such as graphics processors, online switching and forwarding equipment (such as routers, switches), etc.

System on Chip (SoC) refers to a technology that integrates a complete system on a single chip and packages all or part of the necessary electronic circuits. In this embodiment, the system on chip is assembled on a board. FIG. 1 shows a schematic diagram of the structure of a board 10 of an embodiment of the present disclosure. As shown in FIG. 1, the board 10 includes a combined processing device 101, which is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in the field of cloud intelligence. A notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications and has huge off-chip storage, on-chip storage and a large amount of computing power.

The combined processing device 101 is connected to the external device 103 through the external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. The data to be processed can be transmitted from the external device 103 to the combined processing device 101 through the external interface device 102. The calculation result of the combined processing device 101 can be transmitted back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 can take different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.

The board 10 also includes an external memory 104 for storing data, which includes one or more storage units 105. The external memory 104 is connected to the control device 106 and the combined processing device 101 through a bus and transmits data. The control device 106 in the board 10 is configured to control the state of the combined processing device 101. To this end, in an application scenario, the control device 106 may include a single chip microcomputer, also known as a micro control unit (Micro Controller Unit, MCU).

FIG. 2 is a schematic diagram showing the combined processing device 101 of this embodiment. As shown in FIG. 2, the combined processing device 101 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204. In one application scenario, the computing device 201, the interface device 202, and the processing device 203 are integrated into the aforementioned system on chip. In another application scenario, the computing device 201 itself is the aforementioned system on chip.

The computing device 201 is configured to execute user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.

The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into the storage device on the computing device 201 chip. Further, the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the computing device 201 chip. Alternatively or optionally, the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203.

The processing device 203, as a general processing device, performs basic controls including but not limited to data handling, starting and/or stopping the computing device 201, etc. According to different implementations, the processing device 203 can be a central processing unit, a graphics processing unit, or one or more types of processors in other general and/or special processors, which include but are not limited to digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, only with respect to the computing device 201 disclosed in the present invention, it can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are integrated and considered together, the two are regarded as forming a heterogeneous multi-core structure.

DRAM 204 is the aforementioned high-bandwidth memory, which is used to store data to be processed. Its size is usually 16G or larger and is used to save data of the computing device 201 and/or the processing device 203.

FIG. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining. The computing device 201 in the figure adopts a multi-core hierarchical structure design, which includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and multiple clusters 305.

There can be multiple external storage controllers 301, and two are shown in the figure as an example. They are used to respond to access requests from the processor cores and to access external storage devices, such as DRAM 204 in FIG. 2, so as to read data from, or write data to, off-chip storage. The peripheral communication module 302 is used to receive the control signal from the processing device 203 through the interface device 202, and start the computing device 201 to perform the task. The on-chip interconnect module 303 connects the external storage controller 301, the peripheral communication module 302 and the multiple clusters 305 to transmit data and control signals between the modules. The synchronization module 304 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information. The multiple clusters 305 are the computing cores of the computing device 201. Four are shown as examples in the figure. With the development of hardware, the computing device 201 of the present disclosure can also include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.

Each cluster 305 includes multiple processor cores (IPU Core) 306 and a storage core (MEM Core) 307.

The figure shows four processor cores 306 as an example, and the present disclosure does not limit the number of processor cores 306. The internal architecture of a processor core is shown in FIG. 4. Each processor core 306 includes three modules: a control module 41, an operation module 42, and a storage module 43.

The control module 41 is used to coordinate and control the operation of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 is used to obtain instructions from the processing device 203, and the instruction decode unit 412 decodes the obtained instructions and sends the decoding results to the operation module 42 and the storage module 43 as control information.

The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.

The storage module 43 is used to store or transfer related data, including a neuron RAM (NRAM) 431, a weight RAM (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a transfer direct memory access module (MVDMA) 434. NRAM 431 is used to store input and output data and intermediate results for calculation by the processor core 306; WRAM 432 is used to store the weights of the deep learning network; IODMA 433 controls the memory access between NRAM 431/WRAM 432 and DRAM 204 through the broadcast bus 309; MVDMA 434 is used to control the memory access between NRAM 431/WRAM 432 and SRAM 308.

Returning to FIG. 3 , the storage core 307 is mainly used for storage and communication, that is, to store shared data or intermediate results between the processor cores 306, and to perform communication between the cluster 305 and the DRAM 204, between the clusters 305, and between the processor cores 306. In other embodiments, the storage core 307 has the ability of scalar operations and is used to perform scalar operations.

The storage core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 plays the role of a high-performance data transfer station. The data reused between different processor cores 306 in the same cluster 305 does not need to be obtained from the DRAM 204 by each processor core 306, but is transferred between the processor cores 306 through the SRAM 308. The storage core 307 only needs to quickly distribute the reused data from the SRAM 308 to multiple processor cores 306, so as to improve the efficiency of inter-core communication and greatly reduce on-chip and off-chip input/output access.

Broadcast bus 309, CDMA 310 and GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305 and data transmission between clusters 305 and DRAM 204, respectively. They will be described below.

The broadcast bus 309 is used to complete high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission, multicast is a communication mode of transmitting a copy of data from SRAM 308 to specific processor cores 306, and broadcast is a communication mode of transmitting a copy of data from SRAM 308 to all processor cores 306, which is a special case of multicast.

CDMA 310 is used to control access to the SRAM 308 between different clusters 305 in the same computing device 201. GDMA 311 cooperates with the external storage controller 301 to control memory access from the SRAM 308 of a cluster 305 to DRAM 204, or to read data from DRAM 204 into SRAM 308.

The "inter-chip" described in the present disclosure includes multiple meanings. First, "machine" usually refers to a server computing node host, and "inter-machine communication" can refer to the communication between multiple computing node hosts. "Card" usually refers to a dedicated AI (Artificial Intelligence) computing device installed on a server computing node, and the "card" has one or more chips, such as an MLU (Machine Learning Unit, machine learning processor) or a GPU (Graphics Processing Unit, graphics processor). A "machine" usually has multiple "cards", and a distributed training may involve multiple "machines" and multiple "cards" and "chips". There are high-speed inter-chip interconnection communication devices between multiple machines and multiple cards, such as an inter-chip communication network built on SerDes (serializer/deserializer) and a host-level network based on InfiniBand. In the present disclosure, inter-chip communication includes communication between different chips on multiple hosts, communication between different chips on the same "card", and communication between different chips in multiple cards on the same host.

RDMA (Remote Direct Memory Access) refers to remote DMA, that is, card A can asynchronously write/read data to/from the memory of card B without card B performing any operation.

AllReduce operator: In multi-machine, multi-card neural network training, to ensure that the data-parallel training results across machines and cards converge, each device participating in the distributed training needs to pass the gradient information ΔW_i obtained from back propagation (BP) on the current device to the other devices, so that every device finally obtains the reduced result of all gradient information, that is, ∑ΔW_i. The method of propagating and accumulating gradient information is called the AllReduce operator.

Ring AllReduce algorithm: The AllReduce operator can be implemented on different network topologies. The AllReduce operator optimized for the ring topology (Ring) uses the Ring AllReduce algorithm. From the perspective of a single card, the core processes that AllReduce needs to implement are: Receive (abbreviated as R), Compute (abbreviated as C), and Send (abbreviated as S). In the Ring AllReduce algorithm, the R part corresponds to receiving the gradient information ΔW_(i-1) sent by the upstream device, the C part corresponds to calculating ΔW_i = Add(ΔW_(i-1), ΔW_i), and the S part corresponds to sending the updated gradient information ΔW_i to the downstream device.
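The Receive, Compute and Send steps can be illustrated with a small single-process simulation. This is a minimal sketch, not the implementation of the disclosure: it assumes the common two-phase form of Ring AllReduce (reduce-scatter followed by all-gather), one scalar per chunk, and as many chunks as devices. The function name `ring_allreduce` is hypothetical.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate Ring AllReduce for n devices, each holding n one-element chunks."""
    n = len(grads)
    data = [list(g) for g in grads]  # data[d][c] is chunk c on device d

    # Phase 1, reduce-scatter: in step s device d sends chunk (d - s) % n (S);
    # its downstream neighbour receives it (R) and adds it to its own copy (C).
    for s in range(n - 1):
        sent = [data[d][(d - s) % n] for d in range(n)]  # all sends of this step
        for d in range(n):
            up = (d - 1) % n          # upstream neighbour of device d
            c = (up - s) % n          # chunk index that neighbour sent
            data[d][c] = data[d][c] + sent[up]

    # Phase 2, all-gather: circulate the fully reduced chunks around the ring.
    for s in range(n - 1):
        sent = [data[d][(d + 1 - s) % n] for d in range(n)]
        for d in range(n):
            up = (d - 1) % n
            c = (up + 1 - s) % n
            data[d][c] = sent[up]
    return data


reduced = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

After the call, every device in `reduced` holds the same element-wise sum of the three input gradients.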

Synchronization problem: In RDMA mode, the computing core of card A writes a data payload to the memory area of card B, but the computing core of card B cannot sense whether the data payload has been written. If the subsequent execution steps of the computing core of card B depend on the arrival of the data payload, the computing core of card B needs to sense that arrival. The process of sensing the arrival of the data payload is called communication synchronization.

Communication deadlock problem: Suppose card A and card B each have one computing core, and there are two dual-card communication tasks X and Y, which are dispatched to the two cards as X_A, X_B and Y_A, Y_B, respectively. A communication task requires both ends of the communication to be running the same task in order to communicate normally. Suppose that at some moment the computing core on card A is occupied by X_A and the computing core on card B is occupied by Y_B. X_A and Y_B are then both stuck in communication synchronization, polling and waiting for data to arrive. Because the tasks do not match, however, communication tasks X and Y will wait forever, causing a communication deadlock.
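The deadlock condition can be reproduced in a toy model. The sketch below is illustrative only and rests on several assumptions: each task writes one flag to the peer card and then waits for the peer's flag of the same task, each card has a single core occupied by the task at the head of its list, and `simulate` and its parameters are hypothetical names.

```python
def simulate(order, cooperative, max_rounds=10):
    """Return True if every task finishes within max_rounds, else False.

    order maps a card name to its task names in launch order.
    """
    peer = {"A": "B", "B": "A"}
    flags = set()   # (task, card) pairs: the flag of `task` has arrived on `card`
    sent = set()    # (task, card) pairs: `card` has already sent for `task`
    pending = {card: list(tasks) for card, tasks in order.items()}
    for _ in range(max_rounds):
        for card, queue in pending.items():
            if not queue:
                continue
            task = queue[0]                    # task occupying the single core
            if (task, card) not in sent:
                flags.add((task, peer[card]))  # RDMA-style write to the peer
                sent.add((task, card))
            if (task, card) in flags:
                queue.pop(0)                   # awaited data arrived: task done
            elif cooperative:
                queue.append(queue.pop(0))     # yield the core to the next task
            # otherwise the task keeps polling and the core stays occupied
        if not any(pending.values()):
            return True
    return False


launch_order = {"A": ["X", "Y"], "B": ["Y", "X"]}
blocked_forever = not simulate(launch_order, cooperative=False)
finishes_when_yielding = simulate(launch_order, cooperative=True)
```

With pure polling, X_A and Y_B occupy their cores indefinitely; when a blocked task gives up the core, the matching tasks eventually meet and both complete.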

Currently, communication synchronization usually involves the following solutions.

The first solution is polling: hardware multi-threaded programming method (Single Instruction, Multiple Threads, SIMT). The specific communication synchronization steps are as follows:

Step a. Device A writes data (Data) and a tag (Flag) to the specified memory area of device B in sequence through RDMA.

Step b. The computing core of device B polls whether the Flag has changed in the communication receiving task. If the Flag has not changed, go to step c; if the Flag has changed, go to step d.

Step c. If the Flag has not changed, the Data has not yet completed transmission. Because the wait is unproductive, the thread can be switched out of the current processing, releasing the computing core to other computing/communication tasks. Specifically, in hardware multi-threading the execution context (including the current program execution pointer, stack information, and register information) is saved when the thread is switched out and restored when the thread is switched back, and execution resumes from the thread breakpoint. When the thread resumes execution, if the Flag is read as changed, go to step d; otherwise continue polling in step c.

Step d. If the Flag changes, it means that the Data has been transmitted. At this time, the computing core of device B can safely read the Data, for example, to perform Reduce calculations.
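Steps a to d can be sketched as follows. This is a simplified single-process illustration under stated assumptions: a Python dictionary stands in for device B's memory area, the names `device_a_write` and `device_b_receive` are hypothetical, and the `yield` only marks the point at which hardware would switch the thread out.

```python
def device_a_write(mem, payload):
    """Step a: write Data first, then the Flag, into device B's memory."""
    mem["data"] = payload
    mem["flag"] = 1


def device_b_receive(mem):
    """Steps b to d on device B; each yield marks a switch-out point."""
    while mem["flag"] == 0:        # steps b/c: Flag unchanged, Data incomplete
        yield "switched_out"
    return sum(mem["data"])        # step d: Flag changed, Data is safe to reduce


mem_b = {"data": None, "flag": 0}
receiver = device_b_receive(mem_b)
trace = [next(receiver)]           # first poll happens before the data arrives
device_a_write(mem_b, [1, 2, 3])
try:
    next(receiver)                 # thread resumes and polls again
    result = None
except StopIteration as stop:
    result = stop.value            # value produced by the Reduce in step d
```

Writing the Flag after the Data is what lets the receiver treat a changed Flag as proof that the Data is complete.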

The first solution is mainly based on SIMT (Single Instruction, Multiple Threads) implementation and requires hardware multi-threaded support for execution. The advantage is that software developers write multi-threaded kernels (Kernel) from the Warp perspective and are unaware of switching.

The second solution, the interrupt type, is described in the Chinese patent application with publication number CN114691312A. The communication synchronization steps of the interrupt type are as follows:

Step a. Device A writes the hardware descriptor and data to device B via RDMA.

Step b. Device B receives the hardware descriptor, which means that the data has been received. The data reception completion interrupt causes the hardware to parse the hardware descriptor, and then triggers the computing core to perform computing tasks according to the hardware descriptor content, such as Reduce computing.

The second solution is mainly based on SIMD (Single Instruction, Multiple Data) implementation, which supports computing core reuse through software and requires minimal hardware changes, thus avoiding a large number of hardware replacements or changes.
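The interrupt-type flow can be pictured as a callback that fires when the descriptor arrives. The sketch below is an assumption-laden illustration: the descriptor fields `addr` and `op`, the class `DeviceB`, and the function `rdma_write` are hypothetical, and a direct method call stands in for the hardware interrupt.

```python
class DeviceB:
    """Receiver whose computing core is triggered by a completion interrupt."""

    def __init__(self):
        self.memory = {}
        self.results = []

    def on_receive_complete(self, descriptor):
        """Step b: parse the descriptor and run the computation it describes."""
        data = self.memory[descriptor["addr"]]
        if descriptor["op"] == "reduce":
            self.results.append(sum(data))


def rdma_write(device, descriptor, data):
    """Step a: write the data and the descriptor, then raise the interrupt."""
    device.memory[descriptor["addr"]] = data
    device.on_receive_complete(descriptor)   # models the completion interrupt


device_b = DeviceB()
rdma_write(device_b, {"addr": 0x100, "op": "reduce"}, [1, 2, 3])
```

No core on device B polls here; the computation starts only when the descriptor has been received.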

The specific embodiments of the present disclosure will be described below in conjunction with the accompanying drawings.

FIG. 5 shows a method for executing an inter-chip communication task according to an embodiment of the present disclosure, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes multiple communication primitives, and the multiple communication primitives include serial communication primitives connected in series. The method includes: in operation S510, performing a search of the communication primitive queue to determine the status of the serial communication primitives in the communication primitive queue; and in operation S520, in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.

First, the above technical features of the present disclosure are executed in "coroutine" mode. A coroutine is a non-preemptive scheduling mechanism in which the software itself actively wakes up and goes to sleep. In contrast, a thread is a preemptive scheduling mechanism in which the software is passively woken up and put to sleep by the operating system and hardware scheduling.
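Operations S510 and S520 can be sketched as a scan followed by a resume. This is a minimal illustration, not the claimed implementation: the state names in `State`, the functions `find_interrupted` and `resume`, and the `execute` callback that returns True on completion are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    DONE = "done"
    INTERRUPTED = "interrupted"
    PENDING = "pending"


def find_interrupted(states):
    """S510: scan the queue in order; return the first interrupted index or None."""
    for index, state in enumerate(states):
        if state is State.INTERRUPTED:
            return index
    return None


def resume(queue, states, execute):
    """S520: re-execute the queue starting from the interrupted primitive."""
    start = find_interrupted(states)
    if start is None:
        return states
    for index in range(start, len(queue)):
        if execute(queue[index]):
            states[index] = State.DONE
        else:
            states[index] = State.INTERRUPTED   # blocked again: stop here
            break
    return states


queue = [f"OP{i}" for i in range(6)]
states = [State.DONE, State.DONE, State.INTERRUPTED] + [State.PENDING] * 3
executed = []


def execute(op):
    executed.append(op)
    return True


resume(queue, states, execute)
```

In this run the primitives already completed (OP0 and OP1) are skipped, and execution restarts at OP2.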

For example, in multi-threaded mode, the code can be:

With the hardware multithreading of the above code, the operating system automatically switches the thread out to execute other functions when its time slice is exhausted during read(X), and the thread resumes execution after a period of time. The thread is unaware of the switching-out and switching-in process.

In coroutine mode, the code can be:

In coroutine mode, the coroutine will actively sleep and wake up other functions in Sleep state.
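The cooperative behaviour can be illustrated with Python generators. This is an illustrative sketch rather than code from the disclosure: `yield` plays the role of Sleep, and `reader`, `writer` and `run_all` are hypothetical names for two coroutines and a round-robin scheduler.

```python
from collections import deque


def reader(mailbox, log):
    """Waits for X; sleeps voluntarily instead of being preempted."""
    while "X" not in mailbox:
        log.append("reader sleeps")
        yield                       # Sleep: hand the core to another function
    log.append(f"reader got {mailbox['X']}")


def writer(mailbox, log):
    """Produces X, then yields once before finishing."""
    log.append("writer runs")
    mailbox["X"] = 42
    yield


def run_all(coroutines):
    """Round-robin scheduler: resume each coroutine until all have finished."""
    ready = deque(coroutines)
    while ready:
        current = ready.popleft()
        try:
            next(current)
            ready.append(current)   # it slept; schedule it again later
        except StopIteration:
            pass                    # it finished


mailbox, log = {}, []
run_all([reader(mailbox, log), writer(mailbox, log)])
```

The switch points are explicit in the code, which is the main difference from the thread case, where the switch is invisible to the thread.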

For the wake-up-sleep coroutine mode, please refer to the Chinese patent application with publication number CN114691312A.

During communication, taking Ring AllReduce as an example, communication primitives such as SEND, RecvReduceSEND, RecvReduceCopySEND, RecvCopySEND, and Recv are executed; these may be abbreviated as OP in the present disclosure. It should be understood that "communication primitive" and "OP" are used interchangeably herein, and the choice between them simply follows the context.

The communication primitives disclosed herein may be asynchronous communication primitives. Asynchronous communication primitives may have the following characteristics:

Asynchronicity: The execution of the communication primitive is asynchronous, that is, when the communication primitive function returns, the communication operation represented by the communication primitive may still be executing on the hardware, and the communication primitive cannot be confirmed as fully completed until the asynchronous response is received.

Non-idempotent: Repeated execution of a communication primitive may lead to incorrect execution results, so communication primitives usually cannot be executed twice unless the previous communication primitive was not fully executed due to a communication error.

Unreliable: The communication primitive may cause packet loss due to problems such as link quality, so the communication primitive needs to re-execute the transmission of the lost part.
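The interplay of non-idempotence and unreliability can be shown with a toy lossy link. The sketch below is a simplification under stated assumptions: `send_chunks` is a hypothetical function, the receiver accumulates whatever arrives (so resending a delivered chunk would corrupt the result), and acknowledgements are returned immediately rather than asynchronously to keep the example short.

```python
def send_chunks(chunks, indices, receiver, drop):
    """Send the chunks at `indices`; return the set of acknowledged indices.

    `drop` lists the indices that the link loses on this attempt.
    """
    acked = set()
    for index in indices:
        if index in drop:
            continue                                  # packet lost on the link
        # The receiver accumulates, so a duplicate delivery would double-count.
        receiver[index] = receiver.get(index, 0) + chunks[index]
        acked.add(index)
    return acked


chunks = [1, 2, 3, 4]
receiver = {}
pending = set(range(len(chunks)))

# First attempt: chunk 2 is lost.
pending -= send_chunks(chunks, pending, receiver, drop={2})
lost_after_first_attempt = set(pending)

# Retransmit only the part that was not acknowledged.
pending -= send_chunks(chunks, pending, receiver, drop=set())
```

Tracking the unacknowledged indices lets the retry cover only the lost part, which avoids re-executing the portion that already succeeded.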

In the present disclosure, a communication primitive queue may be used to describe an inter-chip communication task, and the communication primitives in the queue may be executed serially or concurrently. For ease of description, serially connected communication primitives are referred to as serial communication primitives, and parallel connected communication primitives are referred to as concurrent communication primitives. A communication primitive queue may include multiple serial communication primitives, or may include a mix of serial communication primitives and concurrent communication primitives, which will be described in more detail later.

FIG. 6 shows an example of coroutine execution according to one embodiment of the present disclosure.

In FIG6 , the communication process can be abstracted as a serial execution process of a series of communication primitives OP, the initial state of which is START (start), and the terminal state is FINISH (end). Under normal circumstances, if no communication blocking is encountered, the coroutine operation will execute communication primitives OP0 to OP5 one by one from START until FINISH is executed. It should be pointed out that, for the convenience of identification, in FIG6 , the communication primitives OP that are skipped and not executed are marked with a dotted frame, and the communication primitives OP that are actually executed are represented by a solid frame.

But in fact, in real communication, we usually encounter the situation of "communication blocking". At this time, the computing core actively calls Sleep to exit, and the variables involved in the scene reside in a specific storage space area. As shown in "Coroutine Sleep" in Figure 6, assuming that the execution of communication primitives OP0 and OP1 is successfully completed, but the communication primitive OP2 is currently blocked, then the subsequent communication primitives OP3, OP4, and OP5 will no longer be executed. At this time, the computing core can turn to perform other communication tasks.

As shown in the "Coroutine Recovery" in Figure 6, when the "blocking is released", that is, when the communication synchronization is completed, the communication primitive OP2 that was in Sleep last time can be continued to execute. If it is no longer blocked, it will be executed directly to FINISH to end the task of the computing core.

In order to continue the execution of the last communication primitive OP when the blockage is released, the software needs to perform context recovery by itself. Context recovery includes two parts of recovery:

1) Restoration of context data: Restoration of context data only requires reloading the content in a specific storage space area.

2) Program execution position recovery: Due to hardware simplification design considerations, the hardware does not support direct recovery to the position of the communication primitive OP when Sleep exits. Therefore, this solution designs a communication primitive jump mechanism.

Therefore, according to one embodiment of the present disclosure, in order to implement the jump mechanism described above, the method of the present disclosure further includes: defining a state machine, wherein the state machine is used to describe the working state of the communication primitive; determining whether the communication primitive is interrupted according to the working state of the communication primitive described by the state machine; wherein the working state includes: waiting state; working state; executed state; and confirmed state.

FIG. 7 shows possible changes in the working state of the communication primitive OP.

As shown in Figure 7, a communication primitive OP that has not yet been executed is in the waiting state. All communication primitives OP may initially be in the waiting state, and their state changes only after they are executed.

The working state follows the waiting state and indicates that the communication primitive is being executed: the communication request of the communication primitive has not been completely issued, and the response signal has not been completely received. In other words, it is the state of the communication primitive OP while the computing core is executing it. For a single computing core, at most one communication primitive is in the "working" state. Under normal circumstances, every communication primitive OP changes from the waiting state to the working state, unless the execution of the OP is skipped.

The Executed state, after the "Working" state, is used to indicate that the communication primitive has been executed. The communication request of the communication primitive has been completely issued, and the response signal has not been completely received. After a communication primitive OP is executed and the communication request has been completely issued, the communication primitive OP will wait for the response signal for the communication request. However, in many cases, the communication primitive OP may not receive the response signal, or the reception of the response signal may be greatly delayed. These situations include but are not limited to: the request signal issued does not reach the downstream communication primitive OP (for example, due to a communication line failure); the request signal issued reaches the downstream OP, but the signal of the downstream OP does not issue a response signal (for example, the downstream OP fails), or although the downstream OP issues a response signal, the OP that issued the communication request does not receive the response signal (for example, the communication line fails). The above are just a few examples of situations where the OP does not receive a response signal. Many other types of failures may also cause the OP to not receive a response signal, which will not be listed here.

The Confirmed state is after the Executed state and is used to indicate that the communication request of the communication primitive has been completely sent out and the response signal has been completely received. The Confirmed state is the last state of the four states in the present disclosure.

In the present disclosure, it is assumed that all communication primitives are initially in the waiting state. Then, when the communication primitive queue is started, if no interruption occurs, each communication primitive in the queue enters the working state from the waiting state one by one. The term "one by one" here refers to the serial execution of the communication primitives; that is, for communication on a single computing core, at most one communication primitive is in the working state. In other words, while the computing core is executing one communication primitive, no other communication primitive is executed at the same time, and the next communication primitive is executed only after the execution of the current one has been exited.

According to an embodiment of the present disclosure, the working state is converted according to the execution status of the communication primitive, wherein the conversion of the working state is unidirectional.

In this embodiment, the communication primitive is converted in the order of waiting state, working state, executed state and confirmed state, that is, the state of the communication primitive can only be converted from the waiting state to the working state, from the working state to the executed state, and from the executed state to the confirmed state, but cannot be converted in the opposite direction.
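The one-way ordering described above can be sketched as an ordered enumeration. This is only an illustrative model; the names `OpState`, `advance` and `is_legal_transition` are assumptions and are not defined in the present disclosure.

```python
from enum import IntEnum

class OpState(IntEnum):
    """The four working states, ordered so that a larger value is a later state."""
    WAITING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3

def advance(state: OpState) -> OpState:
    """Move a communication primitive to its next state; CONFIRMED is terminal."""
    if state is OpState.CONFIRMED:
        raise ValueError("CONFIRMED is the last state")
    return OpState(state + 1)

def is_legal_transition(old: OpState, new: OpState) -> bool:
    """A primitive may only step forward to the immediately following state."""
    return new == old + 1
```

Encoding the states as ordered integers makes the unidirectional rule a simple comparison.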

It is also necessary to understand that the "state" described in this application has two meanings, namely the state of the communication primitive and the state of the state machine; the change in the state of the communication primitive can cause the state of the state machine to change accordingly, but in practice, the change in the state of the state machine is not necessarily synchronized with the state of the communication primitive. For example, the state machine will periodically scan the queue of communication primitives, thereby updating its own state according to the state of the scanned communication primitive. In this case, the state of the communication primitive may have changed, but the state of the state machine may not have changed. In addition, since there may be a certain time interval for scanning the state of the communication primitive, the state change in the state machine can jump, for example, it can jump directly from the "waiting" state to the "executed" state, or it can jump to the "confirmed" state, while the state of the communication primitive itself cannot jump.

In addition, according to an embodiment of the present disclosure, in the communication primitive queue, the working state of a subsequent serial communication primitive is prohibited from being later in the state order than the working state of a previous serial communication primitive; in other words, the working state of the previous serial communication primitive is prohibited from being earlier than the working state of the subsequent serial communication primitive.

Still taking Figures 6 and 7 as an example, assuming that communication primitives OP1 to OP5 are communication primitives that are executed serially and in sequence, in this embodiment, if the communication primitive OP3 in the "executed" state is taken as a reference, then the communication primitives OP1 and OP2 before the communication primitive OP3 can be in the executed state or the confirmed state, but are prohibited from being in the waiting state or the working state, that is, the states of the communication primitives OP1 and OP2 before the communication primitive OP3 cannot be before the state of the communication primitive OP3; and the communication primitives OP4 and OP5 after the communication primitive OP3 can be in the waiting state, the working state or the executed state, but cannot be in the confirmed state, that is, the states of the communication primitives OP4 and OP5 after the communication primitive OP3 cannot be after the state of the communication primitive OP3.

It should be noted that the above paragraphs are explained based on the communication primitive OP3 only, and the above rules must be followed in the entire communication primitive queue. In addition, it should be noted that, as mentioned above, only one communication primitive is in the working state, so in the serial communication primitive queue, the communication primitive before the communication primitive in the working state needs to be in the executed state or the confirmed state, and cannot be in the working state at the same time.
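The ordering rule for a serial communication primitive queue can be expressed as a small consistency check. The sketch below is illustrative only; `OpState` and `queue_is_consistent` are hypothetical names, and the states are modeled as ordered integers as an assumption.

```python
from enum import IntEnum

class OpState(IntEnum):
    WAITING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3

def queue_is_consistent(states):
    """Check a list of OpState values given in queue order.

    Rule 1: a later primitive is never in a later state than an earlier one.
    Rule 2: at most one primitive is in the WORKING state.
    """
    non_increasing = all(a >= b for a, b in zip(states, states[1:]))
    at_most_one_working = sum(s is OpState.WORKING for s in states) <= 1
    return non_increasing and at_most_one_working
```

For example, a queue whose states are confirmed, executed, executed, waiting passes the check, whereas executed followed by confirmed does not.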

According to the above implementation, the state of a communication primitive is never reversed, which makes it convenient to put the execution of the communication primitive queue to sleep when blocking occurs.

FIG8 is a schematic diagram showing the execution of a serial primitive queue according to an embodiment of the present disclosure. In FIG8 , in order to distinguish different states, different backgrounds are used for distinction, for example, the "waiting" state is represented by a blank background, the "working" state is represented by a horizontal line background, the "executed" state is represented by a vertical line background, and the "confirmed" state is represented by a gray background. In addition, a dotted frame represents a skipped OP, a solid frame represents an executed OP, and a bold solid frame represents the currently selected and executed OP.

According to one embodiment of the present disclosure, the method of the present disclosure further comprises: in response to communication congestion occurring in a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive where communication congestion occurs.

In step a, as shown in the line "Start" in FIG8 , first, search for the OP in the "waiting" state from front to back, and convert the OP into the "working" state. If the OP is executed, it is converted into the "executed" state. Thus, the above step a is repeatedly executed until a communication blockage occurs. As shown in FIG8 , assuming that a communication blockage occurs at OP3, a sleep operation is performed at OP3. Then, after the hardware wakes up the communication primitive queue, step b is executed. As shown in FIG8 , after a period of execution, OP0-OP4 may all be in the "executed" state, while OP5-OP7 are still in the "waiting" state because they have not been selected for execution.

According to one embodiment of the present disclosure, the method further includes: in response to communication congestion occurring in a serial communication primitive, maintaining a working state of the corresponding serial communication primitive in an executed state.

After running for a period of time, OP0-OP2 have all successfully received their response signals and have therefore entered the "confirmed" state, while communication blocking occurs at OP3, so the sleep operation is performed at OP3. As stated above, the working state of a subsequent serial communication primitive is prohibited from being after the working state of the previous serial communication primitive, so OP4 also remains in the "executed" state and cannot proceed to the "confirmed" state.

In step b, when waiting to be reawakened, i.e., when recovering, it can be determined that some OPs have been converted to OPs in the "confirmed" state (OP0-OP2 in the "recovery" row) due to asynchronous confirmation. Search the OPs in the "executed" state from front to back, because these OPs have communication errors and need to be re-executed (OP3-OP4 in the "recovery" row). When all the "executed" state OPs have been searched (OP5 in the "recovery" row), repeat step a. If all OPs are in the "confirmed" state, go to step c.

It is understandable that after recovery, if the execution of all communication primitives is normal, the next OPs (e.g., OP5 to OP7) will be executed one by one until all OPs are executed.

In step c, all OPs have been executed and the asynchronous operation has been confirmed to return, so the end state can be entered (the last line).
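Steps a to c can be sketched as a single pass over the queue that is repeated on every wake-up. This is a simplified model under stated assumptions: `run_queue`, the `issue` callback (which returns False when an OP blocks) and the "SLEEP"/"FINISH" return values are hypothetical, and asynchronous confirmations are assumed to be written into `states` by something outside this function.

```python
from enum import IntEnum

class OpState(IntEnum):
    WAITING = 0
    WORKING = 1
    EXECUTED = 2
    CONFIRMED = 3

def run_queue(states, issue):
    """Run the serial queue once, from front to back.

    states: list of OpState, one per OP, updated in place.
    issue(i): hypothetical callback, True if OP i was issued without blocking.
    """
    for i, state in enumerate(states):
        if state is OpState.CONFIRMED:
            continue                      # step b: confirmed OPs are skipped
        if state is OpState.WAITING:
            states[i] = OpState.WORKING   # step a: select the next waiting OP
        # OPs that are EXECUTED but unconfirmed also reach this point
        # and are re-executed on recovery.
        if not issue(i):
            return "SLEEP"                # blocked: exit the queue at this OP
        states[i] = OpState.EXECUTED
    if all(s is OpState.CONFIRMED for s in states):
        return "FINISH"                   # step c: everything is confirmed
    return "SLEEP"                        # assumption: wait for pending responses
```

Because confirmed OPs are skipped and states never move backwards, calling `run_queue` again after a wake-up naturally resumes from the first unconfirmed OP.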

The above describes, in conjunction with Figure 8, the state changes of different OPs as they are executed. According to one embodiment of the present disclosure, re-executing the communication primitive queue starting from the serial communication primitive where the interruption occurred (for example, when execution is resumed from OP3 as shown in Figure 8) includes: for a partially executed serial communication primitive, re-executing only the portion of the serial communication primitive that has not been executed.

It is understandable that, as described above, for the communication primitive queue, the communication primitives that have been executed and are in the "confirmed" state will no longer be executed when the communication primitive queue is restored, and only those communication primitives that have not been executed or are incompletely executed will be executed. According to the preferred embodiment of the present disclosure, for a certain communication primitive, if the communication primitive is blocked after a part of it is executed, the execution of the communication primitive is incomplete; then, for the incompletely executed communication primitive, the part that has been executed will no longer be executed, and only the part that has not been executed will be executed.

This has several beneficial effects. For example, executing only the part of a communication primitive that has not been executed helps to reduce repeated execution of communication primitives and improves efficiency during recovery.

In addition, for in-place operation, only executing the part of the communication primitive that has not been executed can avoid data errors. In-place operation means that the operation occurs at the storage location where the data involved in the operation originally resides. Figure 9 shows a schematic diagram of the communication primitive involving in-place operation.

As shown in Figure 9, assume that the original data of an in-situ operation includes a 4*4 data matrix, the 0th row contains data {0,0,0,0}, the 1st row contains data {1,1,1,1}, the 2nd row contains data {2,2,2,2}, and the 3rd row contains data {3,3,3,3}. When performing in-situ operations, the above original data will be overwritten. Assume that these original data have only undergone partial operations, and the original data of rows 0 and 1 are overwritten by new data, for example, the new data are {0,1,2,3} and {3,2,1,0} respectively, but the data of rows 2 and 3 are not operated due to communication congestion, and the original data is still retained. In this case, if all data are still operated during recovery, errors may occur because the data has been updated, while if only the part that has not been executed is executed, the above errors will not occur.
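The Figure 9 scenario can be reproduced with a small sketch. The in-place operation is assumed here to be an element-wise addition of received data into the local rows, and the received values are chosen so that rows 0 and 1 become {0,1,2,3} and {3,2,1,0}; both the operation and the name `reduce_in_place` are illustrative assumptions rather than part of the disclosure.

```python
def reduce_in_place(local, received, start_row=0, block_at=None):
    """Add `received` into `local` row by row, overwriting `local`.

    Returns the index of the first row that was NOT processed, so that a
    later call can resume from exactly that row.
    """
    for r in range(start_row, len(local)):
        if r == block_at:          # simulated communication blocking
            return r
        local[r] = [a + b for a, b in zip(local[r], received[r])]
    return len(local)

local = [[0] * 4, [1] * 4, [2] * 4, [3] * 4]
received = [[0, 1, 2, 3], [2, 1, 0, -1], [1, 1, 1, 1], [1, 1, 1, 1]]

# First attempt: rows 0 and 1 are overwritten, then blocking occurs at row 2.
next_row = reduce_in_place(local, received, block_at=2)
partial = [row[:] for row in local]

# Correct recovery: continue from the first unprocessed row only.
reduce_in_place(local, received, start_row=next_row)

# Incorrect recovery: redoing every row applies the update to rows 0 and 1 twice.
redone = [row[:] for row in partial]
reduce_in_place(redone, received, start_row=0)
```

After the correct recovery, rows 0 and 1 keep the values written by the first attempt, whereas in `redone` they have been updated a second time and no longer match.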

The above describes the scenario of serial execution between multiple OPs, but there may also be a situation where multiple OPs are executed concurrently.

FIG. 10 shows an exemplary application scenario in which there are multiple concurrent communication primitives.

As shown in Figure 10, there is a connection relationship between the upstream communication node A and the two downstream communication nodes B and C, and the communication primitive OP1 is executed between the communication nodes A and B, and the communication primitive OP2 is executed between the communication nodes A and C. These are two concurrent communication primitives, because the two communication primitives OP1 and OP2 use different communication paths and do not interfere with each other. In this case, the concurrent execution of the communication primitives can be achieved through the aforementioned coroutine and primitive jump mechanism. According to one embodiment of the present disclosure, concurrent execution means that different communication primitives can be executed alternately, thereby realizing an execution method similar to multi-threading in a single core. Through this mechanism, a single computing core can be used to complete the concurrent execution of multiple different communication primitives, and multiple different computing cores are not required.

According to one embodiment of the present disclosure, the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further includes: executing the concurrent communication primitives in a time-sharing manner.

It should be understood that the "concurrency" mentioned here does not mean execution at the same time, but rather concurrent execution of communication primitives in the form of time division or time division multiplexing.

Figures 11a to 11f show schematic diagrams of setting concurrent communication primitives between serial communication primitives according to an embodiment of the present disclosure. Wherein, similar to Figures 7 and 8, in order to distinguish different states, different backgrounds are used for distinction, for example, the "waiting" state is represented by a blank background, the "working" state is represented by a horizontal line background, the "executed" state is represented by a vertical line background, and the "confirmed" state is represented by a gray background. In addition, the skipped OP is represented by a dotted box, and the executed OP is represented by a solid box. It should be understood that the boxes "Start", "End", "FB" and "FE" are all represented by a blank background, but they only represent some key points of the OP execution, and do not mean that they must also participate in the execution of the OP.

According to one embodiment of the present disclosure, the method of the present disclosure further includes: inserting a concurrent start identifier between the concurrent communication primitive and the previous serial communication primitive, so that when the communication primitive queue executes to the concurrent start identifier, the concurrent communication primitive is executed in time-sharing; and inserting a concurrent end identifier between the concurrent communication primitive and the subsequent serial communication primitive, so that when the communication primitive queue executes to the concurrent end identifier, the concurrent communication primitive is re-executed according to the state of the concurrent communication primitive, or the execution of the concurrent communication primitive is exited.

As shown in FIG11a , assuming that OP0, OP1, OP3, OP4 and OP5 are the serial communication primitives described above, there are three concurrent communication primitives OP21, OP22 and OP23 between OP1 and OP3, wherein if OP21, OP22 and OP23 are regarded as a whole (assuming they are represented as OP2), then OP0, OP1, OP2, OP3, OP4 and OP5 still constitute a serial communication primitive queue, which still complies with the rules described above; however, if OP21, OP22 and OP23 are treated separately, then OP21, OP22 and OP23 can be executed in time-sharing manner.

The difference between the concurrent communication primitives OP21, OP22 and OP23 and the serial communication primitives OP0, OP1, OP3, OP4 and OP5 is that when executing the serial communication primitives, if communication blocking occurs, the sleep operation is performed from the serial communication primitive where the blocking occurs; while when executing concurrent communication primitives, the blocking of one concurrent communication primitive does not affect the execution of other concurrent communication primitives.

Still as shown in FIG. 11a, in order to distinguish between the serial communication primitives and the concurrent communication primitives, a flag can be inserted between the two so that different execution modes can be adopted. Specifically, a concurrent start flag FB (Flag Begin) can be inserted between the serial communication primitive OP1 and the concurrent communication primitives OP21, OP22 and OP23. When the concurrent start flag FB is executed, the communication primitives after the concurrent start flag FB can be automatically executed concurrently; similarly, a concurrent end flag FE (Flag End) can be inserted between the concurrent communication primitives OP21, OP22 and OP23 and the serial communication primitive OP3. When the concurrent end flag FE is executed, it means that the current round of execution of the concurrent communication primitives has ended.

It should be understood that the above-mentioned "end of execution of this concurrent communication primitive" does not mean to exit the execution of all concurrent primitives, but it can be executed again, that is, re-execute the blocked concurrent communication primitive, or exit the execution of the concurrent communication primitive.

Figure 11a shows the states of all communication primitives and their relationships when not executed, wherein all communication primitives OP are in a waiting state, represented by a dotted frame. Figure 11b shows a schematic diagram of executing concurrent communication primitives multiple times.

According to one embodiment of the present disclosure, executing the concurrent communication primitives in time-sharing manner includes: making the concurrent communication primitives alternately enter a working state from a waiting state.

As shown in Figure 11b, assuming that the serial communication primitives OP0 and OP1 have been executed and are in the "confirmed" state, after passing through the concurrent start mark FB, the execution of the concurrent communication primitives OP21, OP22 and OP23 can be entered, so that OP21 to OP23 can alternately (or time-sharingly) enter the "working" state, but unlike the execution of the serial communication primitives, the concurrent communication primitives OP21 to OP23 are concurrent, so their states are independent of each other. For example, as shown in Figure 11b, after the first execution of the concurrent communication primitives OP21 to OP23, OP21 can be in the "working" state, OP22 can be in the "executed" state, and OP23 can be in the "working" state. Assuming that the concurrent communication primitive OP2 is blocked at this time, the communication primitive can be re-executed according to the following methods and rules.

According to an embodiment of the present disclosure, according to the status of the concurrent communication primitive, re-executing the concurrent communication primitive includes: in response to not all of the multiple concurrent communication primitives experiencing communication congestion, re-executing the concurrent communication primitive.

Still taking FIG. 11b as an example, since the communication primitive OP22 is not blocked, according to a rule of the present disclosure, when the concurrent communication primitives OP21, OP22 and OP23 are not all blocked, the blocked communication primitives OP21 and OP23 can be executed again. Specifically, after the concurrent communication primitive OP23 is executed, the concurrent end identifier FE is entered, and the concurrent communication primitive OP21 is returned to re-execute the concurrent communication primitive that was blocked last time.

According to one embodiment of the present disclosure, re-executing the concurrent communication primitive includes: for a concurrent communication primitive that has been partially executed, re-executing only the portion of the concurrent communication primitive that has not been executed.

As mentioned above, for a certain communication primitive, if the communication primitive is blocked after executing part of it, the execution of the communication primitive is incomplete; then, for the incompletely executed communication primitive, the part that has been executed will no longer be executed, and only the part that has not been executed will be executed.

In addition, the above text describes the beneficial effects of executing only the unexecuted portion in some specific situations in conjunction with FIG. 9 , and such beneficial effects are also applicable to concurrent communication primitives.

According to an embodiment of the present disclosure, re-executing the concurrent communication primitive according to the state of the concurrent communication primitive further includes: skipping the concurrent communication primitive in the confirmed state without re-executing it.

As shown in Figure 11c, assuming that when the concurrent communication primitives OP21, OP22 and OP23 are re-executed, the concurrent communication primitive OP22 is already in the "confirmed" state, and communication congestion has occurred at the concurrent communication primitives OP21 and OP23 and is in the "executed" state, then the execution of the concurrent communication primitive OP22 is skipped at this time, and only the concurrent communication primitives OP21 and OP23 are executed alternately again.

It should be understood that, when re-executing the concurrent communication primitives OP21, OP22 and OP23, the "confirmed" state of OP22 in the above embodiment is only one implementation, and OP22 does not have to be in the "confirmed" state to be skipped. In essence, as long as a concurrent communication primitive did not encounter communication congestion when it was executed, its execution is skipped in the second round even if it is not yet in the "confirmed" state.

On the other hand, according to one embodiment of the present disclosure, based on the status of the concurrent communication primitive, exiting the execution of the concurrent communication primitive includes: in response to communication congestion occurring in all of the multiple concurrent communication primitives, exiting the execution of the concurrent communication primitive and exiting the execution of the communication primitive queue.

In this case, if during the execution of the concurrent communication primitives OP21, OP22 and OP23, all three concurrent communication primitives are blocked in communication, so that none of the concurrent communication primitives can execute normally, then the execution of these communication primitives can be exited and the sleep state can be entered. OP21, OP22 and OP23 can be regarded as a whole as one serial communication primitive, which forms a serial relationship with the upstream serial communication primitive OP1 and the downstream OP3. Therefore, according to the above description, when a serial communication primitive is blocked in communication, the execution can be exited from the currently blocked serial communication primitive.

Further, according to one embodiment of the present disclosure, based on the state of the concurrent communication primitive, exiting the execution of the concurrent communication primitive includes: in response to the number of times the concurrent communication primitive is re-executed reaching a predetermined number of times and communication congestion occurs in at least one of the multiple concurrent communication primitives, exiting the execution of the concurrent communication primitive and exiting the execution of the communication primitive queue.

According to the above-mentioned embodiment of the present disclosure, after returning from the concurrent end identifier FE to the concurrent communication primitive for execution again, if the number of returns exceeds the predetermined number of times, when there is still a communication blocking in the concurrent communication primitive, these concurrent communication primitives may not be returned, and the communication primitive queue may be exited. For example, as shown in Figure 11d, after returning from the concurrent end identifier FE to the concurrent communication primitive twice, if there is still at least one concurrent communication primitive (OP21 and OP23 in Figure 11d) among the concurrent communication primitives OP21, OP22 and OP23 that is still in a blocked state, these concurrent communication primitives may not be returned, and the communication primitive queue may be exited. Such a beneficial effect is that the execution of the concurrent communication primitive will not be endlessly looped. When it is tried many times and still cannot be solved, it will normally exit from the concurrent communication primitive and exit the execution of the entire communication primitive queue to avoid deadlock.

A counter may be added at the concurrent end identifier FE. When the number of returns from the concurrent end identifier FE to the concurrent communication primitives reaches a value specified by the counter, the execution may be exited and sleep may be entered.
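The rules for the concurrent section between FB and FE (time-shared execution, skipping primitives that did not block, exiting when all primitives block, and the counter at FE) can be combined in one sketch. The function `run_concurrent`, the `try_op` callback and the "NEXT"/"SLEEP" return values are hypothetical names introduced for illustration.

```python
def run_concurrent(ops, try_op, max_returns):
    """Execute the concurrent OPs between FB and FE on a single core.

    ops: names of the concurrent OPs, in queue order.
    try_op(name): hypothetical callback, True if the OP ran without blocking.
    max_returns: how many times FE may jump back to the concurrent OPs.
    Returns ("NEXT", []) when no OP is blocked any more, or
    ("SLEEP", blocked_ops) when the section has to be exited.
    """
    pending = list(ops)
    returns = 0                        # counter kept at FE
    while True:
        # One pass from FB to FE: the pending OPs take turns on the core.
        blocked = [op for op in pending if not try_op(op)]
        if not blocked:
            return "NEXT", []          # go on to the serial OP after FE
        if len(blocked) == len(ops) or returns == max_returns:
            return "SLEEP", blocked    # all OPs blocked, or retry budget used up
        pending = blocked              # return from FE; unblocked OPs are skipped
        returns += 1
```

Bounding the number of returns is what keeps the loop from spinning forever when one concurrent primitive stays blocked.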

According to one embodiment of the present disclosure, in response to exiting the execution of the concurrent communication primitive, a resume identifier is added at the concurrent start identifier to facilitate easy search for the exit location when resuming the execution of the communication primitive.

Still as shown in Figure 11d, after executing the concurrent communication primitives OP21, OP22 and OP23 multiple times, some of these concurrent communication primitives still have communication blockage, and will exit from the execution of these concurrent communication primitives. When exiting, in order to facilitate the smooth search for the exit point when resuming execution, a recovery marker can be added at the concurrent start marker.

FIG. 11e shows a schematic diagram of coroutine recovery.

As shown in FIG. 11e, according to one embodiment of the present disclosure, in response to finding the recovery identifier, the concurrent communication primitives in which communication blocking occurred are re-executed.

As shown in Figure 11e, a search is first performed for a recovery identifier. If the recovery identifier is found at the concurrent start identifier FB, at least one of the concurrent communication primitives after FB has been blocked and needs to be re-executed. In this case, execution jumps directly from the head of the queue to the concurrent start identifier FB, skipping the serial communication primitives OP1 and OP2, and then enters the concurrent communication primitives OP21, OP22 and OP23. Since the concurrent communication primitive OP22 is already in the "confirmed" state, it is skipped, and only the blocked concurrent communication primitives OP21 and OP23 are executed. If at least one of OP21 and OP23 is still blocked, execution can either return from the concurrent end identifier FE to re-execute them multiple times, or exit the concurrent communication primitives from FE and exit the entire communication primitive queue.
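The recovery step of Figure 11e can be sketched as follows. This is only an illustration: `PrimitiveQueue`, `recovery_marker` and `primitives_to_resume` are hypothetical names, the string labels for states are assumptions, and the case where no recovery identifier is present (handled by the serial search described elsewhere in the disclosure) is deliberately not modelled.

```python
from dataclasses import dataclass

CONFIRMED = "confirmed"
BLOCKED = "blocked"   # hypothetical label for a primitive whose communication was blocked


@dataclass
class PrimitiveQueue:
    serial_before: list            # serial primitives ahead of FB, e.g. OP1 and OP2
    concurrent: dict               # primitive name -> state, between FB and FE
    recovery_marker: bool = False  # set at FB when the concurrent section was exited


def primitives_to_resume(queue):
    """Return the names to re-execute when resuming, or None if not applicable."""
    if not queue.recovery_marker:
        # No recovery identifier at FB: outside the scope of this sketch.
        return None
    # Recovery identifier found at FB: the serial primitives before FB are
    # skipped, confirmed concurrent primitives are skipped, and only the
    # remaining concurrent primitives are re-executed.
    return [name for name, state in queue.concurrent.items() if state != CONFIRMED]


queue = PrimitiveQueue(
    serial_before=["OP1", "OP2"],
    concurrent={"OP21": BLOCKED, "OP22": CONFIRMED, "OP23": BLOCKED},
    recovery_marker=True,
)
to_run = primitives_to_resume(queue)
```

In this scenario `to_run` contains only OP21 and OP23, matching the description in which OP1, OP2 and the confirmed OP22 are all skipped.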

Alternatively, according to one embodiment of the present disclosure, according to the status of the concurrent communication primitive, exiting the execution of the concurrent communication primitive includes: in response to all the concurrent communication primitives being in a confirmed state, exiting the execution of the concurrent communication primitive, and executing the serial communication primitive after the concurrent end identifier.

As shown in FIG. 11f, if neither OP21 nor OP23 is blocked any longer, execution can exit the concurrent communication primitives from the concurrent end identifier FE and proceed to the serial communication primitive OP3. The execution of the serial communication primitives OP3, OP4 and OP5 has been described above in conjunction with FIG. 8 and is not repeated here. The execution of concurrent communication primitives has been described above in conjunction with FIGS. 11a to 11f; the concurrent communication primitives can be used on their own or combined with serial communication primitives as shown in FIGS. 11a to 11f.
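The three possible outcomes at the concurrent end identifier FE (continue with the next serial primitive, return to the concurrent primitives, or exit the queue) can be summarized as a small decision function. This is a sketch under stated assumptions: `FeAction` and `decide_at_fe` are hypothetical names, and each primitive is reduced to one of three string labels, of which "pending" (started but neither blocked nor confirmed) is an assumption added for illustration.

```python
from enum import Enum


class FeAction(Enum):
    CONTINUE_SERIAL = "execute the serial primitive after FE"
    RETURN_TO_FB = "re-execute the unconfirmed concurrent primitives"
    EXIT_QUEUE = "exit the queue and add the recovery identifier at FB"


def decide_at_fe(states, returns_used, max_returns):
    """Choose the action at FE from the states of the concurrent primitives.

    states: list of "confirmed", "blocked" or "pending" (hypothetical labels).
    """
    if all(s == "confirmed" for s in states):
        return FeAction.CONTINUE_SERIAL      # nothing left to wait for
    if all(s == "blocked" for s in states):
        return FeAction.EXIT_QUEUE           # every primitive is blocked
    if returns_used >= max_returns and any(s == "blocked" for s in states):
        return FeAction.EXIT_QUEUE           # retry budget used up
    return FeAction.RETURN_TO_FB             # some progress is still possible


action = decide_at_fe(["blocked", "confirmed", "blocked"], returns_used=2, max_returns=2)
```

Keeping the decision separate from the execution loop makes each exit condition easy to check in isolation.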

According to an embodiment of the present disclosure, there is also provided an electronic device, comprising: one or more processors; and a memory, wherein the memory stores computer executable instructions, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method as described above.

According to one embodiment of the present disclosure, a computer-readable storage medium is further provided, comprising computer-executable instructions. When the computer-executable instructions are executed by one or more processors, the method described above is executed.

Table 1 below shows the differences between the technical solution of the present disclosure and the first and second solutions described above.

Table 1

The technical solution of the present disclosure uses software coroutines to give the computing core a time-sharing capability without introducing a hardware multi-threading mechanism, so that the computing core can be fully utilized and task deadlock avoided. The coroutine execution process requires relatively few changes to the hardware and can generally support various SIMD processing architectures in implementing software time-sharing. In addition, the primitive jump mechanism supports asynchronous confirmation of asynchronous communication primitives, so automatic software retransmission can be implemented without modifying the OP logic. The alternating execution mechanism supports concurrent execution of multiple communication primitives, which is similar in effect to single-core multi-threading and saves computing cores. The solution of the present disclosure is sufficient to solve the deadlock problem caused by communication congestion.
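The idea of software time-sharing on a single core can be illustrated with Python generators acting as coroutines. This is an analogy rather than the disclosed implementation: the `primitive` generator, the `run_time_shared` scheduler and the round-robin policy are all assumptions made for the sketch.

```python
from collections import deque


def primitive(name, slices, log):
    """Hypothetical coroutine: does one unit of work per time slice, then yields the core."""
    for i in range(slices):
        log.append(f"{name}:{i}")
        yield


def run_time_shared(coroutines):
    """Round-robin scheduler that runs all coroutines on a single thread."""
    ready = deque(coroutines)
    while ready:
        co = ready.popleft()
        try:
            next(co)           # run until the coroutine yields
            ready.append(co)   # not finished: queue it for another slice
        except StopIteration:
            pass               # finished: drop it


log = []
run_time_shared([
    primitive("OP21", 2, log),
    primitive("OP22", 1, log),
    primitive("OP23", 2, log),
])
```

The resulting log interleaves the three primitives slice by slice, which is the single-core analogue of the alternating execution described above.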

According to different application scenarios, the electronic equipment or device disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, means of transportation, household appliances, and/or medical equipment. The means of transportation include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes magnetic resonance imaging machines, ultrasound machines and/or electrocardiographs. The electronic equipment or device disclosed herein may also be applied to the Internet, IoT, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical fields. Further, the electronic equipment or device disclosed herein may also be used in cloud, edge, and terminal applications related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the disclosed solution can be applied to cloud devices (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smart phones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, so as to complete the unified management, scheduling and collaborative work of end-to-end or cloud-edge-to-end.

The technical solution of the present disclosure can be better understood from the following clauses.

Clause 1. A method for performing an inter-chip communication task, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes a plurality of communication primitives, and the plurality of communication primitives include serially connected serial communication primitives, the method comprising:

performing a search of the communication primitive queue to determine the states of the serial communication primitives in the communication primitive queue; and

in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.

Clause 2. The method according to clause 1, further comprising:

defining a state machine, wherein the state machine is used to describe the working state of a communication primitive; and

determining whether a communication primitive is interrupted according to the working state of the communication primitive described by the state machine;

wherein the working state includes:

a waiting state, indicating that the communication primitive has not been executed;

a working state, following the waiting state, indicating that the communication primitive is being executed, the communication request of the communication primitive has not been completely sent, and the response signal has not been completely received;

an executed state, following the working state, indicating that the communication primitive has been executed, the communication request of the communication primitive has been completely sent, and the response signal has not been completely received; and

a confirmed state, following the executed state, indicating that the communication request of the communication primitive has been completely sent and the response signal has been completely received.

Clause 3. The method according to clause 2, further comprising:

starting the execution of the communication primitive queue so that the serial communication primitives enter the working state from the waiting state one by one; and

converting the working state according to the execution status of each serial communication primitive, wherein the conversion of the working state is unidirectional.

Clause 4. The method of clause 3, wherein, for a single computing core, at most one serial communication primitive is in operation.

Clause 5. The method according to clause 3, wherein, in the communication primitive queue, the working state of a subsequent serial communication primitive is prohibited from being behind the working state of a preceding serial communication primitive.

Clause 6. The method according to any one of clauses 1-5, further comprising: in response to communication blocking occurring in a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive where the communication blocking occurs.

Clause 7. The method according to any one of clauses 1-6, further comprising: in response to communication blocking occurring in a serial communication primitive, maintaining the working state of the corresponding serial communication primitive in the executed state.

Clause 8. The method according to any one of clauses 1 to 7, wherein re-executing the communication primitive queue starting from the interrupted serial communication primitive comprises:

For a serial communication primitive that has been partially executed, only the portion of the serial communication primitive that has not been executed is re-executed.

Clause 9. The method according to any one of clauses 1-8, wherein the multiple communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further includes: executing the concurrent communication primitives in a time-sharing manner.

Clause 10. The method according to clause 9, further comprising:

Inserting a concurrent start marker between the concurrent communication primitive and a previous serial communication primitive, so that when the communication primitive queue executes to the concurrent start marker, the concurrent communication primitive is executed in time-sharing manner; and

A concurrent end marker is inserted between the concurrent communication primitive and the next serial communication primitive, so that when the communication primitive queue executes to the concurrent end marker, the concurrent communication primitive is re-executed or the concurrent communication primitive is exited according to the state of the concurrent communication primitive.

Clause 11. The method according to clause 10, wherein executing the concurrent communication primitives in a time-sharing manner comprises: causing the concurrent communication primitives to alternately enter a working state from a waiting state.

Clause 12. The method of clause 10 or 11, wherein, based on the state of the concurrent communication primitive, re-executing the concurrent communication primitive comprises:

In response to not all of the concurrent communication primitives being blocked, the concurrent communication primitives are re-executed.

Clause 13. The method according to clause 12, wherein re-executing the concurrent communication primitive comprises: for a concurrent communication primitive that has been partially executed, re-executing only the portion of the concurrent communication primitive that has not been executed.

Clause 14. The method of clause 12, wherein, based on the state of the concurrent communication primitive, re-executing the concurrent communication primitive further comprises: skipping concurrent communication primitives in a confirmed state without re-executing them.

Clause 15. The method of any one of clauses 10-14, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to communication congestion occurring in all of the multiple concurrent communication primitives, the execution of the concurrent communication primitives is exited, and the execution of the communication primitive queue is exited.

Clause 16. The method of any one of clauses 10-15, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to the number of times the concurrent communication primitive is re-executed reaching a predetermined number and communication congestion occurs in at least one of the plurality of concurrent communication primitives, the execution of the concurrent communication primitive is exited, and the execution of the communication primitive queue is exited.

Clause 17. The method according to clause 15 or 16, wherein, in response to exiting the execution of the concurrent communication primitive, a recovery identifier is added at the concurrent start identifier.

Clause 18. The method according to clause 17, further comprising: in response to finding the recovery identifier, re-executing the concurrent communication primitive where communication congestion occurs.

Clause 19. The method of any one of clauses 10-18, wherein, based on the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to all the concurrent communication primitives being in a confirmed state, the execution of the concurrent communication primitives is exited, and the serial communication primitive after the concurrent end marker is executed.

Clause 20. An electronic device comprising:

one or more processors; and

A memory, wherein computer executable instructions are stored in the memory, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method as described in any one of clauses 1-19.

Clause 21. A computer-readable storage medium comprising computer-executable instructions, which, when executed by one or more processors, perform the method as described in any one of Clauses 1-19.
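The state machine of clauses 2 and 3 and the search of clause 1 can be sketched in Python as follows. The names `State`, `advance`, `is_interrupted` and `first_interrupted` are hypothetical. In particular, treating a primitive in the working or executed state as "interrupted" is an assumption of this sketch, not a definition taken from the clauses.

```python
from enum import IntEnum


class State(IntEnum):
    WAITING = 0     # not yet executed
    WORKING = 1     # request not completely sent, response not completely received
    EXECUTED = 2    # request completely sent, response not completely received
    CONFIRMED = 3   # request completely sent and response completely received


def advance(current, target):
    """Unidirectional transition: a state may stay the same or move forward only."""
    if target < current:
        raise ValueError(f"cannot move back from {current.name} to {target.name}")
    return target


def is_interrupted(state):
    """Assumption: started but not yet confirmed counts as interrupted."""
    return state in (State.WORKING, State.EXECUTED)


def first_interrupted(states):
    """Search the queue and return the index of the first interrupted primitive, or None."""
    for index, state in enumerate(states):
        if is_interrupted(state):
            return index
    return None


# Example queue: two confirmed primitives, one left in the executed state, one not started.
queue_states = [State.CONFIRMED, State.CONFIRMED, State.EXECUTED, State.WAITING]
resume_index = first_interrupted(queue_states)
```

Using an ordered enumeration makes the unidirectional rule a simple comparison, and re-execution of the queue would start at `resume_index`.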

The embodiments of the present disclosure are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present disclosure, and the description of the above embodiments is only intended to help in understanding the method of the present disclosure and its core idea. Modifications or variations made by those skilled in the art to the specific implementations and the scope of application, based on the ideas of the present disclosure, all fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be understood as a limitation on the present disclosure.

Claims

A method for performing an inter-chip communication task, wherein the inter-chip communication task is described by a communication primitive queue, the communication primitive queue includes a plurality of communication primitives, and the plurality of communication primitives include serially connected serial communication primitives, the method comprising:

performing a search of the communication primitive queue to determine the states of the serial communication primitives in the communication primitive queue; and

in response to finding an interrupted serial communication primitive, re-executing the communication primitive queue starting from the interrupted serial communication primitive.
The method according to claim 1, further comprising:

defining a state machine, wherein the state machine is used to describe the working state of the communication primitive;

Determine whether a communication primitive is interrupted according to the working state of the communication primitive described by the state machine;

The working status includes:

Wait state, used to indicate that the communication primitive has not been executed;

The working state, after the waiting state, is used to indicate that the communication primitive is being executed, the communication request of the communication primitive has not been completely sent, and the response signal has not been completely received;

The executed state, after the working state, is used to indicate that the communication primitive has been executed, the communication request of the communication primitive has been completely sent, and the response signal has not been completely received; and

The confirmation state is used to indicate, after the executed state, that the communication request of the communication primitive has been completely issued and the response signal has been completely received.
The method according to claim 2, further comprising:

Starting the execution of the communication primitive queue so that the serial communication primitives enter the working state one by one from the waiting state;

The working state is converted according to the execution status of the serial communication primitive, wherein the conversion of the working state is unidirectional.
The method according to claim 3, wherein, for a single computing core, at most one serial communication primitive is in operation.
The method according to claim 3, wherein, in the communication primitive queue, the working state of the subsequent serial communication primitive is prohibited from being behind the working state of the preceding serial communication primitive.
The method according to any one of claims 1 to 5, further comprising: in response to communication congestion occurring in a serial communication primitive, exiting execution of the communication primitive queue at the serial communication primitive where the communication congestion occurs.
The method according to any one of claims 1 to 6, further comprising: in response to communication congestion of the serial communication primitive, maintaining the working state of the corresponding serial communication primitive in an executed state.
The method according to any one of claims 1 to 7, wherein starting from the serial communication primitive where the interruption occurs, re-executing the communication primitive queue comprises:

For a serial communication primitive that has been partially executed, only the portion of the serial communication primitive that has not been executed is re-executed.
The method according to any one of claims 1 to 8, wherein the plurality of communication primitives further include concurrent communication primitives that can be executed concurrently, and the method further comprises: executing the concurrent communication primitives in a time-sharing manner.
The method according to claim 9, further comprising:

Inserting a concurrent start marker between the concurrent communication primitive and a previous serial communication primitive, so that when the communication primitive queue executes to the concurrent start marker, the concurrent communication primitive is executed in time-sharing manner; and

inserting a concurrent end marker between the concurrent communication primitive and the subsequent serial communication primitive, so that when the communication primitive queue executes to the concurrent end marker, the concurrent communication primitive is re-executed or the execution of the concurrent communication primitive is exited according to the state of the concurrent communication primitive.
The method according to claim 10, wherein executing the concurrent communication primitives in time-sharing comprises: causing the concurrent communication primitives to alternately enter a working state from a waiting state.
The method according to claim 10 or 11, wherein, according to the state of the concurrent communication primitive, re-executing the concurrent communication primitive comprises:

In response to not all of the concurrent communication primitives being blocked, the concurrent communication primitives are re-executed.
The method according to claim 12, wherein re-executing the concurrent communication primitive comprises: for a concurrent communication primitive that has been partially executed, re-executing only the portion of the concurrent communication primitive that has not been executed.
The method according to claim 12, wherein, according to the status of the concurrent communication primitive, re-executing the concurrent communication primitive further comprises: skipping the concurrent communication primitive in the confirmed state without re-executing.
The method according to any one of claims 10 to 14, wherein, according to the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to communication congestion occurring in all of the multiple concurrent communication primitives, the execution of the concurrent communication primitives is exited, and the execution of the communication primitive queue is exited.
The method according to any one of claims 10 to 15, wherein, according to the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to the number of times the concurrent communication primitive is re-executed reaching a predetermined number and communication congestion occurs in at least one of the plurality of concurrent communication primitives, the execution of the concurrent communication primitive is exited, and the execution of the communication primitive queue is exited.
The method according to claim 15 or 16, wherein, in response to exiting the execution of the concurrent communication primitive, a recovery identifier is added at the concurrent start identifier.
The method according to claim 17, further comprising: in response to finding the recovery identifier, re-executing the concurrent communication primitive where communication congestion occurs.
The method according to any one of claims 10 to 18, wherein, according to the state of the concurrent communication primitive, exiting execution of the concurrent communication primitive comprises:

In response to all the concurrent communication primitives being in the confirmation state, the execution of the concurrent communication primitives is exited, and the serial communication primitive after the concurrent end mark is executed.
An electronic device, comprising:

one or more processors; and

A memory, wherein computer executable instructions are stored in the memory, and when the computer executable instructions are executed by the one or more processors, the electronic device executes the method as described in any one of claims 1-19.
A computer-readable storage medium comprises computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the method according to any one of claims 1 to 19 is executed.