CN110377340B - Operation method, device and related product


Info

Publication number: CN110377340B
Authority: CN (China)
Prior art keywords: scheduling, parallel, nodes, node, unit
Legal status: Active (granted)
Application number: CN201910671036.5A
Other languages: Chinese (zh)
Other versions: CN110377340A
Inventor: not disclosed
Current Assignee: Cambricon Technologies Corp Ltd
Original Assignee: Cambricon Technologies Corp Ltd
Application filed by Cambricon Technologies Corp Ltd
Priority to CN201910671036.5A
Priority to CN202110515866.6A (published as CN113204373A)
Publication of CN110377340A
Application granted
Publication of CN110377340B

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F 9/00 Arrangements for program control, e.g. control units; G06F 9/06 using stored programs; G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out-of-order instruction execution (under G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead)
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands (under G06F 9/30003 Arrangements for executing specific machine instructions)

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present disclosure relates to an operation method, an operation device, and related products. A product includes a controller unit comprising an instruction cache unit, an instruction processing unit, and a storage queue unit. The instruction cache unit stores calculation instructions associated with an artificial neural network operation; the instruction processing unit parses a calculation instruction to obtain a plurality of operation instructions; and the storage queue unit stores an instruction queue containing the operation instructions or calculation instructions to be executed in queue order. Through this approach, the operation efficiency of the related products when running a neural network model can be improved.

Description

Operation method, device and related product
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to an operation method, an operation device, and a related product.
Background
In the technical field of artificial intelligence, neural network algorithms have become very popular machine learning algorithms in recent years, achieving excellent results in fields such as image recognition, speech recognition, and natural language processing. As neural network algorithms have developed, their complexity has grown higher and higher, and model scale has gradually increased in order to improve recognition accuracy.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided an instruction generating method, the method comprising:
receiving a computation graph;
counting the scheduling nodes in the computation graph to obtain a first scheduling set;
merging the parallel nodes in the first scheduling set into a parallel scheduling unit to obtain a second scheduling set comprising the parallel scheduling unit, wherein the parallel nodes are scheduling nodes meeting parallel execution conditions;
and generating an instruction according to the second scheduling set.
According to a second aspect of the present disclosure, there is provided an instruction generating apparatus including:
a receiving unit configured to receive a computation graph;
a statistical unit configured to count the scheduling nodes in the computation graph to obtain a first scheduling set;
a merging unit configured to merge parallel nodes in the first scheduling set into a parallel scheduling unit to obtain a second scheduling set including the parallel scheduling unit, where the parallel nodes are scheduling nodes meeting a parallel execution condition;
and an instruction generating unit configured to generate an instruction according to the second scheduling set.
According to a third aspect of the present disclosure, there is provided an arithmetic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of the first aspect described above.
By merging the parallel nodes in the first scheduling set into parallel scheduling units, a second scheduling set including the parallel scheduling units is obtained, and instructions are generated according to the second scheduling set, so that instructions capable of being executed in parallel can be generated, facilitating parallel processing of data segments that have no dependency relationship.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of an instruction generation method according to an embodiment of the present disclosure.
FIG. 2 shows a flow diagram of an instruction generation method according to an embodiment of the present disclosure.
Fig. 3 illustrates a computational graph according to an embodiment of the present disclosure.
FIG. 4 shows a flow diagram of an instruction generation method according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of an application example according to the present disclosure.
Fig. 6 shows a block diagram of an instruction generation apparatus according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an instruction generation apparatus according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The main body of a deep learning algorithm is a neural network, which can be formed by connecting a series of basic operations according to a certain topological structure. Each basic operation may have one or more input and output neurons, and neurons may be shared between operations. Thus, in one possible implementation, the execution of a deep learning algorithm may be represented as a computation graph. A computation graph may include nodes and edges connecting the nodes; the numbers of nodes and edges are not limited and are determined by the specific procedure of the deep learning algorithm. The nodes may represent operations performed in the deep learning process, such as convolution operations or batch normalization operations. Edges between nodes may represent neurons and may indicate the direction of data flow between nodes. For deep learning, trained model data, such as the weights of a convolution operation, is also an important component of the neural network. The input data of the deep learning algorithm is fed into the initial nodes of the computation graph, the nodes complete their operations following the edges between them, and the computation graph outputs the final result of the deep learning.
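As an illustration of the structure just described, the following minimal Python sketch (not the patent's own data structures; all names are assumptions) represents operation nodes and the directed edges between them:

```python
from dataclasses import dataclass, field

@dataclass
class OpNode:
    name: str                                    # e.g. "CONV-0"
    op_type: str                                 # operation performed, e.g. "convolution"
    inputs: list = field(default_factory=list)   # names of upstream nodes (directed edges)

# A tiny directed computation graph: two loads feed a convolution.
graph = [
    OpNode("LOAD input0", "load"),
    OpNode("LOAD synapse0", "load"),
    OpNode("CONV-0", "convolution", inputs=["LOAD input0", "LOAD synapse0"]),
]
```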
In the deep learning process, since the nodes in the computation graph may represent operations performed during deep learning, in one possible implementation, instructions may be generated based on the nodes in the computation graph. Because the computation graph may be a directed graph, the operations represented by its nodes may have an execution order; therefore, when generating instructions based on the computation graph, corresponding instructions may be generated for the nodes sequentially, following the direction of the graph.
However, the generated instructions are ultimately executed by hardware devices, whose structure and manner of data storage may differ. Some hardware devices therefore adopt a double-buffering technique when executing instructions: for computations between data segments that have no dependency relationship, part of the memory access or computation can be executed in parallel through double buffering. In that case, if instructions are generated strictly sequentially for the nodes along the direction of the computation graph, the hardware device cannot conveniently perform parallel memory access or computation, which greatly reduces computation and memory-access speed and thus the operating and working efficiency of the device.
In this example, after the computation graph is received and before instructions are generated, the nodes in the computation graph are numbered according to the operation execution order; nodes meeting the parallel execution condition are then merged into parallel units, and each merged parallel unit is treated as a whole and numbered according to the operation execution order in the computation graph. The operations of nodes meeting the parallel execution condition can be carried out simultaneously by the hardware device, without affecting one another. After the nodes meeting the parallel execution condition have been merged into parallel units, instructions may be generated for the parallel units first when generating instructions from the merged computation graph; if there are multiple parallel units, instructions may be generated for them sequentially in the order of their numbers. When generating instructions for a parallel unit, parallel instructions can be generated according to the nodes it contains; that is, for different nodes in the same parallel unit, the instructions occupy the same position in the execution order and can be executed at the same time. After instructions have been generated for the parallel units, instructions may be generated for the remaining nodes in sequence, in the order of their numbers in the computation graph. In this technical scheme, nodes whose operations can be executed in parallel are combined into a parallel unit, and parallel instructions are generated from the parallel unit, so that the hardware device can simultaneously execute parallel operations that do not interfere with each other. This makes parallel memory access and computation convenient, greatly increases the computation and memory-access speed of the whole hardware device, and thus improves the operating and working efficiency of the device.
FIG. 1 shows a flow diagram of an instruction generation method according to an embodiment of the present disclosure. The instruction generation method may be executed by a terminal device or other processing device, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the instruction generation method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in FIG. 1, the method may include:
in step S11, the calculation map is received.
Step S12, counting the scheduling nodes in the calculation graph to obtain a first scheduling set.
Step S13, merging the parallel nodes in the first scheduling set into a parallel scheduling unit, to obtain a second scheduling set including the parallel scheduling unit, where the parallel nodes are scheduling nodes meeting parallel execution conditions.
In step S14, an instruction is generated according to the second scheduling set.
In the above steps, in one possible implementation, a scheduling node in the computation graph may represent a single operation node in the graph. In another possible implementation, a scheduling node may represent a set of operation nodes formed from multiple operation nodes in the computation graph that can be executed sequentially; the number and content of the operation nodes in such a set are not limited and can be determined flexibly according to the actual situation. In one example, in a directed graph, a set of sequentially executable operation nodes may be scheduled as one scheduling node; such nodes may belong to an architecture of low computation granularity, and the generated instructions may form a continuous segment of computation instructions that can be saved in a sequential container, where in one example the segment consists of scalar instructions. For example, a convolution operation may be carried out by several connected operation nodes in the computation graph that perform matrix calculations. These connected matrix-calculation nodes may collectively be regarded as one scheduling node, and an instruction may then be generated from this scheduling node; that is, the convolution operation is implemented by instructions of matrix granularity. Because the matrix operations all execute on the matrix operation unit, they can be executed sequentially; in this case, the convolution instructions can be loaded into a sequential container, and the container can be executed and scheduled by hardware as one operation node of the convolution operator.
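The sequential-container idea can be pictured with the following hypothetical sketch; the mnemonic names are invented for illustration and do not come from the patent:

```python
# One convolution scheduling node, stored as an ordered (sequential) container
# of matrix-granularity instructions that hardware executes back to back.
conv_scheduling_node = [
    "MLOAD  m0, input0",     # fetch an input tile into the matrix unit
    "MLOAD  m1, synapse0",   # fetch the weights
    "MMUL   m2, m0, m1",     # matrix multiply
    "MSTORE m2, output0",    # write the result tile
]
# The whole container is scheduled by hardware as a single operation node
# of the convolution operator.
```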
Because a scheduling node may be a single operation node in the computation graph or a set of operation nodes, the number of scheduling nodes in the first scheduling set obtained by counting may be the same as or different from the number of operation nodes in the computation graph.
A parallel node may be a scheduling node meeting a parallel execution condition. The content of the parallel execution condition is not limited: any condition under which hardware can carry out the operations of two scheduling nodes at the same time may serve as a parallel execution condition. In one possible implementation, the parallel nodes being scheduling nodes meeting the parallel execution condition may include:
when the operations corresponding to different parallel nodes are executed, the resources used to execute the operations do not coincide.
Since a hardware device consumes hardware resources, such as occupying an arithmetic unit, when executing the operation corresponding to a scheduling node, in one possible implementation it can be determined whether scheduling nodes meet the parallel condition by judging whether the resources they use when executing their operations coincide. In one example, scheduling nodes may be judged to be parallel nodes when the resources used to execute the operations corresponding to the different scheduling nodes do not coincide.
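A minimal sketch of this resource-disjointness test, assuming each scheduling node declares the abstract hardware resources its operation occupies (the resource names are illustrative, not the patent's definitions):

```python
def can_execute_in_parallel(node_a, node_b):
    # Two scheduling nodes qualify as parallel nodes only if the resource
    # sets their operations occupy do not overlap, so the hardware can run
    # both at the same time without mutual influence.
    return node_a["resources"].isdisjoint(node_b["resources"])

load_node = {"name": "LOAD input1", "resources": {"dma_engine"}}
conv_node = {"name": "CONV-0", "resources": {"matrix_unit"}}
print(can_execute_in_parallel(load_node, conv_node))  # True: disjoint resources
```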
A scheduling node may be a single operation node in the computation graph or a set of sequentially executable operation nodes, and a parallel node is a scheduling node meeting the parallel execution condition. Therefore, in one possible implementation, the parallel nodes may be N operation nodes in the computation graph that meet the parallel execution condition, where N is not limited and can be determined flexibly according to the actual situation. In one example, the parallel nodes may be 2 such operation nodes, denoted operation node A and operation node B, whose corresponding operations can be executed simultaneously by the hardware device without affecting each other. In another example, the parallel nodes may be 5 such operation nodes, denoted operation nodes A through E, whose 5 corresponding operations can likewise be executed simultaneously without mutual influence.
In one possible implementation, the parallel nodes may also be N operation node sets, each formed from several sequentially executable operation nodes; neither N nor the number of operation nodes in each set is limited, and both are determined flexibly according to the actual situation. In one example, the parallel nodes may be 2 such sets meeting the parallel execution condition, denoted operation node set A (operation nodes 1 and 2, executed in sequence) and operation node set B (operation nodes 3, 4, and 5, executed in sequence). Since sets A and B meet the parallel execution condition, the hardware device can carry out the operations of both sets simultaneously, and the two sets do not influence each other while executing. In another example, the parallel nodes may be 3 such sets: operation node set A (operation nodes 1 and 2), operation node set B (operation nodes 3, 4, and 5), and operation node set C (operation nodes 6 through 10); each set executes its member operations in sequence, and because the three sets meet the parallel execution condition, the hardware device can carry out their operations simultaneously without mutual influence. In one possible implementation, the parallel nodes may simultaneously include N1 operation nodes and N2 operation node sets that meet the parallel execution condition; N1, N2, and the number of operation nodes in each set are again not limited and are determined flexibly according to the actual situation. In one example, the parallel nodes may be operation node A and operation node set B (operation nodes 1 and 2, executed in sequence): since node A and set B meet the parallel execution condition, the hardware device can execute the operation of node A and the operations of nodes 1 and 2 simultaneously, and they do not affect each other. In another example, the parallel nodes may be operation nodes A and B plus operation node set C (operation nodes 1, 2, and 3, executed in sequence); since A, B, and C meet the parallel execution condition, the hardware device can carry out all three simultaneously without mutual influence.
According to the instruction generation method, instruction generation apparatus, and related products of the above aspects of the present disclosure, instructions that can be executed in parallel can be generated, facilitating parallel processing and computation of data segments without dependency relationships and improving the efficiency of data processing and computation.
In the above disclosed embodiment, the implementation of step S11 is not limited: any manner capable of receiving the computation graph can serve as an implementation of step S11, and they are not enumerated here. The implementation of step S12 is likewise not limited; as proposed above, a scheduling node may be a single operation node in the computation graph or a set of sequentially executable operation nodes, so the implementation of step S12 may be adapted to the different forms of scheduling node. FIG. 2 shows a flowchart of an instruction generation method according to an embodiment of the present disclosure; as shown in the figure, in one possible implementation, step S12 may include:
step S121, counting the scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions.
And step S122, obtaining a first scheduling set according to all scheduling nodes.
In the above-mentioned embodiment, it has been proposed that the scheduling node may correspond to one operation node or multiple operation nodes in the computational graph, and therefore, in a possible implementation manner, statistics on the scheduling node in the computational graph may be completed by determining whether the operation node in the computational graph meets the scheduling condition.
The specific content of the scheduling condition may be set according to the actual situation, and in one possible implementation, the scheduling node being an operation node meeting the scheduling condition may include:
when the operation corresponding to the scheduling node is executed, the resources for executing the operation are in an idle state, and all dependent operation nodes of the scheduling node have already generated instructions, where a dependent operation node is an operation node whose corresponding output data is related to the input data corresponding to the scheduling node.
As can be seen from the foregoing disclosure, in one possible implementation, judging whether one or more operation nodes in the computation graph can serve as a scheduling node mainly involves two aspects. The first is that, for all the operation nodes included in the scheduling node, all the resources required to execute their operations are available. Specifically, whether all the required resources are available can be determined by checking a resource table. In one possible implementation, the resource table may be used to track the usage of resources during instruction generation. In one example, all hardware resources may be defined in an abstract manner, so that each time a scheduling node is selected to execute its corresponding operation, the hardware resources change correspondingly; for instance, while the operation corresponding to a certain scheduling node is executed, some arithmetic unit among the hardware resources may be occupied. The abstractly defined hardware resources can therefore be integrated into the resource table, and whether all the resources required by a scheduling node are available can be checked by consulting the table. If the check shows that all the required resources are in an available state, the scheduling node can be considered to satisfy this aspect of the scheduling condition.
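One possible shape for such a resource table is sketched below; the resource names and helper functions are assumptions for illustration, not the patent's definitions:

```python
# The resource table tracks which abstract hardware resources are free
# while instructions are being generated.
resource_table = {"matrix_unit": True, "vector_unit": True, "dma_engine": True}

def all_resources_available(node, table):
    # This half of the scheduling condition holds only when every resource
    # the node's operation needs is currently free.
    return all(table.get(r, False) for r in node["resources"])

def occupy(node, table):
    for r in node["resources"]:
        table[r] = False      # resource busy while the operation executes

def release(node, table):
    for r in node["resources"]:
        table[r] = True       # resource free again after instruction emission
```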
The other aspect of the scheduling condition that must be met when judging whether one or more operation nodes in the computation graph can serve as a scheduling node is that the scheduling node has no outstanding dependent nodes. In one possible implementation, the output data corresponding to a dependent node is related to the input data corresponding to the scheduling node; in one example, when the operation corresponding to a dependent node of the scheduling node is executed, the output data obtained may be the input data required when the operation corresponding to the scheduling node is executed, or preparatory data for that input data. The type and number of dependent nodes of a scheduling node are not limited and are determined by the actual situation of the computation graph. FIG. 3 shows a computation graph according to an embodiment of the present disclosure. As shown in the figure, in one example, for a scheduling node consisting of a single operation node, its dependent nodes may be the operation nodes directly or indirectly connected to its input. For example, for the CONV-0 node in the figure, the operation corresponding to CONV-0 can be executed only after the operations corresponding to the LOAD input0 node and the LOAD synapse0 node have been executed, so the LOAD input0 node and the LOAD synapse0 node may be dependent nodes of the CONV-0 node. In addition, as can be seen from the figure, before the operation corresponding to the LOAD input0 node is executed, the operation corresponding to the ALLOC input0 node must be executed, so the ALLOC input0 node may be a dependent node of the LOAD input0 node and also a dependent node of the CONV-0 node; similarly, the ALLOC synapse0 node can serve as a dependent node of both the LOAD synapse0 node and the CONV-0 node. Meanwhile, the CONV-0 node can simultaneously serve as a dependent node of the VADD-0 node, the REL input0 node, and the REL synapse0 node. In one example, if a scheduling node is formed from several operation nodes together, its dependent nodes may be the operation nodes directly or indirectly connected to the inputs of all the operation nodes of the scheduling node. As shown in FIG. 3, in one example, if the LOAD input0 node and the CONV-0 node are together regarded as one scheduling node, then since the input of the LOAD input0 node is related to the ALLOC input0 node, and the input of the CONV-0 node is related to the LOAD synapse0 node and the ALLOC synapse0 node, the ALLOC input0 node, the ALLOC synapse0 node, and the LOAD synapse0 node may all be regarded as dependent nodes of this scheduling node. In another example, if the CONV-0 node and the VADD-0 node are together regarded as one scheduling node, then since the input of the VADD-0 node is related to the CONV-0 node, that is, to an operation node included in the scheduling node itself, the dependent node of the VADD-0 node is not a dependent node of the scheduling node; only the dependent nodes of the CONV-0 node are dependent nodes of the scheduling node. Therefore, the dependent nodes of the scheduling node formed by the CONV-0 node and the VADD-0 node together are the ALLOC input0 node, the ALLOC synapse0 node, the LOAD input0 node, and the LOAD synapse0 node.
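The dependent-node rule illustrated with FIG. 3 can be expressed as a small sketch: the dependents of a scheduling node are the direct and indirect predecessors of its member operation nodes, excluding members of the scheduling node itself. The predecessor map below is a hand-written fragment of FIG. 3, given as an assumption:

```python
predecessors = {
    "LOAD input0":   ["ALLOC input0"],
    "LOAD synapse0": ["ALLOC synapse0"],
    "CONV-0":        ["LOAD input0", "LOAD synapse0"],
    "VADD-0":        ["CONV-0"],
}

def dependent_nodes(members):
    deps = set()
    stack = [p for m in members for p in predecessors.get(m, [])]
    while stack:
        n = stack.pop()
        if n not in deps and n not in members:   # in-group inputs do not count
            deps.add(n)
            stack.extend(predecessors.get(n, []))
    return deps

# CONV-0 and VADD-0 taken together depend only on the LOAD and ALLOC nodes:
print(dependent_nodes({"CONV-0", "VADD-0"}))
```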
Therefore, an operation node or node set satisfying both of the above conditions can be counted as a scheduling node, and the corresponding first scheduling set is then obtained through step S122.
The implementation of step S122 is likewise not limited: any manner that can integrate all the counted scheduling nodes into the first scheduling set can serve as an implementation of step S122. In one possible implementation, step S122 may include:
labeling the scheduling nodes according to the direction of the computation graph to obtain a first scheduling set formed by the scheduling nodes in label order.
For a directed computation graph, the operation corresponding to each operation node has an execution order that follows the direction of the graph. Therefore, in one possible implementation, the scheduling nodes counted in the computation graph may be labeled according to the direction of the graph, forming a first scheduling set in label order, so that the execution order of the operations corresponding to the scheduling nodes can be determined from their labels, and in turn the order of the subsequently generated instructions. The manner of labeling is not limited. In one example, the scheduling nodes may be labeled from small to large along the direction of the computation graph, giving a first scheduling set of the form S = {u1, u2, ..., un}, where u1, u2, ..., un are the scheduling nodes labeled in increasing order along the graph direction. In this set S, the smaller a scheduling node's label, the earlier it was admitted into the first scheduling set, that is, the earlier its corresponding operation is executed; the operation corresponding to u1 is executed no later than the operation corresponding to u2. In another example, the scheduling nodes may be labeled from large to small along the direction of the computation graph, again giving a set of the form S = {u1, u2, ..., un}, but with labels assigned in decreasing order along the graph direction, so that the smaller the label, the later the scheduling node was admitted into the first scheduling set and the later its corresponding operation is executed; the operation corresponding to u1 is executed no earlier than the operation corresponding to u2.
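A small sketch of small-to-large labeling, using a plain topological order as the direction of the computation graph (graphlib is in the Python standard library; the graph fragment is illustrative):

```python
from graphlib import TopologicalSorter  # Python 3.9+

predecessors = {
    "LOAD input0":   {"ALLOC input0"},
    "LOAD synapse0": {"ALLOC synapse0"},
    "CONV-0":        {"LOAD input0", "LOAD synapse0"},
    "VADD-0":        {"CONV-0"},
}

# static_order() follows the graph direction, so enumerating it assigns
# labels u1, u2, ... from small to large: S = {u1, u2, ..., un}.
first_scheduling_set = list(TopologicalSorter(predecessors).static_order())
for label, node in enumerate(first_scheduling_set, start=1):
    print(f"u{label}: {node}")
```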
After the first scheduling set is obtained, the parallel nodes in the first scheduling set may be merged into parallel scheduling units through step S13, so as to obtain a second scheduling set including parallel scheduling units. The implementation manner of step S13 is not limited, and fig. 4 shows a flowchart of an instruction generation method according to an embodiment of the present disclosure, and as shown in the figure, in one possible implementation manner, step S13 may include:
step S131, merging the parallel nodes in the first scheduling set into a parallel scheduling unit.
And step S132, labeling the scheduling nodes and the parallel scheduling units in the first scheduling set according to the direction of the calculation graph to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
The implementation of step S131, that is, the manner of merging the parallel nodes in the first scheduling set into a parallel scheduling unit, is not limited: any merging manner that allows the parallel nodes in the first scheduling set to be scheduled as one whole scheduling unit can serve as an implementation of step S131. In one possible implementation, the parallel nodes may be counted into the same set, and this set used as a subset of the first scheduling set. As proposed above, a parallel node is a scheduling node meeting the parallel execution condition, and in one possible implementation this condition may be that the resources used to execute the operations corresponding to different parallel nodes do not coincide. Since neither the number of resources that can execute operations nor the kinds of simultaneous operations are limited, the number of parallel nodes that can be executed simultaneously can be determined flexibly according to the actual situation; accordingly, in the embodiment of the present disclosure, the number of parallel nodes contained in one parallel scheduling unit is not limited. Meanwhile, in one possible implementation, the first scheduling set may contain only one parallel scheduling unit, that is, the parallel nodes it contains can have their corresponding operations executed only in the same time slot. In another possible implementation, the first scheduling set may contain multiple parallel scheduling units: one part of the parallel nodes may execute simultaneously as one parallel scheduling unit in a certain time slot, another part may execute simultaneously as another parallel scheduling unit in another time slot, and operations corresponding to different parallel scheduling units may not be executed in parallel. How many parallel scheduling units there are depends on the actual situation of the scheduling nodes in the first scheduling set and is not limited here. Therefore, when step S131 is executed, in one possible implementation, the number of parallel scheduling units in the first scheduling set and the actual situation of the parallel nodes may be determined by traversing the first scheduling set. In another possible implementation, one part of the parallel nodes in the first scheduling set may be merged into one parallel scheduling unit and the subsequent instruction-generating operation executed, then another part merged into a new parallel scheduling unit and the subsequent instruction-generating operation executed, repeating until no mergeable parallel nodes remain in the scheduling set.
After the parallel scheduling units in the first scheduling set are obtained, the scheduling nodes and parallel scheduling units may be labeled in step S132 to obtain the second scheduling set. The specific implementation of step S132, that is, how the scheduling nodes and parallel scheduling units are labeled according to the direction of the computation graph, is not limited. In one example, the scheduling nodes and parallel scheduling units may be labeled from small to large along the direction of the computation graph to obtain the second scheduling set; in another example, they may be labeled from large to small along the graph direction. In a further example, they may be labeled by combining the labels of the first scheduling set itself with the direction of the computation graph. For instance, the first scheduling set may have the form S = {u1, u2, ..., un}, where n is not limited and is determined flexibly according to the actual situation. Suppose scheduling nodes ui and uj in S can be merged. Then, according to the order in which the merged nodes' corresponding operations are executed in the computation graph, the label k of the merged parallel scheduling unit within the second scheduling set is determined, and the labels of the other scheduling nodes are shifted correspondingly while keeping their original order, giving the merged second scheduling set S' = {u1, u2, ..., [ui, uj]k, ..., um}. The total number of scheduling nodes and scheduling units in the second scheduling set changes from the original n to m, where the specific values of n and m depend on the actual situation and are not limited.
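The merge-and-relabel step might look like the following sketch, where a parallel scheduling unit is modeled simply as a list nested inside the scheduling set; this is purely an illustration of the [ui, uj]k notation above, not the patent's concrete algorithm:

```python
def merge_parallel(schedule, i, j):
    """Merge schedule[i] and schedule[j] (0-based) into one parallel unit."""
    unit = [schedule[i], schedule[j]]                 # the parallel scheduling unit
    rest = [x for k, x in enumerate(schedule) if k not in (i, j)]
    rest.insert(i, unit)                              # unit takes the earlier slot;
    return rest                                       # later labels shift accordingly

S = ["u1", "u2", "u3", "u4"]                          # first scheduling set, n = 4
S2 = merge_parallel(S, 1, 2)                          # u2 and u3 are parallel nodes
print(S2)                                             # ['u1', ['u2', 'u3'], 'u4'], m = 3
```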
By merging the parallel nodes in the first scheduling set into parallel scheduling units and labeling the scheduling nodes and parallel scheduling units according to the direction of the computation graph, a second scheduling set formed by the scheduling nodes and parallel scheduling units in label order is obtained. This process effectively treats the operations in the computation graph that can be executed in parallel as a whole, so that when instructions are generated later, parallel instructions can be generated directly from the parallel scheduling units, without having to confirm after the fact which instructions can be executed in parallel. This improves the overall speed and efficiency of the instruction generation process and lays a good foundation for improving operation speed and efficiency.
After the second scheduling set is obtained, step S14 may be executed to generate instructions according to the second scheduling set. As can be seen from the above disclosed embodiments, the manner of instruction generation is not unique: instructions may be generated uniformly after all parallel scheduling units have been formed, or a parallel instruction may be generated each time a parallel scheduling unit is obtained by merging, after which the scheduling nodes are traversed again to determine a new parallel scheduling unit and generate further instructions. The implementation of step S14 is therefore not specifically limited and can be determined flexibly according to the actual situation. In one possible implementation, step S14 may include:
repeatedly executing the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction, where the operations include:
when a parallel scheduling unit capable of generating an instruction is included in the second scheduling set, an instruction corresponding to the parallel scheduling unit is generated.
when the second scheduling set contains a scheduling node capable of generating an instruction and does not contain a parallel scheduling unit capable of generating an instruction, an instruction corresponding to the scheduling node is generated.
As can be seen from the foregoing disclosure, in a possible implementation manner, the process of generating the instruction according to the second scheduling set may be that, when the second scheduling set includes the parallel scheduling unit, the parallel scheduling unit is preferentially selected to generate the corresponding parallel instruction, and then the instructions corresponding to the remaining scheduling nodes are generated.
Specifically, how to determine what parallel scheduling units and scheduling nodes the second scheduling set contains can be chosen flexibly according to the actual situation; in one possible implementation, the specific state of the second scheduling set may be determined from the labels of the scheduling nodes and parallel scheduling units it contains. In one example, as seen in the foregoing disclosed embodiments, the total number of scheduling nodes and scheduling units in the second scheduling set may be m, so instruction generation may proceed according to the value of m. In one example, the specific process of generating instructions through step S14 according to the value of m may be:
when m is equal to 0, it may indicate that the translation of the current computation graph has ended, that is, instructions corresponding to all operation nodes in the computation graph have been generated, and at this time, it may indicate that step S14 has ended, that is, the instruction generation process has been completed.
When m = 1, the current second scheduling set contains only a single object capable of generating an instruction; this object may be a scheduling node or a parallel scheduling unit. Whichever form it takes, this unique object is selected to generate its instruction and the resource table is updated. After its instruction is generated, the value of m changes from 1 to 0; the second scheduling set may then be updated, returning to the m = 0 case.
When m >1, it may indicate that the current scheduling set includes a plurality of objects that can generate instructions, and among the plurality of objects that can generate instructions, if a parallel scheduling unit is included, the parallel scheduling unit may be selected according to the order of the labels, and in the embodiment of the present disclosure, the parallel scheduling unit with the smallest label may be selected to generate a parallel instruction; if the parallel scheduling unit is not included, the scheduling node generation instruction may be selected according to the label sequence, and in the embodiment of the present disclosure, the scheduling node with the smallest label may be selected to generate the instruction until the value of m becomes 1, and at this time, the case where m is 1 may be returned.
Through this process, parallel instructions are generated preferentially during instruction generation, so that when the hardware executes operations according to the instructions, the parallel space can be used preferentially for operation and processing, improving the overall operating speed and efficiency of the hardware device.
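The m = 0, m = 1, and m > 1 cases above can be condensed into one driver loop. The sketch below assumes the nested-list representation from the earlier sketches and invents the instruction tuples; readiness checks (dependencies, resource table) are elided, with every remaining element assumed able to generate an instruction:

```python
def generate_instructions(second_set):
    program = []
    while second_set:                                  # stops when m == 0
        units = [x for x in second_set if isinstance(x, list)]
        if units:
            chosen = units[0]                          # smallest-label parallel unit first
            program.append(("PARALLEL", chosen))       # one parallel instruction
        else:
            chosen = second_set[0]                     # smallest-label scheduling node
            program.append(("SEQUENTIAL", chosen))
        second_set.remove(chosen)                      # update the set (a full version
    return program                                     # would also update the resource table)

print(generate_instructions(["u1", ["u2", "u3"], "u4"]))
# -> [('PARALLEL', ['u2', 'u3']), ('SEQUENTIAL', 'u1'), ('SEQUENTIAL', 'u4')]
```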
In the above disclosed embodiment, the instruction corresponding to the parallel scheduling unit is generated according to the second scheduling set; the specific generation process can be determined flexibly according to the actual situation of the second scheduling set and the computing capability of the hardware device, and is not limited to the following disclosed embodiments. In one possible implementation, when a parallel scheduling unit capable of generating an instruction is included in the second scheduling set, generating an instruction corresponding to the parallel scheduling unit may include:
when a parallel scheduling unit capable of generating an instruction is included in the second scheduling set, a parallel instruction corresponding to the parallel scheduling unit is generated according to the label order of the parallel scheduling unit.
As proposed above, in the process of merging parallel nodes into parallel scheduling units to generate the second scheduling set, the scheduling nodes and parallel scheduling units in the first scheduling set may be labeled according to the direction of the computation graph; the labels of the parallel scheduling units in the second scheduling set can therefore indicate the execution order of their corresponding instructions. In one possible implementation, the second scheduling set may contain multiple parallel scheduling units; in this case, to guarantee the ordering of instruction execution, the execution order of the parallel instructions corresponding to the parallel scheduling units may be determined from the label order of the units, so that the parallel instructions are generated sequentially. Because the correspondence between the label order of the parallel scheduling units and the execution order of their instructions is not unique, how the parallel instructions are generated from the label order is likewise not limited. In one example, the second scheduling set may be labeled from small to large along the direction of the computation graph, so that the smaller a parallel scheduling unit's label, the earlier it was admitted into the second scheduling set and the earlier its corresponding instruction is executed; the parallel instructions for the units are then generated sequentially in increasing label order. In another example, the scheduling nodes and parallel scheduling units may be labeled from large to small along the graph direction, so that the smaller a unit's label, the later it was admitted into the second scheduling set and the later its corresponding instruction is executed; the parallel instructions are then generated sequentially in decreasing label order.
The specific manner of generating the parallel instructions corresponding to the parallel scheduling units is not limited and can be selected flexibly according to the actual situation.
When parallel scheduling units capable of generating instructions are included in the second scheduling set, the parallel instructions corresponding to the parallel scheduling units are generated according to the label order of the units, so that when the second scheduling set contains multiple parallel scheduling units, the parallel instructions can be generated in an orderly way in the execution order corresponding to the computation graph, improving the efficiency of instruction generation.
In the above disclosed embodiment, when the second scheduling set contains no parallel scheduling unit capable of generating an instruction, the instruction corresponding to a scheduling node is generated according to the second scheduling set; the specific generation process can be determined flexibly according to the actual situation of the second scheduling set, the computing capability of the hardware device, and so on, and is not limited to the following disclosed embodiments. In one possible implementation, when the second scheduling set contains a scheduling node capable of generating an instruction and does not contain a parallel scheduling unit capable of generating an instruction, generating the instruction corresponding to the scheduling node may include:
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
As proposed above, in the process of merging parallel nodes into parallel scheduling units to generate the second scheduling set, the scheduling nodes and parallel scheduling units in the first scheduling set may be labeled according to the direction of the computation graph; the labels of the scheduling nodes in the second scheduling set can therefore indicate the execution order of their corresponding instructions. In one possible implementation, the second scheduling set may contain only one scheduling node; in this case, the label of the scheduling node, that is, the execution order of its corresponding instruction, need not be considered, and the instruction is generated directly from the scheduling node. In another possible implementation, the second scheduling set may contain multiple scheduling nodes; to guarantee the ordering of instruction execution, the execution order of the instructions corresponding to the scheduling nodes may be determined from the label order of the nodes, so that the instructions are generated sequentially. Because the correspondence between the label order of the scheduling nodes and the execution order of their instructions is not unique, how the instructions are generated from the label order is likewise not limited. In one example, the second scheduling set may be labeled from small to large along the direction of the computation graph, so that the smaller a scheduling node's label, the earlier it was admitted into the second scheduling set and the earlier its corresponding instruction is executed; the instructions for the scheduling nodes are then generated sequentially in increasing label order. In another example, the scheduling nodes and parallel scheduling units may be labeled from large to small along the graph direction, so that the smaller a node's label, the later it was admitted into the second scheduling set and the later its corresponding instruction is executed; the instructions are then generated sequentially in decreasing label order.
The specific manner of generating the instruction corresponding to a scheduling node is likewise not limited and can be flexibly selected according to actual conditions. In one example, a scheduling node may correspond to only one operation node in the computation graph, so that when the corresponding instruction is generated from the scheduling node, only the execution instruction corresponding to that operation node is generated. In another example, a scheduling node may include a plurality of sequentially executed operation nodes in the computation graph, so that when the corresponding instructions are generated from the scheduling node, a series of sequential instructions corresponding to those operation nodes is generated.
When the second scheduling set includes scheduling nodes capable of generating instructions and does not include parallel scheduling units capable of generating instructions, the instructions corresponding to the scheduling nodes are generated according to the label order of the scheduling nodes. In this way, when the second scheduling set includes a plurality of remaining scheduling nodes, sequential instructions can be generated in the execution order corresponding to the computation graph, improving the efficiency of instruction generation.
Application example
In the application example of the present disclosure, instruction generation can be performed for high-level operators implemented in terms of low-level operators: these operators are first expanded into loop form, then given corresponding optimized scheduling, and finally translated into instructions. In the application example of the present disclosure, the scheduling process may be optimized using the double buffering technique. Deep learning algorithms are both computation-intensive and memory-access-intensive; after segmentation, the computations on different data segments have no dependency relationship, so double buffering can be used to increase the parallelism between memory access and computation. Without double buffering, instruction generation is relatively simple and can be regarded directly as loop unrolling. When the loop is expanded, two modes are available: for an architecture supporting conditional jumps, the loop can be implemented with a jump instruction; for an architecture that does not support conditional jumps, since the segment counts are determined at this point, the loop can be fully unrolled, with the loop variable replaced by its concrete value in each expanded copy. A sketch of the latter mode follows.
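The following is a minimal Python sketch of the no-conditional-jump mode described above; the statement templates and the LOAD/CONV/STORE mnemonics are illustrative assumptions, not an actual instruction set:

```python
# Full loop unrolling for an architecture without conditional jumps: the
# segment count is known after segmentation, so the loop body is copied
# once per segment with the loop variable replaced by its concrete value.

def unroll(body_template, num_segments):
    instructions = []
    for i in range(num_segments):
        for stmt in body_template:
            instructions.append(stmt.format(i=i))  # substitute loop variable
    return instructions

# Hypothetical segmented load-compute-store body.
body = ["LOAD  input[{i}]",
        "CONV  input[{i}] -> output[{i}]",
        "STORE output[{i}]"]
for ins in unroll(body, 3):
    print(ins)
```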
In order to implement the double buffering technique, a graph-based method is proposed in the application example of the present disclosure: every two iterations are merged, and the data dependency graph is topologically sorted to determine the execution order of the instructions. The main process may include the following steps:
The double buffering technique may divide a block of memory into two portions and then execute the memory-access and computation operations of two adjacent loop iterations in parallel. This is done by merging two adjacent iterations. Fig. 5 is a schematic diagram of an application example according to the present disclosure. As shown there, a loop that needs double buffering optimization (corresponding to the source code in fig. 5) may be obtained according to a primitive identified by the user. Expanding the iterations merges two iterations into one, modifying the addresses (variable names) of the statements (corresponding to the reduced loop body in fig. 5); both inputs are allocated on the input buffer, which in the application example of the present disclosure may be input1 and input2. While input1 is used for computation, input2 can be loaded.
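Under the same assumptions (illustrative mnemonics and buffer names mirroring the figure), the merged iteration could be emitted roughly as in the following sketch, which also assumes an even iteration count after segmentation:

```python
# Two adjacent iterations merged for double buffering: even iterations use
# input1 and odd iterations use input2, so the load into one buffer can
# overlap the computation on the other.

def merge_iterations(num_iterations):
    stmts = []
    for i in range(0, num_iterations, 2):       # two iterations per pass
        stmts += [
            f"LOAD  segment[{i}]   -> input1",
            f"LOAD  segment[{i+1}] -> input2",  # may overlap CONV on input1
            f"CONV  input1 -> out[{i}]",
            f"CONV  input2 -> out[{i+1}]",
        ]
    return stmts

for s in merge_iterations(4):
    print(s)
```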
Then, the code of the loop body is translated into a directed graph representing the dependencies between data. The two iterations are translated into two separate graphs (corresponding to the data flow graphs in fig. 5). Since the two directed graphs are generated from the same piece of code, their topologies are identical; they differ only in addresses and iteration numbers. After the computation graphs are generated, instructions may be generated from the two graphs. In order to make the instructions generated from the computation graph effectively applicable to a hardware device adopting the double buffering technique, an instruction generation method is proposed in the application example of the present disclosure.
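A minimal sketch of such a translation is given below; it assumes each statement is described by the values it reads and writes (the statement tuples are hypothetical) and adds an edge from the most recent producer of a value to each of its consumers:

```python
from collections import defaultdict

def build_graph(stmts):
    """stmts: list of (name, reads, writes); returns an adjacency list."""
    edges = defaultdict(list)
    last_writer = {}                      # value -> statement that produced it
    for name, reads, writes in stmts:
        for v in reads:
            if v in last_writer:          # read-after-write dependency
                edges[last_writer[v]].append(name)
        for v in writes:
            last_writer[v] = name
    return edges

stmts = [
    ("LOAD input1",  [],                    ["input1"]),
    ("LOAD synapse", [],                    ["synapse"]),
    ("CONV 0",       ["input1", "synapse"], ["out0"]),
    ("STORE out0",   ["out0"],              []),
]
print(dict(build_graph(stmts)))
```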
The instruction generation method provided in the application example of the present disclosure may use three kinds of data structures in its implementation, with the following specific definitions:
The sliding window: in the application example of the present disclosure, the size of the window may be 2, matching the two iterations that use on-chip resources at the same time and therefore need to be allocated at the same time. After the instructions in an iteration have all been emitted, the window automatically slides back by one iteration. As shown in fig. 5, iteration 0 and iteration 1 are both in the window; after the last node in directed graph 0 (RELEASE graph 0) is released, the window may be slid back one iteration, that is, directed graph 2 is placed in the window. Using the sliding window saves the space of keeping many directed graphs alive, and because the directed graphs in the application example of the present disclosure all have the same structure, the directed graph structure can be directly reused without repeatedly constructing and deleting graphs.
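A minimal sketch of this window, under the assumption that graphs are identified by iteration number and that the freed graph structure is reused in place:

```python
class SlidingWindow:
    """Keeps two iterations resident; slides when the oldest is released."""

    def __init__(self, total_iterations, size=2):
        self.total = total_iterations
        self.resident = list(range(min(size, total_iterations)))  # e.g. [0, 1]

    def slide(self):
        # Called after the last node (e.g. RELEASE) of the oldest
        # resident iteration has been emitted.
        done = self.resident.pop(0)
        nxt = done + len(self.resident) + 1
        if nxt < self.total:
            self.resident.append(nxt)    # reuse the freed graph structure
        return self.resident

w = SlidingWindow(total_iterations=4)
print(w.slide())   # [1, 2]: iteration 0 released, graph 2 enters the window
```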
The sequential container: in the application example of the present disclosure, this may be the atomic structure in instruction scheduling optimization and can be regarded as one sequential code segment. During scheduling, the instructions in a sequential container are generated in order. In the directed graph, a sequential container can be scheduled as a single scheduling node. Architectures with very low computational granularity may use a sequential container to hold a block of computation instructions, which may be composed of scalar operations. For example, a convolution operation implemented with matrix-granularity instructions requires those instructions to execute sequentially, because they are performed on the matrix arithmetic unit; in this case, the convolution statements are loaded into a sequential container, and that container can be scheduled as the scheduling node for the convolution operator.
The parallel container: this may be used to store statements or sequential containers that can be executed in parallel. All contents of a parallel container are translated for parallel execution, as in the sketch below.
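The two container types could be modeled as follows; the emit interface and the class shapes are assumptions made for illustration:

```python
class SequentialContainer:
    """Atomic unit of scheduling: one ordered code segment."""

    def __init__(self, stmts):
        self.stmts = list(stmts)

    def emit(self):
        return list(self.stmts)          # order must be preserved

class ParallelContainer:
    """Holds statements or sequential containers that may run in parallel."""

    def __init__(self, items):
        self.items = list(items)

    def emit(self):
        out = []
        for item in self.items:          # all contents translated for
            if hasattr(item, "emit"):    # parallel execution
                out += item.emit()
            else:
                out.append(item)
        return out

conv = SequentialContainer(["CONV row0", "CONV row1"])  # hypothetical ops
unit = ParallelContainer([conv, "LOAD input2"])
print(unit.emit())
```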
The resource table: this may be used to track resource usage during instruction generation. Since all hardware resources in the application example of the present disclosure are defined abstractly, the resources change each time an atomic operation is selected (for example, an arithmetic unit becomes occupied); instructions that can be executed in parallel can therefore be found during scheduling from the usage state of the resources. A sketch follows.
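A minimal sketch of the resource table, with hypothetical abstract resource names; two operations can run in parallel only if their required resource sets are disjoint and free:

```python
class ResourceTable:
    def __init__(self, resources=("conv_unit", "io_unit")):
        self.busy = {r: False for r in resources}

    def available(self, needed):
        """True if every resource an operation needs is currently free."""
        return all(not self.busy[r] for r in needed)

    def acquire(self, needed):
        for r in needed:
            self.busy[r] = True          # marked occupied when selected

    def release(self, needed):
        for r in needed:
            self.busy[r] = False         # freed when the operation completes
```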
Based on the above data structures, the specific process of translating a directed graph into an instruction sequence in the application example of the present disclosure may be described as follows:
Find the available scheduling nodes. In this process, a scheduling set may be constructed; in the application example of the present disclosure, the scheduling set may be S = {u1, u2, ..., un}, made up of the available scheduling nodes. In the application example of the present disclosure, an available scheduling node needs to satisfy two conditions: first, all resources required by the scheduling node are available, which can be determined by checking the resource table; second, the scheduling node has no outstanding dependent nodes. For example, as shown in fig. 5, after LOAD input0 and LOAD synapse0 are executed, Conv-0 is ready, so Conv-0 satisfies the condition of having no outstanding dependent nodes; the arithmetic unit is also available at this time, so Conv-0 satisfies the condition that all required resources are available, and Conv-0 is therefore an available scheduling node. A sketch of this check follows.
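The availability check could look like the following sketch, reusing the ResourceTable sketch above; the .deps and .resources node fields are hypothetical:

```python
def available_nodes(graph_nodes, resource_table, generated):
    """Collect the scheduling set S of currently available nodes."""
    S = []
    for node in graph_nodes:
        # Condition 2: every dependency has already generated instructions.
        deps_done = all(dep in generated for dep in node.deps)
        # Condition 1: all required resources are free per the resource table.
        if deps_done and resource_table.available(node.resources):
            S.append(node)
    return S
```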
Find the scheduling nodes executable in parallel. From the constructed scheduling set, the scheduling nodes corresponding to instructions that can be executed in parallel can be identified and then fused into parallel scheduling units. A specific implementation is to find, by checking the resource table, two arithmetic units that can run simultaneously; the scheduling nodes corresponding to the operations executed on those two units can then be merged into one parallel scheduling unit. For example, CONV and LOAD use completely different resources and can therefore be executed in parallel. As shown in fig. 5, Conv-1 and LOAD input2 are executable in parallel and are placed in a parallel container as one parallel scheduling unit. In this case, the original set S is converted into a set S' = {u1, u2, ..., [ui, uj]k, ..., um}, where ui and uj are combined into one parallel scheduling unit, and the total number of scheduling nodes and parallel scheduling units changes from n to m. A sketch of this fusion follows.
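A sketch of the fusion step under the same assumptions: nodes whose resource sets are disjoint (for example CONV on the arithmetic unit and LOAD on the I/O unit) are paired into a parallel scheduling unit:

```python
def merge_parallel(S):
    """Turn S into S': pair resource-disjoint nodes into parallel units."""
    S_prime, used = [], set()
    for i, u in enumerate(S):
        if i in used:
            continue
        for j in range(i + 1, len(S)):
            if j not in used and not set(u.resources) & set(S[j].resources):
                S_prime.append([u, S[j]])   # one parallel scheduling unit
                used.update({i, j})
                break
        else:
            S_prime.append(u)               # stays a single scheduling node
    return S_prime
```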
Select one scheduling unit. In this process, a scheduling node or a parallel scheduling unit may be selected and pushed into the instruction sequence to generate instructions. For the set S', there are three possible states:
m = 0: this indicates that the translation of the current computation graph is finished, and this step can end directly.
m = 1: this indicates that there is only one selectable scheduling object (a scheduling node or a parallel scheduling unit); the unique scheduling object is selected, the resource table is updated, the set is updated, and the process falls back toward the m = 0 state.
m > 1: in this case, if parallel scheduling units exist, the parallel scheduling unit with the smallest label may be selected (in the application example of the present disclosure, a smaller label indicates earlier receipt into the set); otherwise, the scheduling node with the smallest label is selected. When only one scheduling object in the set can still generate an instruction, the process falls back to the m = 1 state. A sketch of this selection loop follows.
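The selection loop could then tie the sketches above together as follows; it assumes nodes carry a .label assigned on receipt into the set and, for brevity, omits resource acquisition and release, so it is illustrative only:

```python
def label(u):
    """Smallest label inside a unit; u is a node or a [node, node] pair."""
    return min(n.label for n in u) if isinstance(u, list) else u.label

def schedule(graph_nodes, resource_table):
    generated, sequence = set(), []
    while True:
        S = merge_parallel(
            available_nodes(graph_nodes, resource_table, generated))
        if not S:                           # m = 0: translation finished
            break
        units = [u for u in S if isinstance(u, list)]
        if units:                           # prefer the parallel unit with
            pick = min(units, key=label)    # the smallest label
        else:                               # otherwise the smallest-label node
            pick = min(S, key=label)
        sequence.append(pick)               # push into the instruction sequence
        for node in (pick if isinstance(pick, list) else [pick]):
            generated.add(node)
    return sequence
```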
Fig. 6 shows a block diagram of an instruction generation apparatus according to an embodiment of the present disclosure. As shown, the apparatus 20 includes: a receiving unit 21 configured to receive a computation graph; a statistical unit 22 configured to count the scheduling nodes in the computation graph to obtain a first scheduling set; a merging unit 23 configured to merge parallel nodes in the first scheduling set into parallel scheduling units to obtain a second scheduling set including the parallel scheduling units, where the parallel nodes are scheduling nodes meeting parallel execution conditions; and an instruction generating unit 24 configured to generate instructions according to the second scheduling set.
In one possible implementation, the statistical unit is configured to: counting scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions; and obtaining a first scheduling set according to all scheduling nodes.
In a possible implementation manner, the scheduling node being an operation node meeting the scheduling condition includes: when the operation corresponding to the scheduling node is executed, the resource for executing the operation is in an idle state, and all dependent operation nodes of the scheduling node have generated instructions; the output data corresponding to a dependent operation node is related to the input data corresponding to the scheduling node.
In a possible implementation, the statistical unit is further configured to: and labeling the scheduling nodes according to the direction of the calculation graph to obtain a first scheduling set formed by the scheduling nodes corresponding to the labeling sequence.
In one possible implementation, the merging unit is configured to: merging parallel nodes in the first scheduling set into a parallel scheduling unit; and according to the direction of the calculation graph, labeling the scheduling nodes and the parallel scheduling units in the first scheduling set to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
In one possible implementation manner, the parallel node being a scheduling node meeting the parallel execution condition includes: when the operations corresponding to different parallel nodes are executed, the resources for executing the operations are different.
In one possible implementation, the instruction generating unit is configured to: repeatedly execute the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction; wherein the operations comprise: when the second scheduling set includes a parallel scheduling unit capable of generating instructions, generating the instructions corresponding to the parallel scheduling unit; and when the second scheduling set includes scheduling nodes capable of generating instructions and does not include parallel scheduling units capable of generating instructions, generating the instructions corresponding to the scheduling nodes.
In one possible implementation, the instruction generating unit is further configured to: when a parallel scheduling unit capable of generating an instruction is included in the second scheduling set, a parallel instruction corresponding to the parallel scheduling unit is generated according to the label order of the parallel scheduling unit.
In one possible implementation, the instruction generating unit is further configured to: and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
Fig. 7 is a block diagram illustrating an instruction generation apparatus 1300 according to an example embodiment. For example, the apparatus 1300 may be provided as a server. Referring to fig. 7, apparatus 1300 includes a processing component 1322, which further includes one or more processors, and memory resources, represented by memory 1332, for storing instructions, such as application programs, that may be executed by processing component 1322. The application programs stored in memory 1332 may include one or more modules that each correspond to a set of instructions. Further, processing component 1322 is configured to execute instructions to perform the methods described above.
The apparatus 1300 may also include a power component 1326 configured to perform power management for the apparatus 1300, a wired or wireless network interface 1350 configured to connect the apparatus 1300 to a network, and an input/output (I/O) interface 1358. The apparatus 1300 may operate based on an operating system stored in the memory 1332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1332, is also provided that includes computer program instructions that are executable by the processing component 1322 of the apparatus 1300 to perform the methods described above.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and that electronic circuitry can execute the computer-readable program instructions, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing may be better understood in light of the following clauses:
Clause A1, a method of instruction generation, the method comprising:
receiving a calculation graph;
counting the scheduling nodes in the calculation graph to obtain a first scheduling set;
merging the parallel nodes in the first scheduling set into a parallel scheduling unit to obtain a second scheduling set comprising the parallel scheduling unit, wherein the parallel nodes are scheduling nodes meeting parallel execution conditions;
and generating an instruction according to the second scheduling set.
Clause A2, the method according to Clause A1, where the counting the scheduling nodes in the computation graph to obtain a first scheduling set includes:
counting scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions;
and obtaining a first scheduling set according to all the scheduling nodes.
Clause A3, the method of Clause A2, wherein the scheduling node being an operation node meeting the scheduling condition comprises:
when the operation corresponding to the scheduling node is executed, the resource for executing the operation is in an idle state, and all dependent operation nodes of the scheduling node generate instructions;
and the output data corresponding to the dependent operation node is related to the input data corresponding to the scheduling node.
Clause A4, the method of Clause A2, the obtaining a first scheduling set according to all of the scheduling nodes comprising:
and labeling the scheduling nodes according to the direction of the calculation graph to obtain a first scheduling set formed by the scheduling nodes corresponding to the labeling sequence.
Clause A5, the method of any one of Clauses A1 to A4, wherein merging parallel nodes in the first scheduling set into parallel scheduling units to obtain a second scheduling set including parallel scheduling units comprises:
merging parallel nodes in the first scheduling set into a parallel scheduling unit;
and labeling the scheduling nodes and the parallel scheduling units in the first scheduling set according to the direction of the calculation graph to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
Clause A6, the method of Clause A1, wherein the parallel node being a scheduling node meeting the parallel execution condition comprises:
when the operations corresponding to different parallel nodes are executed, the resources for executing the operations are different.
Clause A7, the method of Clause A1, the generating instructions from the second scheduling set comprising:
repeatedly executing the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction; wherein the operations comprise:
when a parallel scheduling unit capable of generating instructions is included in the second scheduling set, generating instructions corresponding to the parallel scheduling unit;
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes.
Clause A8, the method of Clause A7, wherein, when a parallel scheduling unit capable of generating instructions is included in the second scheduling set, generating the instructions corresponding to the parallel scheduling unit comprises:
and when the second scheduling set comprises a parallel scheduling unit capable of generating instructions, generating the parallel instructions corresponding to the parallel scheduling unit according to the label sequence of the parallel scheduling unit.
Clause A9, the method of Clause A7, wherein, when the second scheduling set includes a scheduling node capable of generating an instruction and does not include a parallel scheduling unit capable of generating an instruction, generating the instruction corresponding to the scheduling node comprises:
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
Clause B10, an instruction generating apparatus, comprising:
a receiving unit configured to receive a computation graph;
the statistical unit is used for counting the scheduling nodes in the calculation graph to obtain a first scheduling set;
a merging unit, configured to merge parallel nodes in the first scheduling set into a parallel scheduling unit, so as to obtain a second scheduling set including the parallel scheduling unit, where the parallel nodes are scheduling nodes meeting parallel execution conditions;
and the instruction generating unit is used for generating an instruction according to the second scheduling set.
Clause B11, the apparatus of clause B10, the statistics unit to:
counting scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions;
and obtaining a first scheduling set according to all the scheduling nodes.
Clause B12, the apparatus of Clause B11, wherein the scheduling node being an operation node meeting the scheduling condition comprises:
when the operation corresponding to the scheduling node is executed, the resource for executing the operation is in an idle state, and all dependent operation nodes of the scheduling node generate instructions;
and the output data corresponding to the dependent operation node is related to the input data corresponding to the scheduling node.
Clause B13, the apparatus of clause B11, the statistics unit further to:
and labeling the scheduling nodes according to the direction of the calculation graph to obtain a first scheduling set formed by the scheduling nodes corresponding to the labeling sequence.
Clause B14, the apparatus of any one of Clauses B10 to B13, the merging unit being configured to:
merging parallel nodes in the first scheduling set into a parallel scheduling unit;
and labeling the scheduling nodes and the parallel scheduling units in the first scheduling set according to the direction of the calculation graph to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
Clause B15, the apparatus of Clause B10, wherein the parallel node being a scheduling node meeting the parallel execution condition comprises:
when the operations corresponding to different parallel nodes are executed, the resources for executing the operations are different.
Clause B16, the apparatus of clause B10, the instruction generation unit to:
repeatedly executing the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction; wherein the operations comprise:
when a parallel scheduling unit capable of generating instructions is included in the second scheduling set, generating instructions corresponding to the parallel scheduling unit;
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes.
Clause B17, the apparatus of clause B16, the instruction generation unit further to:
and when the second scheduling set comprises a parallel scheduling unit capable of generating instructions, generating the parallel instructions corresponding to the parallel scheduling unit according to the label sequence of the parallel scheduling unit.
Clause B18, the apparatus of clause B16, the instruction generation unit further to:
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
Clause C19, an instruction generating apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of Clauses A1 to A9.
Clause D20, a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any one of Clauses A1 to A9.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

1. An instruction generation method, the method comprising:
receiving a calculation graph;
counting the scheduling nodes in the calculation graph to obtain a first scheduling set;
merging the parallel nodes in the first scheduling set into a parallel scheduling unit to obtain a second scheduling set comprising the parallel scheduling unit, wherein the parallel nodes are scheduling nodes meeting parallel execution conditions;
generating instructions according to the second scheduling set,
generating instructions according to the second scheduling set, including:
repeatedly executing the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction; wherein the operations comprise:
when a parallel scheduling unit capable of generating instructions is included in the second scheduling set, generating instructions corresponding to the parallel scheduling unit;
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes.
2. The method of claim 1, wherein the counting the scheduling nodes in the computational graph to obtain a first scheduling set comprises:
counting scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions;
and obtaining a first scheduling set according to all the scheduling nodes.
3. The method of claim 2, wherein the scheduling node being an operation node meeting the scheduling condition comprises:
when the operation corresponding to the scheduling node is executed, the resource for executing the operation is in an idle state, and all dependent operation nodes of the scheduling node generate instructions;
and the output data corresponding to the dependent operation node is related to the input data corresponding to the scheduling node.
4. The method of claim 2, wherein obtaining the first scheduling set according to all the scheduling nodes comprises:
and labeling the scheduling nodes according to the direction of the calculation graph to obtain a first scheduling set formed by the scheduling nodes corresponding to the labeling sequence.
5. The method according to any of claims 1 to 4, wherein said merging the parallel nodes in the first scheduling set into parallel scheduling units to obtain a second scheduling set including parallel scheduling units comprises:
merging parallel nodes in the first scheduling set into a parallel scheduling unit;
and labeling the scheduling nodes and the parallel scheduling units in the first scheduling set according to the direction of the calculation graph to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
6. The method of claim 1, wherein the parallel node being a scheduling node that satisfies a parallel execution condition comprises:
when the operations corresponding to different parallel nodes are executed, the resources for executing the operations are different.
7. The method according to claim 1, wherein generating an instruction corresponding to a parallel scheduling unit capable of generating an instruction when the parallel scheduling unit is included in the second scheduling set comprises:
and when the second scheduling set comprises a parallel scheduling unit capable of generating instructions, generating the parallel instructions corresponding to the parallel scheduling unit according to the label sequence of the parallel scheduling unit.
8. The method according to claim 1, wherein when the second scheduling set includes a scheduling node capable of generating an instruction and does not include a parallel scheduling unit capable of generating an instruction, generating an instruction corresponding to the scheduling node comprises:
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
9. An instruction generating apparatus, comprising:
a receiving unit configured to receive a computation graph;
the statistical unit is used for counting the scheduling nodes in the calculation graph to obtain a first scheduling set;
a merging unit, configured to merge parallel nodes in the first scheduling set into a parallel scheduling unit, so as to obtain a second scheduling set including the parallel scheduling unit, where the parallel nodes are scheduling nodes meeting parallel execution conditions;
an instruction generation unit for generating an instruction according to the second scheduling set,
the instruction generation unit is to:
repeatedly executing the following operations until the second scheduling set contains no scheduling node or parallel scheduling unit capable of generating an instruction; wherein the operations comprise:
when a parallel scheduling unit capable of generating instructions is included in the second scheduling set, generating instructions corresponding to the parallel scheduling unit;
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes.
10. The apparatus of claim 9, wherein the statistics unit is configured to:
counting scheduling nodes in the calculation graph, wherein the scheduling nodes are operation nodes meeting scheduling conditions;
and obtaining a first scheduling set according to all the scheduling nodes.
11. The apparatus of claim 10, wherein the scheduling node being an operation node meeting a scheduling condition comprises:
when the operation corresponding to the scheduling node is executed, the resource for executing the operation is in an idle state, and all dependent operation nodes of the scheduling node generate instructions;
and the output data corresponding to the dependent operation node is related to the input data corresponding to the scheduling node.
12. The apparatus of claim 10, wherein the statistics unit is further configured to:
and labeling the scheduling nodes according to the direction of the calculation graph to obtain a first scheduling set formed by the scheduling nodes corresponding to the labeling sequence.
13. The apparatus according to any one of claims 9 to 12, wherein the merging unit is configured to:
merging parallel nodes in the first scheduling set into a parallel scheduling unit;
and labeling the scheduling nodes and the parallel scheduling units in the first scheduling set according to the direction of the calculation graph to obtain a second scheduling set formed by the scheduling nodes and the parallel scheduling units corresponding to the labeling sequence.
14. The apparatus of claim 9, wherein the parallel node being a scheduling node that satisfies a parallel execution condition comprises:
when the operations corresponding to different parallel nodes are executed, the resources for executing the operations are different.
15. The apparatus of claim 9, wherein the instruction generation unit is further configured to:
and when the second scheduling set comprises a parallel scheduling unit capable of generating instructions, generating the parallel instructions corresponding to the parallel scheduling unit according to the label sequence of the parallel scheduling unit.
16. The apparatus of claim 9, wherein the instruction generation unit is further configured to:
and when the second scheduling set contains scheduling nodes capable of generating instructions and does not contain parallel scheduling units capable of generating instructions, generating instructions corresponding to the scheduling nodes according to the label sequence of the scheduling nodes.
17. An instruction generating apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1 to 8.
CN201910671036.5A 2019-07-24 2019-07-24 Operation method, device and related product Active CN110377340B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910671036.5A CN110377340B (en) 2019-07-24 2019-07-24 Operation method, device and related product
CN202110515866.6A CN113204373A (en) 2019-07-24 2019-07-24 Operation method, device and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910671036.5A CN110377340B (en) 2019-07-24 2019-07-24 Operation method, device and related product

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110515866.6A Division CN113204373A (en) 2019-07-24 2019-07-24 Operation method, device and related product

Publications (2)

Publication Number Publication Date
CN110377340A CN110377340A (en) 2019-10-25
CN110377340B true CN110377340B (en) 2021-06-01

Family

ID=68255392

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110515866.6A Pending CN113204373A (en) 2019-07-24 2019-07-24 Operation method, device and related product
CN201910671036.5A Active CN110377340B (en) 2019-07-24 2019-07-24 Operation method, device and related product

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110515866.6A Pending CN113204373A (en) 2019-07-24 2019-07-24 Operation method, device and related product

Country Status (1)

Country Link
CN (2) CN113204373A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905524A (en) * 2019-11-19 2021-06-04 安徽寒武纪信息科技有限公司 Adaptive algorithm operation method and device
CN112905525B (en) * 2019-11-19 2024-04-05 中科寒武纪科技股份有限公司 Method and equipment for controlling computing device to perform computation
CN111222636B (en) * 2020-01-07 2023-06-06 深圳鲲云信息科技有限公司 Deep learning model conversion method, device, server and storage medium
CN114911630B (en) * 2022-07-14 2022-11-04 小米汽车科技有限公司 Data processing method and device, vehicle, storage medium and chip
CN115269016A (en) * 2022-09-27 2022-11-01 之江实验室 Instruction execution method and device for graph calculation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426553A (en) * 2017-08-21 2019-03-05 上海寒武纪信息科技有限公司 Task cutting device and method, Task Processing Unit and method, multi-core processor

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4196614B2 (en) * 2002-08-22 2008-12-17 パナソニック株式会社 Instruction scheduling method, instruction scheduling apparatus, and program
JP4917138B2 (en) * 2009-10-07 2012-04-18 インターナショナル・ビジネス・マシーンズ・コーポレーション Object optimum arrangement device, object optimum arrangement method, and object optimum arrangement program
CN103164190B (en) * 2013-03-02 2016-06-01 中国科学院对地观测与数字地球科学中心 A kind of fast parallelization method of full distributed river basin ecological hydrology model
KR20150040662A (en) * 2013-10-07 2015-04-15 삼성전자주식회사 Method and Apparatus for instruction scheduling using software pipelining
CN104239137B (en) * 2014-08-21 2017-12-08 东软集团股份有限公司 Multi-model Method of Scheduling Parallel and device based on DAG node optimal paths
CN107622427B (en) * 2016-07-13 2021-04-06 阿里巴巴集团控股有限公司 Deep learning method, device and system
US10191724B2 (en) * 2016-10-21 2019-01-29 Intel Corporation Compiler-based instruction scoreboarding
US10580190B2 (en) * 2017-10-20 2020-03-03 Westghats Technologies Private Limited Graph based heterogeneous parallel processing system
CN109814986B (en) * 2017-11-20 2021-01-05 上海寒武纪信息科技有限公司 Task parallel processing method, storage medium, computer equipment, device and system
CN111104169B (en) * 2017-12-29 2021-01-12 上海寒武纪信息科技有限公司 Instruction list scheduling method and device, computer equipment and storage medium
CN109213587B (en) * 2018-09-12 2021-11-09 中国人民解放军战略支援部队信息工程大学 Multi-Stream parallel DAG graph task mapping strategy under GPU platform
CN109684087B (en) * 2018-12-17 2020-01-10 中科寒武纪科技股份有限公司 Operation method, device and related product
CN110046173B (en) * 2019-01-08 2022-02-22 北京奥星贝斯科技有限公司 Method and device for generating scheduling information and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426553A (en) * 2017-08-21 2019-03-05 上海寒武纪信息科技有限公司 Task cutting device and method, Task Processing Unit and method, multi-core processor

Also Published As

Publication number Publication date
CN113204373A (en) 2021-08-03
CN110377340A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110377340B (en) Operation method, device and related product
US11163610B2 (en) Method, device, and computer program product for assigning tasks to dedicated processing resources
US9424079B2 (en) Iteration support in a heterogeneous dataflow engine
CN107450972A (en) A kind of dispatching method, device and electronic equipment
US20220092439A1 (en) Decoupled architecture for artificial intelligence model management
CN111507476A (en) Method, apparatus and computer program product for deploying machine learning model
US11775269B2 (en) Generating a synchronous digital circuit from a source code construct defining a function call
US20220101194A1 (en) Method, electronic device, and computer program product for processing machine learning model
CN115081598B (en) Operator processing method and device, electronic equipment and computer readable storage medium
CN109885310A (en) A kind of method and device reducing mobile phone games Shader module EMS memory occupation
US10789400B2 (en) Scheduling simultaneous optimization of multiple very-large-scale-integration designs
Carderera et al. Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions
CN111133458A (en) Enhancing neural networks
CN107168993A (en) Handle method, equipment, client device and the electronic equipment of response data
CN114881214A (en) Processing method and processing device of neural network computation graph
US11461291B2 (en) Method, electronic device and computer program product for processing machine learning model
US20220172044A1 (en) Method, electronic device, and computer program product for deploying machine learning model
US9043582B2 (en) Enhanced instruction scheduling during compilation of high level source code for improved executable code
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
Gorodnyaya Method of Paradigmatic Analysis of Programming Languages and Systems.
CN113296788B (en) Instruction scheduling method, device, equipment and storage medium
CN114595047A (en) Batch task processing method and device
CN110209397B (en) Data processing method, device and system
CN110378471B (en) Operation method, device and related product
WO2021097784A1 (en) Method and system for constructing compiler intermediate representations from tensorflow graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant