CN115203126B - Operator fusion processing method, device, equipment and storage medium - Google Patents

Publication number: CN115203126B
Authority: CN (China)
Legal status: Active
Application number: CN202211118517.1A
Other languages: Chinese (zh)
Other versions: CN115203126A (en)
Inventors: 闫夏超, 徐旎林, 张文斌, 叶楠, 高伟
Current assignee: Taichu Wuxi Electronic Technology Co ltd
Original assignee: Taichu Wuxi Electronic Technology Co ltd
Application filed by Taichu Wuxi Electronic Technology Co ltd
Priority to CN202211118517.1A
Publication of CN115203126A (application) and CN115203126B (grant)

Classifications

    • G06F15/80 — Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F9/5044 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering hardware capabilities
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an operator fusion processing method, device, equipment and storage medium, applied to a heterogeneous many-core acceleration processor, comprising the following steps: receiving an operation request of a target network, determining a plurality of target operators, and calling a forward fusion interface of each target operator through a master core; performing forward calculation on each target operator through a plurality of slave cores using the matched acceleration components, obtaining a first output result for each target operator; determining a target output result among the first output results according to the composition structure of the target operators in the network, and writing the target output result back to memory; and calling the reverse fusion interface of each target operator through the master core, and performing reverse calculation on each target operator through the plurality of slave cores using the matched acceleration components, obtaining a second output result for each target operator. The technical scheme of the embodiments of the invention can reduce memory-access bandwidth occupancy, improve the processing efficiency of operators in the target network, and raise the utilization rate of hardware resources in the processor.

Description

Operator fusion processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an operator fusion processing method, apparatus, device, and storage medium.
Background
With the development of computer technology, deep learning has become an important technique in the field of artificial intelligence. To improve the computational efficiency of a deep learning network, the network algorithm needs to be combined with a specific hardware structure so as to achieve the best computational effect.
At present, the computation of a deep learning network is usually executed by an Artificial Intelligence (AI) acceleration processor. Such processors mostly adopt a heterogeneous many-core design, which can reduce power consumption to a certain extent while delivering high computing performance.
However, a conventional AI acceleration processor has the following defects when computing a deep learning network. First, if the network involves calls to multiple operators, the processor usually writes the intermediate results between operators back to memory, which occupies the limited memory-access bandwidth and degrades overall computational efficiency. Second, operator processing usually includes both non-matrix multiplication and matrix multiplication operations, while the matrix multiplication acceleration component and the non-matrix multiplication acceleration component in a slave core are independent of each other; consequently, only one acceleration component is used when a slave core processes an operator, the other component sits idle, and hardware resources are not fully utilized.
Disclosure of Invention
The invention provides an operator fusion processing method, device, equipment and storage medium, which can reduce memory-access bandwidth occupancy, improve the processing efficiency of operators in a target network, and raise the utilization rate of hardware resources in the processor.
According to an aspect of the present invention, an operator fusion processing method is provided, which is applied to a heterogeneous many-core accelerated processor, and includes:
after receiving an operation request of a target network, determining a plurality of target operators included in the target network, and calling a forward fusion interface of each target operator through a master core;
performing forward calculation on each target operator by a plurality of slave cores and adopting an accelerating component matched with each target operator to obtain a first output result corresponding to each target operator;
determining a target output result in each first output result according to the composition structure of each target operator in a target network, and writing the target output result back to a memory;
and calling the reverse fusion interface of each target operator through the master core, and performing reverse calculation on each target operator according to the target output result by adopting an accelerating component matched with each target operator through a plurality of slave cores to obtain a second output result corresponding to each target operator.
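Read together, the four steps above amount to: fuse the operators, run the forward pass on the slave cores, persist only what the backward pass will need, then run the backward pass. A minimal, purely illustrative Python simulation of that control flow (all class and method names are invented for this sketch; the patent does not specify an API):

```python
# Hypothetical sketch only: "memory" below stands in for shared main memory;
# results not written back would, on the real processor, stay in slave-core caches.

class FusedRun:
    def __init__(self, operators, keep_back):
        self.operators = operators        # ordered target operators, e.g. ["A", "B", "C"]
        self.keep_back = set(keep_back)   # operators whose forward output must survive
        self.memory = {}                  # stands in for shared main memory

    def forward(self):
        # Steps 1-2: master "calls the forward fusion interface"; slave cores compute.
        first = {op: f"fwd({op})" for op in self.operators}
        # Step 3: only results needed by the backward pass are written back;
        # intermediate results stay in the slave cores' caches.
        self.memory = {op: r for op, r in first.items() if op in self.keep_back}
        return first

    def backward(self):
        # Step 4: backward fusion interface; the backward computation reads the
        # written-back target outputs from memory (or recomputes the rest).
        return {op: f"bwd({op}, {self.memory.get(op, 'recomputed')})"
                for op in self.operators}

run = FusedRun(["A", "B", "C"], keep_back=["B", "C"])
run.forward()
second = run.backward()
print(sorted(run.memory))   # only B and C were written back
print(second["A"])          # A's forward result was not persisted
```

The point of the sketch is the asymmetry in step 3: everything is computed, but only a subset of the first output results ever touches main memory.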
According to another aspect of the present invention, there is provided an operator fusion processing apparatus, for use in a heterogeneous many-core accelerated processor, the apparatus comprising:
the forward interface calling module is used for determining a plurality of target operators included in the target network after receiving an operation request of the target network, and calling the forward fusion interface of each target operator through the master core;
the forward calculation module is used for performing forward calculation on each target operator by adopting an acceleration component matched with each target operator through a plurality of slave cores to obtain a first output result corresponding to each target operator;
a result writing module, configured to determine a target output result in each first output result according to a structure of each target operator in a target network, and write the target output result back to the memory;
and the reverse calculation module is used for calling a reverse fusion interface of each target operator through the master core, and performing reverse calculation on each target operator according to the target output result by adopting the accelerating components matched with each target operator through the plurality of slave cores to obtain a second output result corresponding to each target operator.
According to another aspect of the present invention, there is provided an electronic apparatus including:
one or more heterogeneous many-core acceleration processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more heterogeneous many-core acceleration processors, the processors implement the operator fusion processing method according to any embodiment of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program, which when executed by a processor implements the operator fusion processing method according to any one of the embodiments of the present invention.
According to the technical scheme provided by the embodiments of the invention, after an operation request of a target network is received, a plurality of target operators included in the target network are determined and the forward fusion interface of each target operator is called through the master core. Forward calculation is then performed on each target operator through a plurality of slave cores using the acceleration component matched with that operator, yielding a first output result for each target operator. A target output result is determined among the first output results according to the composition structure of the target operators in the target network and written back to memory. Finally, the reverse fusion interface of each target operator is called through the master core, and the slave cores perform reverse calculation on each target operator according to the target output result using the matched acceleration components, yielding a second output result for each target operator.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a flowchart of an operator fusion processing method according to an embodiment of the present invention.
Fig. 2 is a flowchart of another operator fusion processing method according to the second embodiment of the present invention.
Fig. 3 is a flowchart of another operator fusion processing method according to the third embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an operator fusion processing apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device implementing the operator fusion processing method according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of an operator fusion processing method according to an embodiment of the present invention. This embodiment is applicable to fusing operators in a deep learning network. The method may be executed by an operator fusion processing apparatus, which may be implemented in hardware and/or software and configured in a heterogeneous many-core acceleration processor. As shown in fig. 1, the method includes:
step 110, after receiving an operation request of the target network, determining a plurality of target operators included in the target network, and calling a forward fusion interface of each target operator through the main core.
In this embodiment, the target network may be a neural network in deep learning. After receiving an operation request for the target network, the processor may determine the plurality of target operators according to the operation request, and call the forward fusion interface of each target operator through the master core in the processor.
In a specific embodiment, the heterogeneous many-core acceleration processor may include a master core and a plurality of slave cores, and the master core and the slave cores may share the memory in the processor. Each slave core has an independent cache space as well as acceleration components that exclusively handle matrix multiplication and non-matrix-multiplication operations, respectively.
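The hardware arrangement described above can be sketched as a toy model (illustrative only; the class names, the core count of 4, and the choice of ReLU as the non-matrix operation are assumptions of this sketch, not details from the patent):

```python
from dataclasses import dataclass, field

@dataclass
class SlaveCore:
    # Each slave core has an independent cache plus two mutually
    # independent acceleration components (modelled as methods here).
    cache: dict = field(default_factory=dict)

    def matmul_accel(self, a, b):
        # Stands in for the matrix multiplication acceleration component.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]

    def vector_accel(self, xs):
        # Stands in for the non-matrix-multiplication component
        # (ReLU chosen arbitrarily as an example elementwise operation).
        return [max(0.0, x) for x in xs]

@dataclass
class Processor:
    # The master core and all slave cores share this memory.
    memory: dict = field(default_factory=dict)
    slaves: list = field(default_factory=lambda: [SlaveCore() for _ in range(4)])

p = Processor()
print(p.slaves[0].matmul_accel([[1, 2]], [[3], [4]]))  # [[11]]
print(p.slaves[0].vector_accel([-1.0, 2.0]))           # [0.0, 2.0]
```

The two methods being independent mirrors the defect noted in the background: when only one of them is exercised per operator, the other is idle.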
Step 120, performing forward calculation on each target operator through the plurality of slave cores using the acceleration component matched with each target operator, to obtain a first output result corresponding to each target operator.
In this embodiment, after the forward fusion interface of each target operator is called by the master core, the plurality of slave cores may be started, and each slave core may use the acceleration component matched with the operation type of a target operator to perform forward calculation on that operator, obtaining the first output result corresponding to each target operator.
Step 130, determining a target output result in each first output result according to the composition structure of each target operator in the target network, and writing the target output result back to the memory.
In this embodiment, optionally, after the first output result corresponding to each target operator is obtained, the intermediate processing results may be screened out of the first output results according to the composition structure of the target operators in the target network; the remaining output results are taken as the target output results, which are finally written back to memory.
The advantage of this arrangement is that intermediate processing results are not written to memory frequently, which reduces memory-access bandwidth occupancy, lessens the impact of the memory bottleneck on operator processing, and thereby improves the processing efficiency of operators in the target network.
In this embodiment, after the target output result is written back to the memory, the plurality of slave cores may be closed to end the forward calculation process of the target operator.
Step 140, calling the reverse fusion interface of each target operator through the master core, and performing reverse calculation on each target operator according to the target output result through the plurality of slave cores using the acceleration component matched with each target operator, to obtain a second output result corresponding to each target operator.
In this embodiment, after the forward calculation process of the target operators is completed, the reverse fusion interface of each target operator may be called by the master core, the plurality of slave cores are started, and then, each target operator is reversely calculated by the plurality of slave cores by using an acceleration component matched with the operation type of the target operator, so as to obtain a second output result corresponding to each target operator.
The advantage of this arrangement is twofold. First, fusing a plurality of target operators across the slave cores reduces the overhead of starting the slave cores. Second, because the slave cores use the acceleration component matched with each target operator for forward or reverse calculation, all acceleration components in a slave core can be fully used and none is left idle, which raises the utilization rate of hardware resources in the processor.
According to the technical scheme provided by the embodiments of the invention, after an operation request of a target network is received, a plurality of target operators included in the target network are determined and the forward fusion interface of each target operator is called through the master core. Forward calculation is performed on each target operator through a plurality of slave cores using the acceleration component matched with that operator, yielding a first output result for each target operator. A target output result is determined among the first output results according to the composition structure of the target operators in the target network and written back to memory. The reverse fusion interface of each target operator is then called through the master core, and the slave cores perform reverse calculation on each target operator according to the target output result using the matched acceleration components, yielding a second output result for each target operator. In this way, memory-access bandwidth occupancy is reduced, the processing efficiency of operators in the target network is improved, and the utilization rate of hardware resources in the processor is increased.
Fig. 2 is a flowchart of an operator fusion processing method according to a second embodiment of the present invention, which is a further refinement of the foregoing embodiment. As shown in fig. 2, the method includes:
step 210, after receiving an operation request of the target network, determining a plurality of target operators included in the target network, and calling a forward fusion interface of each target operator through the master core.
In a specific embodiment, assuming that the target network includes an operator A, an operator B, and an operator C, the forward fusion interfaces of operator A, operator B, and operator C may be called through the master core, and the plurality of slave cores may be started.
Step 220, obtaining the first input data corresponding to each target operator from the memory through a plurality of slave cores, and storing the first input data corresponding to each target operator into a corresponding cache.
In this step, specifically, taking operator A, operator B, and operator C as an example, the slave cores may initiate data loads from memory to fetch the forward input data (i.e., the first input data) of operator A, operator B, and operator C into the corresponding caches.
Step 230, acquiring the first input data corresponding to each target operator from the corresponding cache through the plurality of slave cores, and performing forward calculation on each target operator according to the first input data using the acceleration component matched with each target operator, to obtain a first output result corresponding to each target operator.
In an implementation manner of the embodiment of the present invention, performing forward calculation on each target operator by using an acceleration component matched with each target operator according to the first input data to obtain a first output result corresponding to each target operator, includes:
if the operation type of a target operator is non-matrix multiplication, performing forward calculation on the target operator according to the first input data through the plurality of slave cores using the non-matrix multiplication acceleration component, to obtain the first output result corresponding to the target operator;
and if the operation type of a target operator is matrix multiplication, performing forward calculation on the target operator according to the first input data through the plurality of slave cores using the matrix multiplication acceleration component, to obtain the first output result corresponding to the target operator.
In a specific embodiment, taking operator A, operator B, and operator C as an example, and assuming that the operation type of operator A and operator C is non-matrix multiplication while the operation type of operator B is matrix multiplication, the forward calculation of operator B may be completed by the slave cores using the matrix multiplication acceleration component, and the forward calculation of operator A and operator C using the non-matrix multiplication acceleration component.
Compared with the prior-art mode in which the plurality of slave cores simultaneously process only a single operator, this arrangement prevents one of the acceleration components in a slave core from sitting idle: by fusing multiple operators, the slave cores can bring both acceleration components into full play, thereby raising the utilization rate of hardware resources in the processor.
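This per-operator dispatch can be sketched concretely (illustrative only; the operator names match the A/B/C example above, but the component functions and the choice of ReLU are invented for this sketch):

```python
# Hypothetical stand-ins for the two slave-core acceleration components.

def matmul_component(a, b):
    # Matrix multiplication acceleration component (naive reference version).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def non_matmul_component(xs):
    # Non-matrix-multiplication component; ReLU chosen as an example.
    return [max(0, x) for x in xs]

def forward_fused(operators, inputs):
    """Route each operator in the fused group to its matching component,
    so both components of a slave core do useful work within one group."""
    results = {}
    for name, op_type in operators:
        if op_type == "matmul":
            results[name] = matmul_component(*inputs[name])
        else:
            results[name] = non_matmul_component(inputs[name])
    return results

# Operators A and C: non-matrix; operator B: matrix multiplication.
ops = [("A", "elementwise"), ("B", "matmul"), ("C", "elementwise")]
data = {"A": [-1, 2, -3], "B": ([[1, 2]], [[3], [4]]), "C": [5, -6]}
out = forward_fused(ops, data)
print(out["A"])  # [0, 2, 0]
print(out["B"])  # [[11]]
```

The same routing applies unchanged to the reverse calculation in step 270, with the backward inputs substituted.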
Step 240, determining a target output result in each first output result according to the composition structure of each target operator in the target network, and writing the target output result back to the memory.
Step 250, calling the reverse fusion interface of each target operator through the master core.
Step 260, obtaining the target output result and the second input data corresponding to each target operator from the memory through the plurality of slave cores, and storing the target output result and the second input data into the corresponding cache.
In this step, specifically, taking operator A, operator B, and operator C as an example, the slave cores may initiate data loads from memory to fetch the reverse input data (i.e., the second input data) of operator A, operator B, and operator C into the corresponding caches.
Step 270, acquiring the target output result and the second input data from the cache through the plurality of slave cores, and performing reverse calculation on each target operator according to the target output result and the second input data using the acceleration component matched with each target operator, to obtain a second output result corresponding to each target operator.
In this step, similarly, taking operator A, operator B, and operator C as an example, and assuming that the operation type of operator A and operator C is non-matrix multiplication while the operation type of operator B is matrix multiplication, the reverse calculation of operator B may be completed by the slave cores using the matrix multiplication acceleration component, and the reverse calculation of operator A and operator C using the non-matrix multiplication acceleration component.
According to the technical scheme provided by this embodiment, after an operation request of a target network is received, a plurality of target operators included in the target network are determined and the forward fusion interface of each target operator is called through the master core. The first input data of each target operator are fetched from memory by the slave cores and stored in the caches; the slave cores then read the first input data from the caches and, using the acceleration component matched with each target operator, perform forward calculation to obtain a first output result for each target operator. A target output result is determined among the first output results according to the composition structure of the target operators in the target network and written back to memory. The reverse fusion interface of each target operator is then called through the master core; the slave cores fetch the target output result and the second input data of each target operator from memory into the caches, read them back, and perform reverse calculation on each target operator using the matched acceleration components to obtain a second output result for each target operator. In this way, memory-access bandwidth occupancy is reduced, the processing efficiency of operators in the target network is improved, and the utilization rate of hardware resources in the processor is increased.
Fig. 3 is a flowchart of an operator fusion processing method according to a third embodiment of the present invention, which is a further refinement of the foregoing embodiment. As shown in fig. 3, the method includes:
step 310, after receiving an operation request of the target network, determining a plurality of target operators included in the target network, and calling a forward fusion interface of each target operator through the main core.
Step 320, performing forward calculation on each target operator through the plurality of slave cores using the acceleration component matched with each target operator, to obtain a first output result corresponding to each target operator.
In this embodiment, optionally, if the operation types of the plurality of target operators include both non-matrix multiplication and matrix multiplication, performing forward calculation on each target operator through the plurality of slave cores using the acceleration component matched with each target operator, to obtain the first output result corresponding to each target operator, includes: performing forward calculation on the matched target operators through the slave cores using the matrix multiplication acceleration component first and the non-matrix multiplication acceleration component afterwards, to obtain the first output result corresponding to each target operator.
In this embodiment, the cache of each slave core is provided with double buffer spaces for the input data and the output results, and when processing each target operator a slave core may issue the matrix multiplication instruction first and then issue the non-matrix multiplication instruction.
The advantage of this arrangement is that a non-matrix multiplication is performed while a matrix multiplication is still in flight; the time consumed by the non-matrix multiplication can therefore be hidden, saving processing time for the operators in the target network.
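A toy timing model makes the hiding effect above concrete: with double buffering, the non-matrix work of one tile overlaps the matrix multiplication of the next, so only the longer of the two streams dominates the total. The function names and the cycle counts below are invented for illustration; this is not the patent's timing model.

```python
# Purely illustrative timing model; t_mm / t_non are made-up per-tile costs.

def serial_time(tiles, t_mm, t_non):
    # No overlap: every tile pays both costs back to back.
    return tiles * (t_mm + t_non)

def overlapped_time(tiles, t_mm, t_non):
    # Matmul issued first; the non-matmul of tile i runs while the
    # matmul of tile i+1 is in flight, so only a single tail of the
    # shorter stream remains exposed.
    return tiles * max(t_mm, t_non) + min(t_mm, t_non)

print(serial_time(8, t_mm=5, t_non=3))      # 64
print(overlapped_time(8, t_mm=5, t_non=3))  # 43: most non-matmul time is hidden
```

When `t_non <= t_mm`, the non-matrix time is almost entirely hidden behind the matrix multiplications, which is the saving this step describes.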
Step 330, determining a target output result among the first output results according to the composition structure of each target operator in the target network, the reverse-calculation relevance of each first output result, and the amount of calculation of each first output result.
In this embodiment, the reverse-calculation relevance characterizes how strongly a first output result is associated with the target operators during reverse calculation. Specifically, if a first output result can serve as input data of some target operator during reverse calculation, that first output result can be considered to have a higher reverse-calculation relevance.
In this step, optionally, after the first output result corresponding to each target operator is obtained, the first output result with a higher reverse calculation relevance and a higher calculation amount may be used as the target output result.
In a specific embodiment, taking an operator A, an operator B, and an operator C in the target network as an example, after the first output results of operators A, B, and C are obtained, the reverse calculation of operator C needs the first output result of operator B as input data, and the operation producing the first output result of operator B is complex and has a high calculation amount; therefore, the first output results of operators B and C can be written back to the memory as target output results.
Meanwhile, although the first output result of operator A is needed as input data in the reverse calculation of operator B, the operation producing it is simple and its calculation amount is low, so the first output result of operator A is not written back to the memory.
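A minimal sketch of this selection rule follows. The field names and the cost threshold are assumptions chosen for illustration, not values from the embodiment; an output is written back only when it both feeds some backward pass and is expensive to regenerate.

```python
def select_target_outputs(first_outputs, cost_threshold=10):
    """Decide which forward outputs become target outputs (written back).

    `first_outputs` maps operator name -> dict with two hypothetical fields:
      needed_in_backward: bool, the output feeds some operator's backward pass
      recompute_cost:     rough cost of regenerating it (arbitrary units)
    """
    target, recompute = [], []
    for name, info in first_outputs.items():
        if info["needed_in_backward"] and info["recompute_cost"] >= cost_threshold:
            target.append(name)      # write back to memory
        else:
            recompute.append(name)   # keep in cache / recompute during backward
    return target, recompute
```

For the A/B/C example above, B and C are selected while A is recomputed.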
Step 340, writing the target output result back to the memory.
Step 350, calling the reverse fusion interface of each target operator through the master core, and performing reverse calculation on each target operator according to the target output result through the plurality of slave cores by using the acceleration component matched with each target operator, to obtain a second output result corresponding to each target operator.
In this embodiment, optionally, if the operation types of the multiple target operators include both non-matrix multiplication and matrix multiplication, performing reverse calculation on each target operator according to the target output result through the multiple slave cores by using the matched acceleration components to obtain a second output result corresponding to each target operator includes: performing, through the multiple slave cores, reverse calculation on the matched target operators with the matrix multiplication acceleration component according to the target output results first and then with the non-matrix multiplication acceleration component, to obtain the second output result corresponding to each target operator.
The advantage of the arrangement is that the non-matrix multiplication operation can be executed simultaneously in the process of executing the matrix multiplication operation, so that the time consumption of the non-matrix multiplication operation can be hidden, and the processing time of operators in the target network can be saved.
According to the technical solution provided by the embodiment of the present invention, after an operation request of a target network is received, the multiple target operators included in the target network are determined, and the forward fusion interface of each target operator is called through the master core. Forward calculation is performed on each target operator through the multiple slave cores with the matched acceleration components to obtain a first output result of each target operator. A target output result is determined among the first output results according to the composition structure of each target operator in the target network, the reverse calculation association degree of each first output result, and the calculation amount of each first output result, and the target output result is written back to the memory. The reverse fusion interface of each target operator is then called through the master core, and reverse calculation is performed on each target operator according to the target output result through the multiple slave cores with the matched acceleration components, to obtain a second output result corresponding to each target operator.
In order to better introduce the technical solution provided by the embodiment of the present invention, it is assumed in the embodiment of the present invention that the target network includes the following operators: batch Normalization (BN), activation function Relu, and Convolution function Convolution. When the heterogeneous many-core acceleration processor performs fusion processing on the operators, reference may be made to the following embodiments:
S1. First process the forward calculation of the network: call the forward fusion interfaces of operators BN, Relu, and Convolution through the master core. According to the characteristics of its algorithm, the operator BN can be decomposed into two parts, BNStatistics and BNFinalize, and then the multiple slave cores are started. BNStatistics is the forward statistical calculation layer of the BN operator, and BNFinalize is the forward final-output calculation layer of the BN operator; BNFinalize produces the final forward output result of the BN operator from the statistical results of BNStatistics.
S2. The slave cores initiate data loading from the memory and fetch the input data of operators BNFinalize, Relu, Convolution, and BNStatistics into the cache.
The input data of BNFinalize includes xBN and the coefficients for calculating BNFinalize, which can be loaded by the data transmission engine. The input data of Relu is the output result of BNFinalize and needs no loading. The input data of Convolution includes the output of Relu and the weight, and the weight is loaded by the data transmission engine. The input data of BNStatistics includes the output result of Convolution and the coefficients for calculating BNStatistics, and the coefficients are loaded by the data transmission engine. Here xBN is the input data of the forward calculation performed by the operator BN.
S3. Each slave core completes the forward calculation of Convolution with the matrix multiplication acceleration component, and the forward calculations of BNFinalize, Relu, and BNStatistics with the non-matrix multiplication acceleration component. The output results of BNFinalize and Relu are consumed directly on the cache as the calculation proceeds along the network, without being written back to the memory; the output results of Convolution and BNStatistics are written back to the memory and used as the input of the lower-layer network.
S4. Repeat steps S2 and S3 until all forward calculations of BNFinalize, Relu, Convolution, and BNStatistics are completed.
S5. Close the slave cores; the network forward calculation is complete.
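The forward pipeline S1–S5 can be sketched in NumPy on a single data tile. This is a hedged illustration: the convolution is modeled as a plain matrix multiplication, the BN mean/variance stand in for the loaded BNFinalize coefficients, and the per-tile loop, slave cores, and acceleration components are all omitted.

```python
import numpy as np

def bn_statistics(x):
    # BNStatistics: forward statistical layer of BN (per-feature mean and variance)
    return x.mean(axis=0), x.var(axis=0)

def bn_finalize(x, mean, var, gamma, beta, eps=1e-5):
    # BNFinalize: forward final-output layer, built from the statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def fused_forward_tile(x_bn, mean, var, gamma, beta, weight):
    y = bn_finalize(x_bn, mean, var, gamma, beta)  # BNFinalize: stays in cache
    y = np.maximum(y, 0.0)                         # Relu: stays in cache
    out = y @ weight                               # Convolution as a matmul: written back
    stats = bn_statistics(out)                     # BNStatistics of the conv output: written back
    return out, stats
```

Only `out` and `stats` would be written back to memory; the BNFinalize and Relu intermediates live entirely in the cache.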
S6. Process the reverse calculation of the network: call the reverse fusion interfaces of operators BN, Relu, and Convolution through the master core. According to the characteristics of its algorithm, the reverse calculation of the operator BN can be decomposed into two parts, BNBStatistics and BNBFinalize, and then the slave cores are started. BNBStatistics is the reverse statistical calculation layer of the BN operator, and BNBFinalize is the reverse final-output calculation layer of the BN operator; BNBFinalize produces the final reverse output gradient of the BN operator from the statistical results of BNBStatistics.
S7. The multiple slave cores initiate data loading from the memory and fetch the input data of operators BNBFinalize, ConvolutionBD, ReluB, and BNBStatistics into the cache.
The input data of BNBFinalize includes xBN, dyBN, and the coefficients for calculating BNBFinalize, loaded by the data transmission engine. The input data of ConvolutionBD includes the output result of BNBFinalize and the weight, and the weight is loaded by the data transmission engine. The input data of ReluB includes the output result of ConvolutionBD and xRelu, where xRelu was not written to the memory during the forward calculation and needs to be recalculated. The input data of BNBStatistics includes xBN, the output result of ReluB, and the coefficients for calculating BNFinalize, where xBN and the coefficients are loaded by the data transmission engine. Here dyBN is the input data of the reverse calculation performed by the operator BN; ConvolutionBD is the reverse error-update operator corresponding to Convolution; ReluB is the reverse error-update operator corresponding to Relu; and xRelu is the input data of the forward calculation of Relu.
S8. Each slave core completes the reverse calculation of ConvolutionBD with the matrix multiplication acceleration component, and the related calculations of BNBFinalize, ReluB, BNBStatistics, and xRelu with the non-matrix multiplication acceleration component. The output result of ConvolutionBD is consumed directly on the cache as the calculation proceeds along the network, without being written back; the output results of ReluB and BNBStatistics are written back to the memory and passed to the upper-layer network as the input of the back-propagation data update; the output result of BNBFinalize is written back to the memory as the input data of the back-propagation weight update. The recalculation of xRelu is completed by performing the BNFinalize operation with the input xBN of BNBStatistics and the BNFinalize coefficients.
S9. Repeat steps S7 and S8 until all reverse calculations of BNBFinalize, ConvolutionBD, ReluB, and BNBStatistics are completed.
S10. Close the slave cores; the calculation of the network reverse data update is complete.
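The recomputation of xRelu and the ReluB step from S7–S8 can be sketched as follows. This is a hedged NumPy illustration under the same single-tile assumptions as above; the function names are illustrative, not from the patent.

```python
import numpy as np

def recompute_xrelu(x_bn, mean, var, gamma, beta, eps=1e-5):
    # xRelu (the input of the forward Relu) was not written back in the
    # forward pass; redo the cheap BNFinalize step from xBN and the saved
    # BN statistics/coefficients instead of storing the tensor in memory.
    return gamma * (x_bn - mean) / np.sqrt(var + eps) + beta

def relu_backward(dy, x_relu):
    # ReluB: propagate the gradient only where the forward input was positive
    return dy * (x_relu > 0.0)
```

Trading this small recomputation for a skipped write-back is exactly what reduces the memory access bandwidth occupied by the fused operators.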
Further, in the above S2 and S3, double buffer spaces are set in the cache of each slave core for the inputs and outputs of the operators BNFinalize, Relu, Convolution, and BNStatistics, respectively. During the loop calculation of S4, the matrix-multiplication-related instructions for calculating Convolution can be submitted first, followed by the other non-matrix-multiplication-related instructions, so that hardware resources are fully utilized and the calculation times cover each other.
Similarly, in the above S7 and S8, double buffer spaces are set on the cache of each slave core for the inputs and outputs of the operators BNBFinalize, ConvolutionBD, ReluB, and BNBStatistics, respectively. During the loop calculation of S9, the matrix-multiplication-related instructions for calculating ConvolutionBD can be submitted first, followed by the other non-matrix-multiplication-related instructions, so that hardware resources are fully utilized and the calculation times cover each other.
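The double-buffer (ping-pong) pattern described above can be sketched schematically. Sequential Python stands in here for the asynchronous load/compute overlap, and the function names are illustrative assumptions: while one buffer is being computed on, the next tile would be loading into the other.

```python
def double_buffered_loop(tiles, load, compute):
    """Ping-pong over two cache buffers: while tile i is computed out of
    one buffer, tile i+1 is loaded into the other, so transfer time is
    hidden behind compute on real hardware (here the steps run in order,
    which preserves the data flow but not the timing)."""
    buffers = [None, None]
    results = []
    if not tiles:
        return results
    buffers[0] = load(tiles[0])                        # prefetch the first tile
    for i in range(len(tiles)):
        cur = buffers[i % 2]
        if i + 1 < len(tiles):
            buffers[(i + 1) % 2] = load(tiles[i + 1])  # would overlap with compute
        results.append(compute(cur))
    return results
```

The same skeleton applies to both the forward loop (S4) and the reverse loop (S9).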
In this embodiment, in S1 and S6, the forward and reverse calculation processes of the BN operator are each decomposed into two parts according to its algorithm characteristics, which makes the overall calculation loop easy to implement, allows the matrix multiplication and non-matrix multiplication processes to hide each other better, and decouples unnecessary data dependencies.
The method provided by the embodiment of the invention can reduce the occupancy rate of the memory access bandwidth, improve the processing efficiency of an operator in a target network and improve the utilization rate of hardware resources in a processor.
Fig. 4 is a schematic structural diagram of an operator fusion processing apparatus according to a fourth embodiment of the present invention, where the apparatus is applied to a heterogeneous many-core accelerated processor, as shown in fig. 4, the apparatus includes: a forward interface calling module 410, a forward calculation module 420, a result writing module 430, and a reverse calculation module 440.
The forward interface calling module 410 is configured to, after receiving an operation request of a target network, determine a plurality of target operators included in the target network, and call a forward fusion interface of each target operator through a master core;
the forward calculation module 420 is configured to perform forward calculation on each target operator through a plurality of slave cores by using an acceleration component matched with each target operator to obtain a first output result corresponding to each target operator;
a result writing module 430, configured to determine a target output result from each first output result according to a structure of each target operator in the target network, and write the target output result back to the memory;
and the reverse calculation module 440 is configured to invoke a reverse fusion interface of each target operator through the master core, and perform reverse calculation on each target operator according to the target output result by using the acceleration component matched with each target operator through the plurality of slave cores, so as to obtain a second output result corresponding to each target operator.
According to the technical solution provided by the embodiment of the present invention, after an operation request of a target network is received, the multiple target operators included in the target network are determined, and the forward fusion interface of each target operator is called through the master core. Forward calculation is performed on each target operator through the multiple slave cores with the acceleration component matched with each target operator, to obtain a first output result corresponding to each target operator. A target output result is determined among the first output results according to the composition structure of each target operator in the target network and written back to the memory. The reverse fusion interface of each target operator is then called through the master core, and reverse calculation is performed on each target operator according to the target output result through the multiple slave cores with the matched acceleration components, to obtain a second output result corresponding to each target operator.
On the basis of the above embodiment, the result writing module 430 includes:
the target result determining unit is used for determining a target output result in each first output result according to the composition structure of each target operator in the target network, the reverse calculation relevance of each first output result and the calculation amount of each first output result;
and the reverse calculation relevance is used for representing the relevance degree of the first output result and each target operator in the reverse calculation process.
The forward calculation module 420 includes:
the first data cache unit is used for acquiring first input data corresponding to each target operator from a memory through a plurality of slave cores and storing the first input data corresponding to each target operator into a corresponding cache;
the first data acquisition unit is used for acquiring first input data corresponding to each target operator from a corresponding cache through a plurality of slave cores, and performing forward calculation on each target operator by adopting an acceleration component matched with each target operator according to the first input data to obtain a first output result corresponding to each target operator;
the non-matrix multiplication operation unit is used for, if the operation type of the target operator is a non-matrix multiplication operation, performing forward calculation on the target operator according to the first input data through the plurality of slave cores by using the non-matrix multiplication acceleration component, to obtain a first output result corresponding to the target operator;
the matrix multiplication operation unit is used for, if the operation type of the target operator is a matrix multiplication operation, performing forward calculation on the target operator according to the first input data through the plurality of slave cores by using the matrix multiplication acceleration component, to obtain a first output result corresponding to the target operator;
and the forward calculation parallel execution unit is used for performing forward calculation on the matched target operators by adopting a matrix multiplication accelerating component through a plurality of slave cores and then performing forward calculation on the matched target operators by adopting a non-matrix multiplication accelerating component to obtain first output results corresponding to the target operators if the operation types of the target operators comprise non-matrix multiplication operation and matrix multiplication operation.
The inverse calculation module 440 includes:
the second data cache unit is used for acquiring a target output result and second input data corresponding to each target operator from the memory through a plurality of slave cores and storing the target output result and the second input data into corresponding caches;
the second data acquisition unit is used for acquiring a target output result and second input data from the cache through the plurality of slave cores, and performing reverse calculation on each target operator according to the target output result and the second input data by adopting an acceleration component matched with each target operator to obtain a second output result corresponding to each target operator;
and the reverse computation parallel execution unit is used for performing reverse computation on the matched target operators by adopting the matrix multiplication accelerating component according to the target output results through a plurality of slave cores if the operation types of the plurality of target operators comprise non-matrix multiplication operation and matrix multiplication operation, and then performing reverse computation on the matched target operators by adopting the non-matrix multiplication accelerating component to obtain second output results corresponding to the target operators.
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For technical details which are not described in detail in the embodiments of the present invention, reference may be made to the methods provided in all the aforementioned embodiments of the present invention.
FIG. 5 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one heterogeneous many-core acceleration processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the operator fusion processing method.
In some embodiments, the operator fusion processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the operator fusion processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the operator fusion processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. An operator fusion processing method is applied to a heterogeneous many-core accelerated processor, and comprises the following steps:
after receiving an operation request of a target network, determining a plurality of target operators included in the target network, and calling a forward fusion interface of each target operator through a main core;
if the operation types of the multiple target operators comprise non-matrix multiplication operation and matrix multiplication operation, performing forward calculation on the matched target operators by adopting a matrix multiplication accelerating component through multiple slave cores, and simultaneously performing forward calculation on the matched target operators by adopting the non-matrix multiplication accelerating component to obtain first output results corresponding to the target operators;
determining an intermediate processing result in each first output result according to the composition structure of each target operator in the target network, the reverse calculation association degree of each first output result and the calculation amount of each first output result; writing back other output results except the intermediate processing result as target output results into the memory; the reverse calculation relevance is used for representing the relevance degree of the first output result and each target operator in the reverse calculation process;
and calling the reverse fusion interface of each target operator through the master core, and performing reverse calculation on the matched target operator by adopting a matrix multiplication accelerating component according to a target output result through a plurality of slave cores, and performing reverse calculation on the matched target operator by adopting a non-matrix multiplication accelerating component to obtain a second output result corresponding to each target operator.
2. The operator fusion processing method according to claim 1, wherein the forward calculation of the matched target operators is performed by a plurality of slave kernels by using a matrix multiplication accelerating component, and the forward calculation of the matched target operators is performed by using a non-matrix multiplication accelerating component, so as to obtain the first output result corresponding to each target operator, comprising:
acquiring first input data corresponding to each target operator from a memory through a plurality of slave cores, and storing the first input data corresponding to each target operator into a corresponding cache;
and acquiring first input data corresponding to each target operator from a corresponding cache through a plurality of slave cores, and simultaneously and respectively adopting an accelerating component matched with each target operator to perform forward calculation on each target operator according to the first input data to obtain a first output result corresponding to each target operator.
3. The operator fusion processing method according to claim 2, wherein the step of performing forward computation on each target operator by using an acceleration component matched with each target operator according to the first input data to obtain a first output result corresponding to each target operator comprises:
if the operation type of the target operator is non-matrix multiplication operation, performing forward calculation on the target operator according to the first input data by adopting a non-matrix multiplication accelerating component through a plurality of secondary cores to obtain a first output result corresponding to the target operator;
and if the operation type of the target operator is matrix multiplication operation, performing forward calculation on the target operator according to the first input data by adopting a matrix multiplication accelerating component through a plurality of secondary cores to obtain a first output result corresponding to the target operator.
4. The operator fusion processing method according to claim 1, wherein the step of performing, through the plurality of slave cores, reverse calculation on the matched target operators by using the matrix multiplication acceleration component according to the target output results, while performing reverse calculation on the matched target operators by using the non-matrix multiplication acceleration component, to obtain second output results corresponding to the target operators comprises:
acquiring, through the plurality of slave cores, the target output result and second input data corresponding to each target operator from the memory, and storing the target output result and the second input data into corresponding caches;
and acquiring, through the plurality of slave cores, the target output result and the second input data from the caches, and simultaneously performing reverse calculation on each target operator according to the target output result and the second input data, each using the acceleration component matched with that target operator, to obtain the second output result corresponding to each target operator.
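A minimal sketch of the reverse step in claim 4: the saved target output result and second input data are staged into a cache and then fed to the matched component running in reverse. The `relu_backward` rule is an invented example of a non-matrix operator's reverse calculation, not something specified by the patent.

```python
# Hypothetical reverse calculation for one operator, per claim 4's data flow.
def relu_backward(grad_out, saved_input):
    # non-matmul component in reverse: gradient passes where input was positive
    return [g if x > 0 else 0.0 for g, x in zip(grad_out, saved_input)]

def backward(op_type, target_output, second_input):
    # memory -> cache stage, then reverse calculation from the cache
    cache = {"out": target_output, "in": second_input}
    if op_type == "relu":
        return relu_backward(cache["out"], cache["in"])
    raise NotImplementedError(op_type)

print(backward("relu", [1.0, 1.0], [-0.5, 2.0]))  # [0.0, 1.0]
```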
5. An operator fusion processing device, applied to a heterogeneous many-core acceleration processor, comprising:
a forward interface calling module, configured to determine a plurality of target operators included in a target network after receiving an operation request for the target network, and to call a forward fusion interface of each target operator through a master core;
a forward calculation module, configured to, if the operation types of the plurality of target operators include both non-matrix multiplication operations and matrix multiplication operations, perform forward calculation on the matched target operators through a plurality of slave cores by using a non-matrix multiplication acceleration component, while performing forward calculation on the matched target operators by using a matrix multiplication acceleration component, to obtain first output results corresponding to the target operators;
a result writing module, configured to determine an intermediate processing result among the first output results according to a composition structure of each target operator in the target network, a reverse calculation association degree of each first output result, and a calculation amount of each first output result, and to write the output results other than the intermediate processing result back into a memory as target output results, wherein the reverse calculation association degree represents the degree of association between a first output result and each target operator during reverse calculation;
and a reverse calculation module, configured to call a reverse fusion interface of each target operator through the master core, to perform reverse calculation on the matched target operators through the plurality of slave cores by using the matrix multiplication acceleration component according to the target output results, and to perform reverse calculation on the matched target operators by using the non-matrix multiplication acceleration component, to obtain second output results corresponding to the target operators.
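The result writing module's policy can be illustrated as a scoring problem: keep, as intermediate processing results, the first outputs that the reverse calculation depends on most and that are costly to recompute, and write the rest back to memory as target output results. The additive scoring below is an assumption made for illustration; the patent does not specify how the association degree and calculation amount are combined.

```python
# Hypothetical split of first output results into cache-resident intermediates
# and write-back targets, scored by (association degree + calculation amount).
def split_outputs(first_outputs, keep_count):
    # first_outputs: name -> (assoc_degree, calc_amount)
    ranked = sorted(first_outputs,
                    key=lambda n: first_outputs[n][0] + first_outputs[n][1],
                    reverse=True)
    intermediates = set(ranked[:keep_count])  # stay resident for reverse pass
    write_back = [n for n in first_outputs if n not in intermediates]
    return intermediates, write_back

outs = {"y1": (0.9, 5.0), "y2": (0.2, 1.0), "y3": (0.8, 4.0)}
inter, wb = split_outputs(outs, keep_count=2)
print(sorted(inter))  # ['y1', 'y3']
print(wb)             # ['y2']
```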
6. An electronic device, the electronic device comprising:
one or more heterogeneous many-core acceleration processors;
storage means for storing one or more programs;
wherein, when the one or more programs are executed by the one or more heterogeneous many-core acceleration processors, the one or more heterogeneous many-core acceleration processors implement the operator fusion processing method according to any one of claims 1-4.
7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the operator fusion processing method according to any one of claims 1-4.
CN202211118517.1A 2022-09-15 2022-09-15 Operator fusion processing method, device, equipment and storage medium Active CN115203126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118517.1A CN115203126B (en) 2022-09-15 2022-09-15 Operator fusion processing method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115203126A CN115203126A (en) 2022-10-18
CN115203126B true CN115203126B (en) 2023-04-18

Family

ID=83572523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118517.1A Active CN115203126B (en) 2022-09-15 2022-09-15 Operator fusion processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115203126B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796228B (en) * 2022-11-15 2024-04-05 北京百度网讯科技有限公司 Operator fusion method, device, equipment and storage medium
CN116431562B (en) * 2023-06-12 2023-11-28 太初(无锡)电子科技有限公司 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104978601A (en) * 2015-06-26 2015-10-14 深圳市腾讯计算机系统有限公司 Neural network model training system and method
CN110688159A (en) * 2017-07-20 2020-01-14 上海寒武纪信息科技有限公司 Neural network task processing system
CN114580624A (en) * 2020-11-30 2022-06-03 中科寒武纪科技股份有限公司 Method, apparatus, and computer-readable storage medium for training neural network

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN111783966A (en) * 2019-04-04 2020-10-16 北京芯启科技有限公司 Hardware device and method of deep convolutional neural network hardware parallel accelerator
CN110633153A (en) * 2019-09-24 2019-12-31 上海寒武纪信息科技有限公司 Method for realizing neural network model splitting by using multi-core processor and related product
CN114239669A (en) * 2021-04-14 2022-03-25 无锡江南计算技术研究所 Data multiplexing method based on operator fusion on heterogeneous many-core architecture



Similar Documents

Publication Publication Date Title
CN112561078B (en) Distributed model training method and related device
CN115203126B (en) Operator fusion processing method, device, equipment and storage medium
JP7454529B2 (en) Distributed model training device and method, electronic device, storage medium, and computer program
CN113850394B (en) Federal learning method and device, electronic equipment and storage medium
CN114841315A (en) Method and system for implementing hybrid expert model, electronic device and storage medium
CN114861039B (en) Parameter configuration method, device, equipment and storage medium of search engine
CN112817660A (en) Method, device, equipment and storage medium for expanding small program capacity
CN115438007A (en) File merging method and device, electronic equipment and medium
CN116701091A (en) Method, electronic device and computer program product for deriving logs
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN114692824A (en) Quantitative training method, device and equipment of neural network model
CN115759260B (en) Reasoning method and device of deep learning model, electronic equipment and storage medium
CN115660034B (en) Distributed model training method, device and system
CN115965070B (en) Computational graph processing method, apparatus, device, storage medium, and program product
CN115629879B (en) Load balancing method and device for distributed model training
CN117520461B (en) Distribution method, device, equipment and medium of logic fragments
CN115860121B (en) Text reasoning method, device, equipment and storage medium
CN114331379B (en) Method for outputting task to be handled, model training method and device
CN117632431A (en) Scheduling method, device, equipment and storage medium for cloud computing task
CN117910548A (en) Distributed model training method, device, equipment, system, medium and product
CN115329940A (en) Recommendation method, device and equipment for convolution algorithm and storage medium
CN116150048A (en) Memory optimization method, device, equipment and medium
CN117130970A (en) Multi-chip data transmission method, device, chip and storage medium
CN117539602A (en) Method and device for task speculation behind, electronic equipment and storage medium
CN117519946A (en) Memory resource scheduling method, device, equipment and medium in deep learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 214000 Floor 19 and Floor 20, No. 581 Jianzhu West Road, Wuxi City, Jiangsu Province

Patentee after: Taichu (Wuxi) Electronic Technology Co.,Ltd.

Address before: Room 318, Floor 3, Building A3, No. 777, Jianzhu West Road, Binhu District, Wuxi City, Jiangsu Province, 214000

Patentee before: Taichu (Wuxi) Electronic Technology Co.,Ltd.