CN114692861A - Computation graph updating method, computation graph processing method and related equipment


Info

Publication number
CN114692861A
Authority
CN
China
Prior art keywords: optimization, graph, initial, operations, optimization operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011606177.8A
Other languages
Chinese (zh)
Inventor
姚棋中
郑淼
何占盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011606177.8A priority Critical patent/CN114692861A/en
Publication of CN114692861A publication Critical patent/CN114692861A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The application discloses a computation graph updating method, a computation graph processing method, and related equipment, for reducing training cost. The method provided by the embodiments of the application is applied to a neural network model training scenario: the processing device updates the initial computation graph corresponding to the target neural network model according to an optimization operation set comprising multiple optimization operations, and the training device retrains the updated computation graph once to obtain the target neural network model.

Description

Computation graph updating method, computation graph processing method and related equipment
Technical Field
The embodiments of the application relate to the technical field of computers, and in particular to a computation graph updating method, a computation graph processing method, and related equipment.
Background
The key technology of artificial intelligence at present is the neural network, a complex network system formed, by analogy with the connections of human brain nerve cells, from a large number of simple processing units (called neurons) that are widely interconnected. The computation process between two adjacent layers of neurons can be abstracted as performing computation steps on input data, where a "computation step" is referred to as an Operation (OP) in a neural network. In practical applications, to facilitate analyzing the structure, computation characteristics, and data flow of the neural network, all OPs of the neural network are usually put together to form a computation graph. As the performance of deep neural networks improves, model parameters and computation amounts grow larger and larger, which severely restricts the computation speed of the model. For terminal devices with high real-time requirements, a neural network model with high resource requirements greatly increases the deployment difficulty. In addition, training a neural network consumes significant resources, which are wasted if the trained network is not utilized.
In the prior art, a computation graph is updated by, for example, inserting quantization/sparsification-related OPs into it. By participating in the computation of the weights of the computation graph, these OPs reduce the amount of computation during retraining of the neural network, achieving an acceleration effect; the weights of the retrained computation graph are then processed by the quantization/sparsification OPs and stored.
In the prior art, sparse processing and quantization processing each require a separate retraining. When a neural network needs to obtain both the sparse acceleration effect and the quantization acceleration effect, two retrainings are needed, and the training cost is high.
Disclosure of Invention
The embodiment of the application provides a computation graph updating method, a computation graph processing method and related equipment, which are used for reducing training cost.
A first aspect of the embodiments of the present application provides a computation graph updating method, including: a processing device receives a first computation graph and an optimization operation set, where the first computation graph is the initial computation graph corresponding to a target neural network model, the first computation graph includes multiple initial operations, and the optimization operation set includes multiple optimization operations; the processing device determines operations to be accelerated from the first computation graph according to the multiple optimization operations, and updates the first computation graph according to the operations to be accelerated and the multiple optimization operations to obtain a second computation graph; and the processing device sends the second computation graph to a training device, where the second computation graph is used by the training device to retrain the target neural network model.
In the first aspect, the first computation graph is a graph structure used to analyze the network structure, computation characteristics, and data flow of the neural network model, and is the initial computation graph of the target neural network model to be finally obtained. The computation graph includes multiple initial operations corresponding to computation steps in the neural network model. The optimization operation set includes multiple optimization operations, which may be quantization operations corresponding to fixed-point quantization, sparse operations corresponding to sparsification, Winograd operations corresponding to Winograd convolution, or operations corresponding to other future optimization methods. The processing device may receive the first computation graph and the optimization operation set, determine the operations to be accelerated in the first computation graph according to the order of the multiple optimization operations in the optimization operation set, update the first computation graph to obtain a second computation graph, and send the second computation graph to the training device. In this way, multiple acceleration effects can be obtained with only one retraining, reducing the training cost.
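As an illustration of this first aspect, the following Python sketch inserts optimization operations into a toy computation graph in the order given by the optimization operation set. The Node structure, the insert-before-the-matched-operation rule, and all names are hypothetical simplifications, not the patent's implementation.

```python
# Hypothetical sketch: update a computation graph by inserting optimization
# operations in the preset order of the optimization operation set.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op_type: str                       # e.g. "matmul", "add", "winograd"
    inputs: list = field(default_factory=list)

def update_graph(nodes, optimization_ops):
    """optimization_ops: ordered list of (opt_type, accelerated_op_types)."""
    for opt_type, targets in optimization_ops:
        updated = []
        for node in nodes:
            if node.op_type in targets:
                # Insert the optimization operation before the matched initial
                # operation (insertion after it works analogously).
                opt = Node(f"{opt_type}_{node.name}", opt_type, node.inputs)
                node.inputs = [opt.name]
                updated.append(opt)
            updated.append(node)
        nodes = updated
    return nodes

graph = [Node("conv1", "matmul", ["x"]), Node("add1", "add", ["conv1"])]
opt_set = [("winograd", {"matmul"}), ("sparse", {"matmul"}),
           ("quant", {"matmul", "add"})]
for n in update_graph(graph, opt_set):
    print(n.name, n.op_type, n.inputs)
```

Note that because each later pass runs over the already-updated graph, an optimization operation ranked later can also match operations inserted by earlier passes, which mirrors the behavior described in the possible designs below.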
In one possible design, that the processing device determines operations to be accelerated from the first computation graph according to the multiple optimization operations and updates the first computation graph according to the operations to be accelerated and the multiple optimization operations includes: the processing device determines a first initial operation from the multiple initial operations according to the correspondence between a first optimization operation and the first initial operation, where the first initial operation belongs to the operations to be accelerated; the processing device inserts the first optimization operation into the first computation graph, where the inserted first optimization operation is an operation set before or after the first initial operation, and the operation set includes one or more operations; the processing device determines a second initial operation from the multiple initial operations of the first computation graph into which the first optimization operation has been inserted, according to the correspondence between a second optimization operation and the second initial operation, where the second initial operation belongs to the operations to be accelerated, the second optimization operation is adjacent to the first optimization operation in the optimization operation set, and the second optimization operation is arranged after the first optimization operation; and the processing device inserts the second optimization operation into the first computation graph containing the first optimization operation, where the inserted second optimization operation is an operation set before or after the second initial operation.
In this possible design, the processing device may update the first computation graph in the order of the optimization operations in the optimization operation set. For example, it may determine the first initial operation in the first computation graph according to the correspondence between the first optimization operation, ranked first in the optimization operation set, and the first initial operation among the operations to be accelerated, and then insert the first optimization operation before, after, or at both ends of the first initial operation. Similarly, the second initial operation is determined, according to the second optimization operation ranked second in the optimization operation set, from the first computation graph into which the first optimization operation has been inserted, and the second optimization operation is inserted before, after, or at both ends of the second initial operation. In this embodiment, the first optimization operation may also contain the second initial operation, that is, the second optimization operation may also accelerate the first optimization operation, thereby improving the computation rate.
In one possible design, the computation graph updating method further includes: the processing device determines a third initial operation from the multiple initial operations of the first computation graph into which the second optimization operation has been inserted, according to the correspondence between a third optimization operation and the third initial operation, where the third initial operation belongs to the operations to be accelerated; and the processing device inserts the third optimization operation into the first computation graph containing the first optimization operation and the second optimization operation, where the inserted third optimization operation is an operation set before or after the third initial operation.
In this possible design, after the processing device inserts the first optimization operation and the second optimization operation into the first computation graph, it may determine the corresponding third initial operation from the first computation graph containing the first optimization operation and the second optimization operation according to the third optimization operation, ranked third in the optimization operation set, and insert the third optimization operation before, after, or at both ends of the third initial operation. In this embodiment, the first optimization operation and the second optimization operation may also contain the third initial operation, that is, the third optimization operation may also accelerate the first optimization operation and the second optimization operation, thereby improving the computation rate.
In one possible design, the first optimization operation comprises a Winograd operation, and the first initial operation comprises a matrix multiplication operation; the second optimization operation comprises a sparse operation, and the second initial operation comprises a matrix multiplication operation; the third optimization operation comprises a quantization operation, and the third initial operation comprises a matrix multiplication operation and a point-by-point addition operation.
In one possible design, the types and arrangement order of the optimization operations in the optimization operation set are preset.
In a possible design, after the processing device sends the second computation graph to the training device, the method further includes: the processing device sends the optimization operation corresponding to the operation to be accelerated to the training device.
In this possible design, the operation to be accelerated may be an operation that processes the weights in the computation graph, and the processing device may send the corresponding optimization operation to the training device, so that the training device performs the corresponding optimization operation on the weights of the retrained target neural network and stores them, thereby reducing memory usage.
A second aspect of the present application provides a computation graph processing method, including: a training device receives a second computation graph from a processing device, where the second computation graph is obtained by updating a first computation graph based on an optimization operation set, the optimization operation set includes multiple optimization operations, the multiple optimization operations are used to determine operations to be accelerated from multiple initial operations of the first computation graph and to update the first computation graph, and the first computation graph is the initial computation graph corresponding to a target neural network model; and the training device retrains the second computation graph to obtain the target neural network model.
In this second aspect, after receiving the second computation graph, the training device may obtain training data and an initial weight matrix from the storage device through the AXI bus to perform computation for retraining. The second computation graph is obtained by updating the first computation graph based on the optimization operation set, so compared with retraining separately for each optimization, only one retraining is needed, which reduces the computation amount and the training cost.
In a possible design, after the training device retrains the second computation graph, the method further includes: the training device receives, from the processing device, the optimization operation corresponding to the operation to be accelerated; the training device executes the optimization operation on the weights of the retrained target neural network model to generate optimized weights; and the training device stores the optimized weights in the storage device.
In this possible design, after the training device retrains the second computation graph, the obtained weights of the target neural network model may be saved; before saving, the weights may also be processed by the optimization operation, received from the processing device, that corresponds to the operation to be accelerated for processing weights, so as to reduce the memory occupied by the weights.
A third aspect of the embodiments of the present application provides a processing device, where the processing device has a function of implementing the method according to the first aspect or any one of the possible implementation manners of the first aspect. The function can be realized by hardware, and can also be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, such as: a first processing unit, a second processing unit and a third processing unit, which may also be implemented by one or two processing units.
A fourth aspect of the embodiments of the present application provides a training device having a function of implementing the method according to the second aspect or any one of the possible implementation manners of the second aspect. The function can be realized by hardware, and can also be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, such as: an acquisition unit, a first processing unit, and a second processing unit; the two processing units may also be implemented by one processing unit.
A fifth aspect of the embodiments of the present application provides a computer device, which includes at least one processor, a memory, an input/output (I/O) interface, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor executes the method according to the first aspect or any one of the possible implementation manners of the first aspect.
A sixth aspect of the embodiments of the present application provides a computer device, which includes at least one processor, a memory, an input/output (I/O) interface, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor executes the method according to any one of the possible implementation manners of the second aspect or the second aspect.
A seventh aspect of embodiments of the present application provides a computer-readable storage medium storing one or more computer-executable instructions, which, when executed by a processor, cause the processor to perform the method according to the first aspect or any one of the possible implementation manners of the first aspect.
An eighth aspect of embodiments of the present application provides a computer-readable storage medium storing one or more computer-executable instructions, which, when executed by a processor, cause the processor to perform the method according to the second aspect or any one of the possible implementation manners of the second aspect.
A ninth aspect of the embodiments of the present application provides a computer program product storing one or more computer-executable instructions, which, when executed by a processor, cause the processor to perform the method according to the first aspect or any one of the possible implementation manners of the first aspect.
A tenth aspect of the embodiments of the present application provides a computer program product storing one or more computer-executable instructions, which, when executed by a processor, cause the processor to perform the method according to the second aspect or any one of the possible implementation manners of the second aspect.
An eleventh aspect of an embodiment of the present application provides a chip system, where the chip system includes at least one processor, and the at least one processor is configured to implement the functions in the first aspect or any one of the possible implementation manners of the first aspect. In one possible design, the system-on-chip may further include a memory, the memory storing program instructions and data necessary for the processing device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
A twelfth aspect of the embodiments of the present application provides a chip system, where the chip system includes at least one processor, and the at least one processor is configured to implement the functions in the second aspect or any one of the possible implementation manners of the second aspect. In one possible design, the chip system may further include a memory, the memory being used to store program instructions and data necessary for the training device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence framework provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network training system provided in an embodiment of the present application;
FIG. 3 shows an embodiment of a computation graph updating method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of updating a first computation graph according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a sparse operation provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a quantization operation provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a processing device provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training device provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a computation graph updating method, a computation graph processing method and related equipment, which are used for reducing training cost.
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Some terms used in the embodiments of the present application are exemplarily described below:
Neural Network (NN): a neural network simulates the connections of human brain nerve cells: a large number of simple processing units (called neurons) are widely interconnected to form a complex network system. A simple neural network comprises three layers: an input layer, an output layer, and a hidden layer (also called an intermediate layer); each connection corresponds to a weight (whose value is called the weight, i.e., a parameter).
Operation (OP): a computation step in a neural network. OPs may be nested within an OP; for example, a convolution OP includes an OP converting the input four-dimensional tensor into a matrix, an OP converting the weight four-dimensional tensor into a matrix, an OP matrix-multiplying the input matrix and the weight matrix, and an OP converting the result of the matrix multiplication back into a four-dimensional tensor.
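As a hedged illustration of this nesting (not taken from the patent), the following numpy sketch builds a convolution OP from the four sub-OPs listed above; the shapes, square-input assumption, and helper names are assumptions.

```python
# Assumed-shape numpy sketch: a convolution OP nesting four sub-OPs.
import numpy as np

def im2col(x, k):                      # OP: input 4-D tensor -> matrix
    n, c, h, w = x.shape
    cols = [x[:, :, i:i + k, j:j + k].reshape(n, -1)
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(cols, axis=1)      # (n, positions, c*k*k)

def conv_op(x, weight):
    k = weight.shape[-1]
    w_mat = weight.reshape(weight.shape[0], -1).T  # OP: weight tensor -> matrix
    y = im2col(x, k) @ w_mat                       # OP: matrix multiplication
    n, _, h, _ = x.shape
    out = h - k + 1
    return y.transpose(0, 2, 1).reshape(n, -1, out, out)  # OP: matrix -> tensor

x = np.random.randn(1, 3, 6, 6).astype(np.float32)
w = np.random.randn(8, 3, 3, 3).astype(np.float32)
print(conv_op(x, w).shape)             # (1, 8, 4, 4)
```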
Computation graph: to facilitate analysis of the neural network structure, computation characteristics, and data flow, all OPs of a neural network are usually put together to form a computation graph.
Tensor (Tensor): tensors are a general multidimensional data expression form, and are a generalization of vectors and matrices. The scalar is a 0-dimensional tensor, the vector is a 1-dimensional tensor, and the matrix is a 2-dimensional tensor.
Forward propagation: in each layer of the neural network, the result computed from the input data and the weights is used as output and sent to the next layer, or used as the output of the whole neural network.
Back propagation (Backward propagation): in the process of training the weight of the neural network, the output of the neural network is obtained through forward propagation; comparing the output with the real output to obtain a loss value; finally, the gradient of the loss with respect to the weight is calculated.
The weights of a neural network need to be adjusted for the network to function correctly; this adjustment process is usually called "training". Training a neural network requires a large number of matching pairs $(x_i, y_i)$, $i = 1, 2, \ldots, N$, containing input data $x_i$ and true output data $y_i$, to guide the direction of weight adjustment; the set of these input-output pairs is called the training data set. When the input is $x$, the neural network propagates forward and its final output is $\hat{y}$. The loss used to evaluate the difference between the neural network output $\hat{y}$ and the true output $y$ is $l = \mathrm{Loss}(\hat{y}, y)$.
The steps for training a neural network are generally as follows:
1) Prepare data: select an input-output pair $(x, y)$ from the training data set;
2) Forward propagation: feed $x$ into the neural network to obtain the network output $\hat{y}$;
3) Compute the loss: use the true output $y$ and the network output $\hat{y}$ to compute the loss value $l = \mathrm{Loss}(\hat{y}, y)$;
4) Back propagation: compute the partial derivative $\partial l / \partial W$ of the loss value $l$ with respect to each layer's weight matrix $W$;
5) Update the weights: obtain new weight values from the original weights $W$ and the partial derivatives $\partial l / \partial W$;
6) Loop: if training is finished, output the trained weights; otherwise, return to step 1.
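A minimal sketch of these six steps for a single-layer linear network is given below; the squared-error loss and the plain gradient-descent update rule are assumptions for illustration, not part of the patent.

```python
# Illustrative training loop (assumed loss: l = 0.5 * ||y_hat - y||^2,
# assumed update rule: gradient descent with learning rate lr).
import numpy as np

def train(dataset, W, lr=0.01, epochs=20):
    for _ in range(epochs):
        for x, y in dataset:                       # 1) prepare data
            y_hat = x @ W                          # 2) forward propagation
            loss = 0.5 * np.sum((y_hat - y) ** 2)  # 3) compute the loss
            grad = np.outer(x, y_hat - y)          # 4) back propagation: dl/dW
            W = W - lr * grad                      # 5) update the weights
    return W                                       # 6) loop done: trained W

rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 2))
data = [(x, x @ W_true) for x in rng.normal(size=(32, 4))]
print(np.abs(train(data, rng.normal(size=(4, 2))) - W_true).max())
```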
It is through this chain rule that each OP in the neural network realizes the training of the weights. It is noted that if a certain OP needs to introduce other variables (such as those for quantization and sparsification, introduced later) and requires that these variables not be updated (as in Winograd convolution), the gradient of the loss with respect to these non-updated variables can be directly fixed to 0 in the backward pass.
Residual Network (ResNet): a type of convolutional neural network. In image recognition, its recognition accuracy is higher than that of traditional convolutional neural networks. A residual network is designed with multiple sub-modules of the same structure, and "ResNet" is usually followed by a number indicating how many main operations it contains; for example, ResNet50 contains 50 such operations.
First, the general workflow of an artificial intelligence system is described. Referring to fig. 1, fig. 1 shows a schematic structural diagram of an artificial intelligence main framework, which is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition onwards, for example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data - information - knowledge - wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technology for providing and processing information) up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data of
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
Decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sorting, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further general capabilities may be formed based on the results of the data processing, such as algorithms or a general system, for example, translation, analysis of text, computer vision processing, speech recognition, recognition of images, and so on.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decisions into products and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, and the like.
The embodiments of the present application may be applied to computation graph update design, and the present application may be applied to each subdivided field of artificial intelligence, for example, the image processing field, the computer vision field, the semantic analysis field, and so on. Specifically, referring to fig. 1, the data in the data set acquired by the infrastructure in the embodiments of the present application may be multiple pieces of data of different types (also referred to as training data; multiple pieces of training data form a training set) acquired by sensors such as cameras and radars, or may be multiple pieces of image data or video data, as long as the training set can be used for iterative training of the neural network and for implementing the computation graph update of the present application; the specific type of data in the training set is not limited here.
Hardware acceleration techniques used for neural network acceleration in the prior art include fixed-point quantization, sparsification, and Winograd convolution.
1) Fixed-point quantization: fixed-point quantization converts a 32-bit-wide floating-point number (FP32) into an integer number (INT) in a low-bit-width fixed-point format. This reduces the storage space, transmission bandwidth, and arithmetic unit complexity for the data, improving the energy efficiency and computation speed of the chip.
An example of fixed-point quantization of an FP32 number $x$ into an 8-bit integer $x_{\mathrm{INT}}$ (INT8) is as follows:
$$x \approx s_x \times (x_{\mathrm{INT}} + o_x), \quad x_{\mathrm{INT}}, o_x \in \mathbb{R}_{\mathrm{INT8}}$$
where $s_x$ is a quantization factor in FP32 format and $o_x$ is an offset in INT8 format. By extracting a common quantization factor $s_x$ and a common offset $o_x$ from multiple FP32 numbers $x$, all of these FP32 numbers can be quantized to INT8 numbers $x_{\mathrm{INT}}$. If 1024 FP32 numbers are quantized to INT8 format, the original 4096 bytes of storage (each FP32 floating-point number needs 4 bytes, so 1024 × 4 = 4096) can be reduced to 1029 bytes (1024 bytes of INT8 data, one 4-byte FP32 quantization factor $s_x$, and one 1-byte INT8 offset $o_x$).
If both the data matrix $X$ and the weight matrix $W$ are quantized from FP32 to INT8, the computation of one output value can be approximated as:
$$y = \sum_i x_i \cdot w_i \approx \sum_i s_x (x_{\mathrm{INT},i} + o_x) \cdot s_w (w_{\mathrm{INT},i} + o_w) = s_x s_w \sum_i (x_{\mathrm{INT},i} + o_x) \cdot (w_{\mathrm{INT},i} + o_w)$$
Before quantization (before the approximation sign), the multiplication operands are FP32, so the hardware multiplier must be an FP32 multiplier; after quantization (after the last equals sign), the multiplication operands inside the summation are all INT8, so the multiplier responsible for the bulk of the operations can be an INT8 multiplier. The area and power consumption of an INT8 multiplier are 2-3 orders of magnitude lower than those of an FP32 multiplier, so fixed-point quantization can improve the competitiveness of a chip.
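A small numpy sketch of per-tensor INT8 quantization in the form $x \approx s_x(x_{\mathrm{INT}} + o_x)$ is shown below; the rule for deriving $s_x$ and $o_x$ is one common choice and is an assumption, not the patent's exact rule (it also assumes roughly symmetric data so the derived offset fits INT8).

```python
# Assumed per-tensor INT8 quantization rule.
import numpy as np

def quantize_int8(x):
    s = np.float32((x.max() - x.min()) / 255.0)   # shared FP32 factor s_x
    o = np.int8(np.round(x.min() / s) + 128)      # shared INT8 offset o_x
    x_int = np.clip(np.round(x / s) - o, -128, 127).astype(np.int8)
    return x_int, s, o

def dequantize(x_int, s, o):
    return s * (x_int.astype(np.float32) + o)     # x ~= s_x * (x_INT + o_x)

x = np.random.uniform(-1, 1, 1024).astype(np.float32)
x_int, s, o = quantize_int8(x)
print(np.abs(dequantize(x_int, s, o) - x).max())  # error on the order of s_x
# Storage: 1024 INT8 bytes + 4-byte s_x + 1-byte o_x = 1029 bytes (vs 4096).
```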
2) Sparsification: the sparsification method artificially introduces a large number of 0 elements into the weight matrix and skips the multiplications and additions involving these 0 elements during matrix multiplication, which can significantly reduce the computation amount of the neural network and lower the chip power consumption. A specific example is given here. Assume the input data matrix and the weight matrix are:
$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}, \quad W = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix}$$
The matrix product $X^{T}W$ requires $2 \times 3 \times 3 = 18$ multiplications and $3 \times 3 = 9$ additions in total. Now, according to the importance of the weight elements in the weight matrix, $w_{12}$ and $w_{23}$ are artificially set to zero to obtain the sparse weight matrix:
$$W_{sp} = \begin{pmatrix} w_{11} & 0 & w_{13} \\ w_{21} & w_{22} & 0 \end{pmatrix}$$
If the computation of the zero-valued parts can be skipped, computing $X^{T}W_{sp}$ requires only 12 multiplications and 3 additions.
3) Winograd convolution: according to the Winograd algorithm proposed in the paper "Fast Algorithms for Convolutional Neural Networks", any convolution OP can reduce the number of multiplications through an equivalent substitution while keeping the output result unchanged. Take a convolution with a 3 × 3 kernel and stride 1 as an example:
1) The 4 × 4 data matrix X is transformed into a new 4 × 4 data matrix X' by the 4 × 4 data transformation matrix B.
2) The 3 × 3 weight matrix W is transformed into a new 4 × 4 weight matrix W' by the 4 × 3 parameter transformation matrix G.
3) The new data matrix and the new parameter matrix are multiplied element by element to obtain a 4 × 4 intermediate matrix.
4) The 4 × 4 intermediate matrix is transformed into a 2 × 2 result matrix by the 4 × 2 result transformation matrix A.
The three transformation matrices B, G, and A are fixed for a given combination of convolution kernel size and stride and can be derived by the Winograd algorithm described in the above paper. For a 3 × 3 kernel with stride 1, the transpose $B^{T}$ of matrix B, the matrix G, and the transpose $A^{T}$ of matrix A are respectively:
$$B^{T} = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad G = \begin{pmatrix} 1 & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}, \quad A^{T} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{pmatrix}$$
The original convolution needs 36 multiplications; after conversion to Winograd convolution, excluding the transformation overhead, only 16 multiplications are needed in total, a multiplication speed-up ratio of $36/16 = 2.25$ times. The three transformation matrices contain only the values $0$, $\pm 1$, and $\pm \frac{1}{2}$, and multiplication by these values can be accomplished by converting the sign bit and/or shifting the binary number, at low hardware cost.
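The sketch below implements the four Winograd steps for one 4 × 4 tile and checks the result against direct convolution, using the standard F(2×2, 3×3) transformation matrices from the cited paper:

```python
# Standard F(2x2, 3x3) Winograd transform, verified against direct convolution.
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd(x, w):                    # x: 4x4 data tile, w: 3x3 kernel
    x_t = B_T @ x @ B_T.T              # 1) transform the data matrix
    w_t = G @ w @ G.T                  # 2) transform the weight matrix
    m = x_t * w_t                      # 3) 16 element-wise multiplications
    return A_T @ m @ A_T.T             # 4) transform to the 2x2 result

def direct(x, w):                      # reference: 36 multiplications
    return np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(2)]
                     for i in range(2)])

x, w = np.random.randn(4, 4), np.random.randn(3, 3)
print(np.allclose(winograd(x, w), direct(x, w)))   # True
```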
Referring to fig. 2, fig. 2 shows a neural network training system architecture, in which a neural network processor 1 and a processor (CPU) 2 are located in the same System on Chip (SoC) and are interconnected with a storage unit (Memory) 3 through an Advanced eXtensible Interface (AXI) bus 4. In a neural network training scenario, the three interact as follows:
1) The CPU 2 modifies the computation graph, transmits the modified computation graph to the neural network processor 1 through the AXI bus 4, and instructs it to start training;
2) the neural network processor 1 acquires training data and an initial weight matrix from a storage unit (Memory)3 through an AXI bus 4 for calculation;
3) After the neural network processor 1 completes the computation, it writes the trained weight matrix back to the Memory 3 through the AXI bus 4 and notifies the CPU 2 that the computation is complete.
Modifying the computation graph may cause performance loss of the neural network. For example, fixed-point quantization and sparsification of the computation graph change the trained weight matrix, which degrades the performance of the neural network. To compensate for such loss, academia and industry further train the weights together with the quantization/sparsification-related parameters on the basis of quantization and sparsification, i.e., training-based quantization and sparsification. Quantization and sparsification each add computation steps on top of the original matrix multiplication, and since any computation step in a neural network can be defined as an OP comprising forward and backward computation, quantization and sparsification can be realized by adding OPs to the original computation graph. In the prior art, sparsification and quantization each require a separate retraining; if the acceleration effects of sparsification and quantization are to be obtained at the same time, two retrainings are needed, and the training cost is high. To reduce the training cost, the embodiments of the present application provide a corresponding computation graph updating method: a processing device receives a first computation graph and an optimization operation set, where the first computation graph is the initial computation graph corresponding to a target neural network model, the first computation graph includes multiple initial operations, and the optimization operation set includes multiple optimization operations; the processing device determines operations to be accelerated from the first computation graph according to the multiple optimization operations, and updates the first computation graph according to the operations to be accelerated and the multiple optimization operations to obtain a second computation graph; and the processing device sends the second computation graph to a training device, where the second computation graph is used by the training device to retrain the target neural network model. In this way, the computation graph obtains multiple acceleration effects with only one training, and the training cost can be reduced.
Next, based on the application scenario and the system architecture, a computation graph updating method in the embodiment of the present application is described with reference to fig. 3.
Referring to fig. 3, an embodiment of a computation graph updating method of the present application includes:
301. The terminal device sends the first computation graph and the optimization operation set to the processing device.
In this embodiment, the first computation graph is a graph structure used to analyze the network structure, computation characteristics, and data flow of the neural network model, and is the initial computation graph of the target neural network model to be finally obtained. The first computation graph may be sent to the processing device by the terminal device, may be obtained by the terminal device from the training device and then sent to the processing device, or may be sent to the processing device directly by the training device without passing through the terminal device.
The optimization operation set includes multiple optimization operations, such as quantization operations corresponding to fixed-point quantization, sparse operations, Winograd operations, and operations corresponding to future optimization methods. Optionally, all the optimization operations may be arranged and combined in advance and then input to the processing device through the terminal device, so that the processing device processes the first computation graph.
The processing device in this embodiment may correspond to a central processing unit (CPU) in the neural network architecture, or may be a graphics processing unit (GPU), and the training device may be a neural network processing unit (NPU).
302. The processing device determines an operation to be accelerated from the first computational graph.
In this embodiment, the first computation graph is a set of computation steps in the neural network model; that is, the first computation graph includes multiple initial operations (OPs), each corresponding to a computation step in the neural network model. The type of the operation to be accelerated may be preset or set in real time. For example, the type of the operation to be accelerated may be set, in advance or in real time through the terminal device, to be a convolution operation, which enables optimization of the weights to reduce the computation amount; or the operation to be accelerated may be set to be a point-by-point addition operation, which enables optimization of the input data.
303. The processing device updates the first computation graph according to the operation to be accelerated and the optimization operation set to obtain a second computation graph.
In this embodiment, a permutation order exists among multiple optimization operations in the optimization operation set, for example, the optimization operation set includes a first optimization operation, a second optimization operation, and a third optimization operation, the permutation order of the second optimization operation is located after the first optimization operation, the permutation order of the third optimization operation is located after the second optimization operation, the processing device selects the optimization operation from the optimization operation set according to the permutation order to update the first computation graph, and the updated first computation graph is the required second computation graph.
The operation to be accelerated includes a plurality of initial operations, a plurality of optimization operations in the set of optimization operations have a corresponding relationship with the operation to be accelerated, and the processing device may determine a first initial operation from the plurality of initial operations in the first computation graph according to the corresponding relationship between the first optimization operation and a first initial operation in the operation to be accelerated, and then insert the first optimization operation into the first computation graph, where the first optimization operation is a set of operations before and/or after the first initial operation, and the set of operations may include one or more operations.
After inserting the first optimization operation into the first computation graph, the processing device selects a second optimization operation according to the ranking order, determines a second initial operation from the multiple initial operations of the first computation graph according to a corresponding relationship between the second optimization operation and the second initial operation in the operations to be accelerated, and then inserts the second optimization operation into the first computation graph inserted into the first optimization operation, where the second optimization operation is an operation set before and/or after the second initial operation, optionally, the processing device may also insert the second optimization operation before and/or after the same operation as the second initial operation in the first optimization operation, which is not limited in this embodiment.
After the second optimization operation is inserted into the first computation graph, the processing device selects a third optimization operation according to the arrangement order, determines a third initial operation from the multiple initial operations in the first computation graph according to a corresponding relationship between the third optimization operation and the third initial operation in the operations to be accelerated, and then inserts the third optimization operation into the first computation graph inserted into the second optimization operation, where the third optimization operation is an operation set before and/or after the third initial operation.
In one example, the first optimization operation comprises a Winograd operation, and the first initial operation comprises a matrix multiplication operation; the second optimization operation comprises a sparse operation, and the second initial operation comprises a matrix multiplication operation; the third optimization operation comprises a quantization operation, and the third initial operation comprises a matrix multiplication operation and a point-by-point addition operation.
For a sparse operation, the processing device inserts the sparse operation before the matrix multiplication operation. For a quantization operation, the processing device inserts the quantization operation before the matrix multiplication operation and the point-by-point addition operation. For a Winograd operation, the Winograd operation may include two sub-operations, which the processing device inserts before and after the matrix multiplication operation, respectively. This embodiment takes the sparse operation and the quantization operation as examples.
In this embodiment, taking an optimization operation set containing only the sparse operation and the quantization operation as an example, as shown in fig. 4, the trained computation graph 41 received by the processing device includes input data 411, a first weight 412, a second weight 413, a first convolution operation 414, a second convolution operation 415, a point-by-point addition operation 416, a point-by-point subtraction operation 417, and output data 418. The dashed boxes in the figure show the sparse operation 42 and the quantization operation 43 inserted into the first computation graph.
As shown in fig. 5, the operation set corresponding to the sparse operation 42 includes multiple operations: an absolute value (abs) operation 421, a constant (constant) operation 422, a compare (greater) operation 423, a data type modification (cast) operation 424, and a multiplication (mul) operation 425. Absolute value operation 421: take the absolute value of every element in the weight Tensor. Constant operation 422: generate a Tensor containing only one scalar with value 0.1. Compare operation 423: compare every element of the Tensor with 0.1 (the constant Tensor) one by one to obtain a Boolean Tensor with the same size as the compared Tensor. Data type modification operation 424: change each element of the Boolean Tensor into a 32-bit single-precision floating-point (FP32) value of 0 or 1. Multiplication operation 425: multiply the 0-1 FP32 Tensor element-wise with the original weight Tensor, so that weights whose absolute value is less than 0.1 are set to 0 while other elements are unchanged.
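Rendered in numpy, the five sub-operations might look as follows; this is a sketch of the described OP set, not the patent's code:

```python
# numpy rendering of operations 421-425 applied to a weight Tensor W.
import numpy as np

W = np.random.randn(4, 4).astype(np.float32)

a = np.abs(W)                          # 421 abs: element-wise absolute value
c = np.float32(0.1)                    # 422 constant: scalar Tensor 0.1
g = a > c                              # 423 greater: Boolean Tensor, same size
f = g.astype(np.float32)               # 424 cast: Boolean -> FP32 0/1 Tensor
W_sp = f * W                           # 425 mul: |w| < 0.1 is set to 0
print(W_sp)
```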
As shown in fig. 6, which schematically illustrates the quantization operation, let the maximum and minimum values in the Tensor be $M$ and $m$ respectively, and set the quantization factor in fixed-point quantization to
$$s_x = \frac{M - m}{2^b - 1}$$
with offset $o_x$; each value $x$ in the Tensor is approximately quantized as
$$x \approx s_x \cdot \mathrm{round}\!\left(\frac{x}{s_x}\right)$$
where $\mathrm{round}(z)$ denotes taking the integer nearest to $z$, and $b$ in $s_x$ is the quantization bit width (here chosen as 8 bits). It follows that the operation set of the quantization operation 43 includes multiple operations: a maximum (reduce max) operation 431, a minimum (reduce min) operation 432, a subtraction (Sub) operation 433, a division (Div) operation 434, a rounding (Round) operation 435, and a multiplication (Mul) operation 436. Maximum operation 431: input a Tensor and output its maximum value. Minimum operation 432: input a Tensor and output its minimum value. Subtraction operation 433: subtract two Tensors element by element; here the two scalars $M$ and $m$ are subtracted. Division operation 434: divide two Tensors element by element; here it is used both to compute the quantization factor $s_x$ as a division of two scalars and to compute $x / s_x$ by dividing each element of the Tensor by a scalar. Rounding operation 435: round the elements of the Tensor to integers one by one. Multiplication operation 436: multiply two Tensors element by element; here it computes $s_x \cdot \mathrm{round}(x / s_x)$ by multiplying each element of the Tensor by a scalar. Constant operation 437: generate a Tensor containing only one scalar.
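A numpy sketch of this quantization OP set is given below; the divisor $2^b - 1$ used for the factor $s_x$ is assumed from the definition above:

```python
# numpy rendering of operations 431-436 with assumed bit width b = 8.
import numpy as np

t = np.random.randn(4, 4).astype(np.float32)

M = t.max()                            # 431 reduce_max: scalar maximum
m = t.min()                            # 432 reduce_min: scalar minimum
r = M - m                              # 433 sub: two scalars subtract
s_x = r / np.float32(2 ** 8 - 1)       # 434 div (scalar / scalar): factor s_x
q = np.round(t / s_x)                  # 434 div (tensor / scalar) + 435 round
t_q = q * s_x                          # 436 mul: tensor * scalar
print(np.abs(t_q - t).max())           # error bounded by s_x / 2
```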
304. The processing device sends the second computation graph to the training device.
After the processing device has inserted optimization operations for all operations to be accelerated in the first computation graph, the update of the first computation graph is complete. The processing device can send the second computation graph to the training device through a bus protocol between the two devices; correspondingly, the training device receives the second computation graph through the bus protocol.
305. The training device retrains the second computation graph.
After receiving the second computation graph, the training device may obtain training data and an initial weight matrix from the storage device through the AXI bus to perform computation, so as to carry out retraining. After retraining on the second computation graph is complete, the training device sends a computation completion notification to the processing device through the AXI bus.
306. The processing device sends the optimization operation to the training device.
After sending the second computation graph to the training device, the processing device may wait for a computation completion notification from the training device; upon receiving the notification, it may send the optimization operations corresponding to the weights to the training device. For example, for the sparse operation and the quantization operation corresponding to the convolution operations in fig. 4, the processing device intercepts from the second computation graph a subgraph containing only the sparse operation and the quantization operation, and then sends the subgraph to the training device.
In another example, step 306 may also be performed before step 305. In this case, the training device does not need to send a retraining completion notification to the processing device after retraining is complete, and the processing device directly sends the optimization operation corresponding to the weights to the training device after sending the second computation graph.
307. The training device executes the optimization operation on the weights of the retrained second computation graph.
Training the initial weights on the second computation graph, which contains both the sparse operation and the quantization operation, yields new weights. These new weight tensors are still non-sparse and unquantized; to obtain the acceleration effect when the neural network is deployed, they need to be processed into sparse, quantized weights and stored. The training device may receive the optimization operations from the processing device and select from them the optimization operations for the weights, or it may receive a subgraph, sent by the processing device, containing only the weight-related sparse and quantization operations; the retrained weights are then fed into this subgraph to obtain sparse, quantized optimized weights.
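A hedged sketch of this step is shown below: the retrained dense FP32 weights pass through stand-in sparse and quantization functions before being stored; the threshold 0.1, bit width 8, and file names are assumptions for illustration.

```python
# Stand-ins for the sparse + quantization subgraph applied to retrained
# weights before storage.
import numpy as np

def sparse_op(W):                      # zero small-magnitude weights
    return W * (np.abs(W) > 0.1).astype(np.float32)

def quant_op(W, bits=8):               # per-tensor fixed-point quantization
    s_w = np.float32((W.max() - W.min()) / (2 ** bits - 1))
    W_int = np.clip(np.round(W / s_w), -128, 127).astype(np.int8)
    return W_int, s_w

W_retrained = np.random.randn(64, 64).astype(np.float32)
W_opt, s_w = quant_op(sparse_op(W_retrained))
np.save("weights_optimized.npy", W_opt)    # store optimized INT8 weights
np.save("quant_factor.npy", s_w)           # keep s_w for deployment
```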
308. The training device stores the optimized weight in the storage device.
After the training device obtains the sparse, quantized weights, the optimized weights may be saved in a storage device. The storage device may be the memory in the neural network architecture. In another example, other quantities of the sparsification or quantization operation, such as the quantization factor $s_w$, may also be retained in the storage device at the same time. After the optimized weights are stored, the training device sends a computation completion notification to the processing device through the AXI bus.
For a typical image classification neural network that has already been trained, after the computation graph is updated and retrained using the method of the embodiments of the present application, the classification accuracy changes as shown in Table 1 below:
TABLE 1
[Table content not recoverable from the source; it reported the classification accuracy changes after retraining with the updated computation graph.]
As can be seen from Table 1, by updating the computation graph and retraining according to the embodiments of the present application, the degradation of the neural network effect found in the prior art can be reduced for some sets of optimization methods.
In the embodiments of the application, the processing device updates the computation graph according to the list of optimization methods and their correspondence to the operations to be accelerated in the computation graph, and the training device only needs to retrain once, so the training cost is reduced.
Furthermore, the training device processes the weights of the retrained computation graph in sequence according to the corresponding optimization operations, obtains the converted weights, and stores them. When the neural network model is deployed, only the converted weight matrix needs to be stored, read, and used for forward computation, without converting the weights again after reading them, which improves the processing speed.
Fig. 7 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application, and as shown in fig. 7, a processing apparatus 700 includes: a receiving unit 701 configured to perform step 301, an updating unit 702 configured to perform step 302 and step 303, and a sending unit 703 configured to perform step 304.
The processing device 700 corresponds to the processing device in the method embodiment shown in fig. 3, and each module and the other operations and/or functions in the processing device 700 are respectively for implementing various steps and methods implemented by the processing device in the method embodiment shown in fig. 3, and for details, reference may be made to the method shown in fig. 3, and details are not described herein again for brevity.
When the processing device 700 performs the above functions, the division into the functional modules described above is merely illustrative. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the processing device 700 may be divided into different functional modules to complete all or part of the functions described above. In addition, the processing device 700 provided in the above embodiment and the method shown in Fig. 3 belong to the same concept; for its specific implementation, refer to the method shown in Fig. 3, which is not repeated here.
Fig. 8 is a schematic structural diagram of a training device according to an embodiment of the present application. As shown in Fig. 8, the training device 800 includes: a receiving unit 801 configured to perform step 304, a retraining unit 802 configured to perform step 305, and a processing unit 803 configured to perform step 306 and step 308.
The training device 800 corresponds to the training device in the method embodiment shown in Fig. 3. The modules of the training device 800, together with the other operations and/or functions above, implement the steps and methods performed by the training device in that embodiment. For details, refer to the method shown in Fig. 3; they are not repeated here for brevity.
When the training device 800 performs the above functions, the division into the functional modules described above is merely illustrative. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the training device 800 may be divided into different functional modules to complete all or part of the functions described above. In addition, the training device 800 provided in the above embodiment and the method shown in Fig. 3 belong to the same concept; for its specific implementation, refer to the method shown in Fig. 3, which is not repeated here.
Fig. 9 is a schematic structural diagram of a computer device 900 according to an exemplary embodiment of the present application. The computer device 900 may be implemented by a general bus architecture.
The computer device 900 includes at least one processor 901, a communication bus 902, a memory 903, and at least one communication interface 904.
The processor 901 may be a general-purpose central processing unit (CPU), a network processor (NP), or a microprocessor, or may be one or more integrated circuits for implementing the solutions of the present application, such as an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The communication bus 902 is used to transfer information between the above components. The communication bus 902 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The memory 903 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 903 may be separate and connected to the processor 901 through the communication bus 902, or may be integrated with the processor 901.
The communication interface 904 uses any transceiver-type apparatus to communicate with other devices or a communication network. The communication interface 904 includes a wired communication interface and may further include a wireless communication interface. The wired communication interface may be, for example, an Ethernet interface; the Ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
In a specific implementation, in an embodiment, the processor 901 may include one or more CPUs, such as CPU0 and CPU1 shown in Fig. 9.
In a specific implementation, in an embodiment, the computer device 900 may include a plurality of processors, such as the processor 901 and the processor 905 shown in Fig. 9. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
In a specific implementation, in an embodiment, the computer device 900 may further include an output device 906 and an input device 907. The output device 906 communicates with the processor 901 and may display information in a variety of ways. For example, the output device 906 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode-ray tube (CRT) display device, or a projector. The input device 907 communicates with the processor 901 and may receive user input in a variety of ways. For example, the input device 907 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
In some embodiments, the memory 903 is configured to store the program code 910 for executing the solutions of the present application, and the processor 901 may execute the program code 910 stored in the memory 903. That is, the computer device 900 may implement the method shown in Fig. 3 through the processor 901 and the program code 910 in the memory 903.
The computer device 900 of this embodiment may correspond to the computer device in the above method embodiments, and the processor 901, the communication interface 904, and the like in the computer device 900 may implement the functions of, and/or the steps and methods performed by, the computer device in the above method embodiments. For brevity, details are not repeated here.
The receiving unit 701 and the sending unit 703 in the processing device 700 correspond to the communication interface 904 in the computer device 900; the updating unit 702 in the processing device 700 may correspond to the processor 901 in the computer device 900.
The receiving unit 801 in the training device 800 corresponds to the communication interface 904 in the computer device 900; the retraining unit 802 and the processing unit 803 in the training device 800 may correspond to the processor 901 in the computer device 900.
In another embodiment of the present application, a computer-readable storage medium is further provided, storing computer-executable instructions. When a processor of a device executes the computer-executable instructions, the device performs the steps of the computation graph updating and computation graph processing methods performed by the processor in Fig. 3.
In another embodiment of the present application, a computer program product is further provided, including computer-executable instructions stored in a computer-readable storage medium. When a processor of a device executes the computer-executable instructions, the device performs the steps of the computation graph updating and computation graph processing methods performed by the processor in Fig. 3.
In another embodiment of the present application, a chip system is further provided. The chip system includes at least one processor configured to support the device in implementing the steps of the computation graph updating and computation graph processing methods performed by the processor in Fig. 3. In a possible design, the chip system may further include a memory configured to store the program instructions and data necessary for the device. The chip system may consist of a chip, or may include a chip and other discrete components.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (21)

1. A computation graph updating method, comprising:
a processing device receives a first computation graph and an optimization operation set, wherein the first computation graph is an initial computation graph corresponding to a target neural network model, the first computation graph comprises a plurality of initial operations, and the optimization operation set comprises a plurality of optimization operations;
the processing device determines an operation to be accelerated from the first computation graph according to the plurality of optimization operations, and updates the first computation graph according to the operation to be accelerated and the plurality of optimization operations to obtain a second computation graph; and
the processing device sends the second computation graph to a training device, wherein the second computation graph is used by the training device to retrain the target neural network model.
2. The computation graph updating method according to claim 1, wherein the processing device determining an operation to be accelerated from the first computation graph according to the plurality of optimization operations and updating the first computation graph according to the operation to be accelerated and the plurality of optimization operations comprises:
the processing device determines a first initial operation from the plurality of initial operations according to a correspondence between a first optimization operation and the first initial operation, wherein the first initial operation is comprised in the operation to be accelerated;
the processing device inserts the first optimization operation into the first computation graph, wherein the inserted first optimization operation is an operation set located before or after the first initial operation, and the operation set comprises one or more operations;
the processing device determines a second initial operation, according to a correspondence between a second optimization operation and the second initial operation, from the plurality of initial operations of the first computation graph into which the first optimization operation has been inserted, wherein the second initial operation is comprised in the operation to be accelerated, the second optimization operation is adjacent to the first optimization operation in the optimization operation set, and the second optimization operation is arranged after the first optimization operation; and
the processing device inserts the second optimization operation into the first computation graph comprising the first optimization operation, wherein the inserted second optimization operation is an operation set located before or after the second initial operation.
3. The computation graph updating method according to claim 2, further comprising:
the processing device determines a third initial operation, according to a correspondence between a third optimization operation and the third initial operation, from the plurality of initial operations of the first computation graph into which the second optimization operation has been inserted, wherein the third initial operation is comprised in the operation to be accelerated, the third optimization operation is adjacent to the second optimization operation in the optimization operation set, and the third optimization operation is arranged after the second optimization operation; and
the processing device inserts the third optimization operation into the first computation graph comprising the first optimization operation and the second optimization operation, wherein the inserted third optimization operation is an operation set located before or after the third initial operation.
4. The computation graph updating method according to claim 3, wherein
the first optimization operation comprises a Winograd transform operation, and the first initial operation comprises a matrix multiplication operation;
the second optimization operation comprises a sparse operation, and the second initial operation comprises the matrix multiplication operation; and
the third optimization operation comprises a quantization operation, and the third initial operation comprises the matrix multiplication operation and a pointwise addition operation.
5. The computation graph updating method according to any one of claims 1 to 3, wherein the types and the arrangement order of the optimization operations in the optimization operation set are preset.
6. The computation graph updating method according to any one of claims 1 to 4, wherein after the processing device sends the second computation graph to the training device, the method further comprises:
the processing device sends the optimization operation corresponding to the operation to be accelerated to the training device.
7. A computation graph processing method, comprising:
a training device receives a second computation graph from a processing device, wherein the second computation graph is obtained by updating a first computation graph based on an optimization operation set, the optimization operation set comprises a plurality of optimization operations, the plurality of optimization operations are used to determine an operation to be accelerated from a plurality of initial operations of the first computation graph and to update the first computation graph, and the first computation graph is an initial computation graph corresponding to a target neural network model; and
the training device retrains the second computation graph to obtain the target neural network model.
8. The computation graph processing method according to claim 7, wherein after the training device retrains the second computation graph, the method further comprises:
the training device receives, from the processing device, an optimization operation corresponding to the operation to be accelerated;
the training device performs the optimization operation on the weights of the retrained target neural network model to generate optimized weights; and
the training device stores the optimized weights in a storage device.
9. A processing device, comprising:
a receiving unit, configured to receive a first computation graph and an optimization operation set, wherein the first computation graph is an initial computation graph corresponding to a target neural network model, the first computation graph comprises a plurality of initial operations, and the optimization operation set comprises a plurality of optimization operations;
an updating unit, configured to determine an operation to be accelerated from the first computation graph according to the plurality of optimization operations, and update the first computation graph according to the operation to be accelerated and the plurality of optimization operations to obtain a second computation graph; and
a sending unit, configured to send the second computation graph to a training device, wherein the second computation graph is used by the training device to retrain the target neural network model.
10. The processing device according to claim 9, wherein the updating unit is configured to:
determine a first initial operation from the plurality of initial operations according to a correspondence between a first optimization operation and the first initial operation, wherein the first initial operation is comprised in the operation to be accelerated;
insert the first optimization operation into the first computation graph, wherein the inserted first optimization operation is an operation located before or after the first initial operation;
determine a second initial operation, according to a correspondence between a second optimization operation and the second initial operation, from the plurality of initial operations of the first computation graph into which the first optimization operation has been inserted, wherein the second initial operation is comprised in the operation to be accelerated, the second optimization operation is adjacent to the first optimization operation in the optimization operation set, and the second optimization operation is arranged after the first optimization operation; and
insert the second optimization operation into the first computation graph comprising the first optimization operation, wherein the inserted second optimization operation is an operation located before or after the second initial operation.
11. The processing device according to claim 10, wherein the updating unit is further configured to:
determine a third initial operation, according to a correspondence between a third optimization operation and the third initial operation, from the plurality of initial operations of the first computation graph into which the second optimization operation has been inserted, wherein the third initial operation is comprised in the operation to be accelerated, the third optimization operation is adjacent to the second optimization operation in the optimization operation set, and the third optimization operation is arranged after the second optimization operation; and
insert the third optimization operation into the first computation graph comprising the first optimization operation and the second optimization operation, wherein the inserted third optimization operation is an operation located before or after the third initial operation.
12. The processing device according to claim 11, wherein
the first optimization operation comprises a Winograd transform operation, and the first initial operation comprises a matrix multiplication operation;
the second optimization operation comprises a sparse operation, and the second initial operation comprises the matrix multiplication operation; and
the third optimization operation comprises a quantization operation, and the third initial operation comprises the matrix multiplication operation and a pointwise addition operation.
13. The processing device according to any one of claims 9 to 11, wherein the types and the arrangement order of the optimization operations in the optimization operation set are preset.
14. The processing device according to any one of claims 9 to 12, wherein the sending unit is further configured to:
send the optimization operation corresponding to the operation to be accelerated to the training device.
15. A training device, comprising:
a receiving unit, configured to receive a second computation graph from a processing device, wherein the second computation graph is obtained by updating a first computation graph based on an optimization operation set, the optimization operation set comprises a plurality of optimization operations, the plurality of optimization operations are used to determine an operation to be accelerated from a plurality of initial operations of the first computation graph and to update the first computation graph, and the first computation graph is an initial computation graph corresponding to a target neural network model; and
a retraining unit, configured to retrain the second computation graph to obtain the target neural network model.
16. The training device according to claim 15, further comprising a processing unit, configured to:
receive, from the processing device, an optimization operation corresponding to the operation to be accelerated;
perform the optimization operation on the weights of the retrained target neural network model to generate optimized weights; and
store the optimized weights in a storage device.
17. A processing device, comprising: a processor, a memory, and a communication interface, wherein
the processor is configured to execute the instructions stored in the memory, to cause the processing device to perform the method according to any one of claims 1 to 6.
18. A training device, comprising: a processor, a memory, and a communication interface, wherein
the processor is configured to execute the instructions stored in the memory, to cause the training device to perform the method according to any one of claims 7 to 8.
19. A computer-readable storage medium, storing a computer program, wherein when the computer program is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 8.
20. A computer program product, wherein when the computer program product is run on a computer, the computer performs the method according to any one of claims 1 to 8.
21. A chip system, comprising at least one processor configured to perform the method of any of claims 1-8.
CN202011606177.8A 2020-12-28 2020-12-28 Computation graph updating method, computation graph processing method and related equipment Pending CN114692861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011606177.8A CN114692861A (en) 2020-12-28 2020-12-28 Computation graph updating method, computation graph processing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011606177.8A CN114692861A (en) 2020-12-28 2020-12-28 Computation graph updating method, computation graph processing method and related equipment

Publications (1)

Publication Number Publication Date
CN114692861A true CN114692861A (en) 2022-07-01

Family

ID=82132860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011606177.8A Pending CN114692861A (en) 2020-12-28 2020-12-28 Computation graph updating method, computation graph processing method and related equipment

Country Status (1)

Country Link
CN (1) CN114692861A (en)

Similar Documents

Publication Publication Date Title
US20230267319A1 (en) Training neural network accelerators using mixed precision data formats
US11645493B2 (en) Flow for quantized neural networks
CN113449857B (en) Data processing method and data processing equipment
US11586883B2 (en) Residual quantization for neural networks
US20190340499A1 (en) Quantization for dnn accelerators
EP3906616B1 (en) Neural network activation compression with outlier block floating-point
WO2022057776A1 (en) Model compression method and apparatus
WO2022068627A1 (en) Data processing method and related device
EP4145351A1 (en) Neural network construction method and system
EP4163831A1 (en) Neural network distillation method and device
WO2022111617A1 (en) Model training method and apparatus
WO2022156561A1 (en) Method and device for natural language processing
WO2022001724A1 (en) Data processing method and device
CN113505883A (en) Neural network training method and device
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN115081616A (en) Data denoising method and related equipment
CN112528108A (en) Model training system, gradient aggregation method and device in model training
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus
US20230037227A1 (en) Dual exponent bounding box floating-point processor
WO2023122854A1 (en) Data processing method and apparatus
Gonçalves et al. Exploring data size to run convolutional neural networks in low density fpgas
CN114692861A (en) Computation graph updating method, computation graph processing method and related equipment
US20230385642A1 (en) Model training method and apparatus
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
WO2022228060A1 (en) Data processing method, apparatus, and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination