WO2023125628A1 - Neural network model optimization method and apparatus, and computing device - Google Patents

Neural network model optimization method and apparatus, and computing device

Info

Publication number
WO2023125628A1
Authority
WO
WIPO (PCT)
Prior art keywords
subgraph
sub-graph
neural network
network model
Prior art date
Application number
PCT/CN2022/142689
Other languages
French (fr)
Chinese (zh)
Inventor
袁熙昊
林菁
严一超
王兵
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023125628A1 publication Critical patent/WO2023125628A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a neural network model optimization method, apparatus and computing device.
  • Artificial intelligence (AI) is a theory, method, technology and application system that uses computers to simulate and extend human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain results. Artificial intelligence technology is widely used in machine learning (ML), natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory. Processing data based on neural network models to implement application functions such as recognition is a key technology for artificial intelligence applications.
  • the cloud-side device can use the training set to train the neural network model, so that the neural network model has application functions such as recognition, and deploy the neural network model to at least one terminal (such as: smart phones, cameras, self-driving cars, etc.).
  • the terminal uses the configured neural network model to process the acquired application data (such as: image, voice, etc.) to realize application functions such as recognition.
  • neural network models are showing a trend toward more complex structures and more parameters, which means that the computing resources required by a neural network model to process data keep increasing and the time needed to process data keeps getting longer.
  • the present application provides a neural network model optimization method, apparatus and computing device, which shorten the time a neural network model takes to process data while ensuring the accuracy with which the neural network model processes data.
  • in a first aspect, a method for optimizing a neural network model is provided, and the method is executed by a computing device.
  • the method includes: the computing device obtains the neural network model to be optimized, searches the subgraph set for an equivalent subgraph of a first subgraph in the neural network model to be optimized, and replaces the first subgraph in the neural network model to be optimized with the equivalent subgraph.
  • the equivalent subgraph and the first subgraph have the same output for the same input data, and the processing efficiency of the equivalent subgraph on the input data is greater than that of the first subgraph on the input data, and the subgraph set includes multiple subgraphs.
  • the neural network model optimization method provided by the embodiment of the present application is especially applicable here, as it can effectively improve the inference speed of the neural network model, shorten the inference time of the neural network model, and improve the user experience.
  • searching the subgraph set for an equivalent subgraph of the first subgraph in the neural network model to be optimized includes: determining, in the subgraph set, a second subgraph corresponding to the first subgraph, where the second subgraph and the first subgraph produce the same output for the same input data; determining that the computing resource used in the computing device to execute the first subgraph achieves higher data processing efficiency when executing the second subgraph than when executing the first subgraph; and treating the second subgraph as the equivalent subgraph.
  • determining the second subgraph corresponding to the first subgraph in the subgraph set includes: inputting input data into the first subgraph, running the first subgraph with the computing resource, and outputting a running result; inputting the same input data into at least one subgraph in the subgraph set, and determining the subgraph whose result is identical to that running result as the second subgraph.
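  • a minimal Python sketch of this output-comparison step, assuming subgraphs are exposed as callables that map a numpy array to a numpy array (function names and signatures are illustrative, not part of the application):

```python
import numpy as np

def same_output(subgraph_a, subgraph_b, input_shape, trials=5, tol=1e-5):
    # Run both subgraphs on the same random input data and compare the results.
    rng = np.random.default_rng(0)
    for _ in range(trials):
        x = rng.standard_normal(input_shape).astype(np.float32)
        if not np.allclose(subgraph_a(x), subgraph_b(x), atol=tol):
            return False
    return True

def find_second_subgraph(first_subgraph, subgraph_set, input_shape):
    # Scan the subgraph set for a candidate whose running result matches
    # the running result of the first subgraph.
    for candidate in subgraph_set:
        if same_output(first_subgraph, candidate, input_shape):
            return candidate
    return None
```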
  • the method further includes: recording a mapping relationship between the first subgraph and the second subgraph to the subgraph set.
  • the subgraph set includes a first mapping relationship between the first subgraph and the second subgraph; determining the second subgraph corresponding to the first subgraph in the subgraph set includes: determining the second subgraph corresponding to the first subgraph according to the first mapping relationship.
  • the subgraphs with equivalent relationships indicated by the subgraph set are matched with the subgraphs in the neural network model to be optimized.
  • the neural network model to be optimized includes multiple operators, multiple operators form multiple subgraphs, and at least two operators form a subgraph. If the computing device determines the first subgraph to be replaced in the neural network model to be optimized, replace the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set to obtain an optimized neural network model.
  • the optimized neural network model includes a second subgraph. When processing data based on computing resources, the duration of the optimized neural network model is shorter than the duration of the neural network model to be optimized. Among them, subgraphs with equivalence relations are used to output the same result according to the same input data.
  • the first subgraph in the neural network model to be optimized that is the same as the subgraph included in the subgraph set is determined according to the subgraph features.
  • Subgraph features include operator type, subgraph structure and operator parameters.
  • subgraph matching is performed according to subgraph features, and a subgraph identical to a subgraph included in the subgraph set is searched out from the neural network model to be optimized, which can effectively improve the accuracy of the search.
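  • as an illustration, feature-based matching can be sketched as follows, under the assumption that each subgraph object exposes its operators, connections and operator parameters (attribute names such as ops, edges and params are hypothetical):

```python
def subgraph_features(subgraph):
    # The three features used for matching: operator types, subgraph
    # structure (connections between operators) and operator parameters.
    return (
        tuple(op.type for op in subgraph.ops),
        tuple(subgraph.edges),
        tuple(tuple(sorted(op.params.items())) for op in subgraph.ops),
    )

def find_subgraphs_to_replace(model_subgraphs, subgraph_set):
    # A model subgraph is a replacement candidate only if all three features
    # match a subgraph recorded in the subgraph set.
    known = {subgraph_features(s) for s in subgraph_set}
    return [s for s in model_subgraphs if subgraph_features(s) in known]
```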
  • before determining the first subgraph to be replaced in the neural network model to be optimized according to the subgraph set, the method further includes: obtaining an operator set according to the neural network models of multiple application scenarios; and searching for subgraphs with an equivalence relationship according to the operator set to generate the subgraph set. This provides a way of automatically searching for equivalent subgraphs: possible equivalent subgraphs are searched for automatically based on the operator set, without omission, which saves manpower.
  • the subgraph in the neural network model to be optimized is replaced based on the equivalent subgraph. Because the replacement subgraph in the optimized neural network model is well matched to the computing resource that processes data with the optimized neural network model, that is, computing the replacement subgraph allows the computing power of the computing resource to be used effectively, the processing time of the neural network model is significantly shortened while the accuracy of the data processed by the neural network model is not lost.
  • This method can automatically optimize the neural network model. It is simple, intuitive, efficient, and highly scalable. It only needs to input the neural network model to quickly complete the optimization of the neural network model. It does not require any data and has no loss of accuracy. It is applicable to a wide range of scenarios.
  • the neural network model optimization method provided by the embodiment of the present application is especially applicable here, as it can effectively improve the inference speed of the neural network model, shorten the inference time of the neural network model, and improve the user experience.
  • the optimized neural network model can be deployed to at least one terminal, so that when the terminal processes application data according to the optimized neural network model, the data processing time is shortened and the terminal data processing performance is improved.
  • replacing the first subgraph with the second subgraph equivalent to the first subgraph in the subgraph set includes: if the computing power of the computing resource when computing the second subgraph is higher than the computing power of the computing resource when computing the first subgraph, replacing the first subgraph with the second subgraph.
  • the computing power of the computing resource computing subgraph is related to the duration of the computing resource computing subgraph.
  • the computing power may be the data processing efficiency of computing resources for computing the first subgraph.
  • determining that the computing resource used in the computing device to execute the first subgraph has higher data processing efficiency when executing the second subgraph than when executing the first subgraph includes: the computing resource calls a cost function to run the first subgraph and records a first data processing efficiency; the computing resource calls the cost function to run the second subgraph and records a second data processing efficiency; and it is determined, by comparing the first data processing efficiency with the second data processing efficiency, that the data processing efficiency of executing the second subgraph is higher than the data processing efficiency of executing the first subgraph.
  • replacing the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set includes: determining the duration of the second subgraph and the duration of the first subgraph respectively by using a cost function on the computing resource, where the cost function is used to calculate the durations of subgraphs with an equivalence relationship on the same computing resource; and if the duration of the second subgraph is less than the duration of the first subgraph, replacing the first subgraph with the second subgraph.
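  • the replace-if-faster rule can be sketched as below; cost_fn stands in for the cost function described above and is assumed to return an estimated duration (all names are placeholders):

```python
def should_replace(first_subgraph, second_subgraph, cost_fn, computing_resource):
    # cost_fn is assumed to return the estimated duration of running a
    # subgraph on the given computing resource.
    duration_first = cost_fn(first_subgraph, computing_resource)
    duration_second = cost_fn(second_subgraph, computing_resource)
    # Replace only if the second (equivalent) subgraph is faster.
    return duration_second < duration_first
```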
  • the influence of subgraph replacement on the inference performance of the neural network model can be measured, and whether to perform subgraph replacement can be decided automatically, improving the accuracy of subgraph replacement and ensuring that subgraph replacement on the current hardware platform brings a performance improvement.
  • the current hardware platform may refer to computing resources for processing data using an optimized neural network model.
  • the method further includes: recording the mapping relationship of the computing resource, the first sub-graph, and the second sub-graph to the sub-graph set.
  • determining that the data processing efficiency of executing the second subgraph is higher than that of executing the first subgraph includes: determining, according to the second mapping relationship, that the computing resource used to execute the first subgraph has higher data processing efficiency when executing the second subgraph than when executing the first subgraph.
  • the subgraph set is also used to indicate the computing power correspondence between the computing resource computing the second subgraph and the computing resource computing the first subgraph.
  • replacing the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set includes: replacing the first subgraph with the second subgraph determined according to the computing power correspondence. Since the computing power correspondence already indicates the affinity between equivalent subgraphs and the computing resource, performing subgraph replacement of the neural network model to be optimized based on the computing power correspondence can effectively improve the speed of subgraph replacement and reduce the time it takes.
  • the computing device can also automatically derive the computing power correspondence of the equivalent subgraphs adapted to the current hardware platform. Any neural network model that hits an equivalent subgraph in the computing power correspondence can, according to the method provided by the embodiment of the present application, obtain an inference performance improvement on the corresponding hardware platform, and the computing power correspondence can also guide the structural design of the next generation of hardware-friendly neural network models.
  • the computing device updates the operator set, that is, adds a new operator to the operator set, performs an equivalent subgraph search according to the updated operator set, and obtains an updated equivalent subgraph.
  • in a second aspect, a neural network model optimization apparatus is provided, which includes modules for executing the neural network model optimization method in the first aspect or any possible design of the first aspect.
  • a third aspect provides a processor, the processor is configured to execute the operation steps of the neural network model optimization method in the first aspect or any possible design of the first aspect.
  • in a fourth aspect, a computing device is provided, which includes at least one processor and a memory, where the memory is used to store a set of computer instructions; when the processor executes the set of computer instructions, the computing device executes the operation steps of the neural network model optimization method in the first aspect or any possible implementation of the first aspect.
  • in a fifth aspect, a computer-readable storage medium is provided, which includes computer software instructions; when the computer software instructions are run in a computing device, the computing device is made to execute the operation steps of the method described in the first aspect or any possible implementation of the first aspect.
  • in a sixth aspect, a computer program product is provided; when the computer program product runs on a computing device, the computing device executes the operation steps of the method described in the first aspect or any possible implementation of the first aspect.
  • in a seventh aspect, a chip system is provided, which includes a processor configured to implement the functions of the processor in the method of the first aspect above.
  • the chip system further includes a memory for storing program instructions and/or data.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of a neural network provided by the present application.
  • Fig. 2 is a schematic structural diagram of a convolutional neural network provided by the present application.
  • FIG. 3 is a schematic diagram of a system architecture provided by the present application.
  • FIG. 4 is a schematic diagram of a method for generating a subgraph set provided by the present application.
  • FIG. 5 is a schematic diagram of generating an operator set provided by the present application.
  • FIG. 6 is a schematic diagram of generating an equivalent subgraph relationship provided by the present application.
  • FIG. 7 is a schematic diagram of a neural network model optimization method provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of sub-graph replacement provided by the embodiment of the present application.
  • FIG. 9 is a schematic diagram of another neural network model optimization method provided in the embodiment of the present application.
  • FIG. 10 is a schematic diagram of a generated computing power correspondence provided by the present application.
  • FIG. 11 is a schematic diagram of a neural network model optimization scenario provided by the present application.
  • FIG. 12 is a schematic structural diagram of a neural network model optimization device provided by the present application.
  • FIG. 13 is a schematic structural diagram of a computing device provided in the present application.
  • a neural network can be made up of neurons, where a neuron can be an operation unit that takes x_s and an intercept of 1 as inputs.
  • the output of the operation unit satisfies the following formula (1).
  • W_s is the weight of x_s.
  • b is the bias of the neuron.
  • f is the activation function of the neuron, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neuron into an output signal.
  • the output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function.
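  • formula (1) itself is not reproduced in this extract; a standard single-neuron formula consistent with the definitions of x_s, W_s, b and f above would be:

```latex
h_{W,b}(x) = f\Big(\sum_{s=1}^{n} W_s x_s + b\Big),
\qquad
f(z) = \frac{1}{1 + e^{-z}} \quad \text{(sigmoid)}
```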
  • a neural network is a network formed by connecting multiple above-mentioned single neurons, that is, the output of one neuron can be the input of another neuron.
  • each neuron can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neurons.
  • Weights characterize the strength of connections between different neurons. The weight determines the influence of the input on the output. A weight close to 0 means that changing the input does not change the output. Negative weights mean that increasing the input decreases the output.
  • the neural network 100 includes N processing layers, where N is an integer greater than or equal to 3.
  • the first layer of the neural network 100 is the input layer 110, which is responsible for receiving input signals
  • the last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing results of the neural network.
  • the other layers except the first layer and the last layer are intermediate layers 140, and these intermediate layers 140 together form a hidden layer 120, and each intermediate layer 140 in the hidden layer 120 can receive input signals and output signals.
  • the hidden layer 120 is responsible for the processing of the input signal.
  • Each layer represents a logical level of signal processing, and through multiple layers, data signals can be processed by multi-level logic.
  • the input signals of the neural network may be signals in various forms such as video signals, voice signals, text signals, image signals or temperature signals.
  • the image signal can be the scenery captured by the camera (image sensor), the environmental image captured by the monitoring equipment, and the facial image acquired by the access control system, etc.
  • the input signal of the neural network also includes various other computer-processable engineering signals, which will not be listed one by one here. If the neural network is used to carry out deep learning on the image signal, the quality of the image processed by the neural network can be improved.
  • a deep neural network (DNN) is also known as a multi-layer neural network.
  • DNN can be understood as a neural network with multiple hidden layers.
  • the layers of a deep neural network can be divided, according to their positions, into three categories: the input layer, the hidden layers and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the i+1-th layer.
  • the coefficient from the kth neuron of the (L-1)th layer to the jth neuron of the Lth layer is defined as W^L_jk, where the superscript L denotes the layer and the subscripts j and k denote the target and source neurons.
  • the input layer has no W parameter.
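  • with this notation, the forward computation of one fully connected layer can be written as follows (a standard formulation consistent with the definitions above, not quoted verbatim from the source):

```latex
a^{L}_{j} = f\Big(\sum_{k} W^{L}_{jk}\, a^{L-1}_{k} + b^{L}_{j}\Big)
\quad\Longleftrightarrow\quad
a^{L} = f\big(W^{L} a^{L-1} + b^{L}\big)
```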
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • a convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer can output several feature maps, and the feature map can refer to the intermediate results during the operation of the convolutional neural network.
  • Neurons in the same feature map share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location. That is, the statistics for one part of the image are the same as for other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
  • multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the convolutional neural network 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional) and a neural network layer 230 .
  • the convolutional layer/pooling layer 220 may include layers 221 to 226, for example.
  • layer 221 may be a convolutional layer
  • layer 222 may be a pooling layer
  • layer 223 may be a convolutional layer
  • layer 224 may be a pooling layer
  • layer 225 may be a convolutional layer
  • the layer 226 may be, for example, a pooling layer.
  • in another example, layers 221 and 222 may be convolutional layers
  • layer 223 may be, for example, a pooling layer
  • layers 224 and 225 may be, for example, convolutional layers
  • layer 226 may be, for example, a pooling layer.
  • the output of a convolutional layer can be used as input to a subsequent pooling layer, or as input to another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators, and the convolution operators may also be called kernels.
  • the role of the convolution operator in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of this weight matrix is related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix extends through the full depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension; in most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolved image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image. Do blurring etc.
  • the multiple weight matrices have the same size (row ⁇ column), and the feature maps extracted by the multiple weight matrices of the same size are also of the same size, and then the extracted multiple feature maps of the same size are combined to form the convolution operation. output.
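  • the sliding-window computation described above can be sketched in a few lines of Python; the shapes and the naive loop are for illustration only (real frameworks use far more efficient implementations):

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    # image: (H, W, C_in); kernels: (N, k, k, C_in), one weight matrix per
    # output channel, each spanning the full input depth.
    n, k = kernels.shape[0], kernels.shape[1]
    h, w, _ = image.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w, n), dtype=float)
    for c in range(n):                      # stack per-kernel outputs along depth
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, c] = np.sum(patch * kernels[c])
    return out
```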
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
  • the initial convolutional layer (such as layer 221 ) often extracts more general features, which can also be called low-level features.
  • the features extracted by the later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features; features with higher semantics are more suitable for the problem to be solved.
  • Each layer from layer 221 to layer 226 as shown in the convolutional layer/pooling layer 220 in Figure 2 can be a layer of convolutional layer followed by a layer of pooling layer, or a multi-layer convolutional layer followed by a layer or multiple pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
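  • a minimal sketch of the two pooling operators on a single-channel image (illustrative only):

```python
import numpy as np

def pool2d(image, size=2, stride=2, mode="max"):
    # image: (H, W). Each output pixel is the max (or average) of the
    # corresponding size x size sub-region of the input image.
    h, w = image.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out
```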
  • after being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not ready to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 uses the neural network layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (layer 231, layer 232 to layer 23n as shown in FIG. 2) and an output layer 240, and the parameters contained in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type.
  • the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • after the multiple hidden layers in the neural network layer 230, that is, as the last layer of the entire convolutional neural network 200, comes the output layer 240, which has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
  • once the forward propagation of the entire convolutional neural network 200 (propagation from layer 210 toward layer 240 in FIG. 2) is completed, the back propagation (propagation from layer 240 toward layer 210 in FIG. 2) starts to update the weight values and biases of the above-mentioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only an example of a convolutional neural network, and in specific applications, the convolutional neural network may also exist in the form of other network models.
  • the convolutional neural network can use the back propagation (BP) algorithm to correct the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
  • the above-mentioned neural network can also be called a neural network model.
  • the intermediate layers contained in the neural network can also be called operators. Operators are used to implement a unit calculation in a neural network. For example, an operator that implements convolutional layer calculations may be called a convolutional operator (conv).
  • the operator that realizes the calculation of the pooling layer may be called a pooling operator (pool).
  • the operator implementing the calculation of the activation layer may be called an activation operator (Relu).
  • the activation operator can also be called a linear rectification operator. At least two operators can form a subgraph.
  • the subgraph refers to the network structure composed of some intermediate layers in the neural network model.
  • the embodiment of the present application provides a method for optimizing a neural network model, in particular a technology for optimizing a neural network model based on mutually equivalent subgraphs: mathematically equivalent subgraphs are automatically searched for based on multiple operators, the subgraphs that make full use of the computing power of the hardware are automatically discovered, and the equivalent subgraphs in the neural network model are replaced, significantly shortening the processing time of the neural network model while ensuring that the accuracy of the data processed by the neural network model is not lost.
  • FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system 300 includes an execution device 310 , a training device 320 , a database 330 , a terminal device 340 , a data storage system 350 and a data collection device 360 .
  • the execution device 310 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a virtual reality (virtual reality, VR), an augmented reality (augmented reality, AR) device, a mixed reality (Mixed Reality, MR) device, an extended reality (Extended Reality, ER) devices, cameras, or vehicle-mounted terminals, etc., or edge devices (for example, boxes carrying chips with processing capabilities), etc.
  • the training device 320 may be a server or a cloud device or the like.
  • the training device 320 has strong computing capability, and can run the neural network model, perform calculations such as training the neural network model.
  • the executing device 310 and the training device 320 are different processors deployed on different physical devices (such as servers or servers in a cluster).
  • the execution device 310 may be a neural network model processor (neural network processing unit, NPU), a graphics processing unit (graphic processing unit, GPU), a central processing unit (central processing unit, CPU), other general processors, digital signal Processor (digital signal processing, DSP), application-specific integrated circuit (ASIC), field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices , discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the training device 320 may be a GPU, an NPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in the solution of this application.
  • the data collection device 360 is used to collect training data and store the training data in the database 330 .
  • the training data may be data in at least one form of images, voice and text.
  • training data includes training images and objects in the training images.
  • the training device 320 is used to train the neural network model with the training data until the loss function in the neural network model converges, and the training of the neural network model is completed if the loss function value is less than a specific threshold, so that the neural network model reaches a certain accuracy. Alternatively, if all the training data in the database 330 are used for training, then the training of the neural network model is completed, so that the trained neural network model has functions such as recognition or classification. Furthermore, the training device 320 configures the trained neural network model 301 to the execution device 310 .
  • the execution device 310 is used to realize functions such as processing application data according to the trained neural network model 301 to realize recognition.
  • the training device 320 can configure the trained neural network model 301 to multiple execution devices 310 .
  • Each execution device 310 utilizes the trained neural network model 301 to implement functions such as recognition or classification.
  • the neural network model is used to identify road signs, driving reference objects, and obstacles on the road in the environment to ensure safe and accurate driving of the autonomous vehicle.
  • Signposts can contain graphical or textual signposts.
  • Driving reference objects can be buildings or plants. Obstacles on the road may include dynamic objects (eg animals) or stationary objects (eg stationary vehicles).
  • the neural network model is mainly used to identify objects (such as cars and users) in environments such as intersections or parks.
  • the neural network model is mainly used to recognize speech or text.
  • the training device 320 may also iteratively train the neural network model based on the training data maintained by the database 330 and the application data provided by the execution device 310 . Understandably, iterative training refers to any training after the first training of the neural network model. Since the training data maintained by the database 330 may be a full training set, including application data acquired in different application scenarios, the data in different application scenarios have different or dissimilar application scenario features (such as: environmental features, time features). If the training set in which the training device 320 iteratively trains the neural network model contains data of different or dissimilar application scenario characteristics, it will make it difficult for the neural network model to achieve better results in processing application data in different application scenarios.
  • after the training device 320 completes the training of the neural network model and before the trained neural network model 301 is deployed to the execution device 310, the trained neural network model 301 is optimized based on the subgraph set: a subgraph equivalent to a subgraph in the neural network model 301 is determined, the subgraph in the neural network model 301 is replaced based on the data processing efficiency with which the execution device 310 computes the two equivalent subgraphs, and the optimized neural network model 301 is deployed to the execution device 310. In this way, the time the neural network model 301 takes to process data is significantly shortened while the accuracy with which the execution device 310 processes data based on the neural network model 301 is not compromised.
  • the training data maintained in the database 330 may not all come from the data collection device 360, and may also be received from other devices.
  • the training device 320 does not necessarily train the neural network model based entirely on the training data maintained by the database 330, and may also obtain training data from the cloud or other places to train the neural network model.
  • the above description should not be used as a limitation to the embodiments of the present application.
  • the execution device 310 can be further subdivided into the architecture shown in FIG. 3, that is, the execution device 310 is configured with a computing module 311, an I/O interface 312 and a preprocessing module 313.
  • the I/O interface 312 is used for data interaction with external devices.
  • a user can input data to the I/O interface 312 through the terminal device 340 .
  • Input data can include images or video.
  • the input data can also come from the database 330 .
  • the preprocessing module 313 is configured to perform preprocessing according to the input data received by the I/O interface 312 .
  • when the execution device 310 preprocesses the input data, or when the computing module 311 of the execution device 310 performs calculation and other related processing, the execution device 310 can call data, code and the like in the data storage system 350 for the corresponding processing, and the correspondingly processed data and instructions may also be stored in the data storage system 350.
  • the optimized neural network model stored in the execution device 310 may be applied to the execution device 310 .
  • the calculation module 311 inputs the application data into the optimized neural network model to obtain a processing result. Since the optimized neural network model is a model optimized by the training device 320 according to the sub-atlas, processing the application data by using the optimized neural network model can meet the accuracy and duration requirements of the user for data processing.
  • the I/O interface 312 returns the processing result to the terminal device 340, thereby providing it to the user, so that the user can view the processing result.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 312 .
  • the terminal device 340 can automatically send input data to the I/O interface 312. If the terminal device 340 is required to automatically send the input data to obtain authorization from the user, the user can set corresponding permissions in the terminal device 340.
  • the user can view the processing results output by the execution device 310 on the terminal device 340, and the specific presentation form may be specific ways such as display, sound, and action.
  • the terminal device 340 can also be used as a data collection terminal, collecting input data input to the I/O interface 312 as shown in the figure and output processing results of the I/O interface 312 as new sample data, and storing them in the database 330 .
  • alternatively, the terminal device 340 may not be used for collection; instead, the I/O interface 312 stores the input data input to the I/O interface 312 and the processing results output by the I/O interface 312, as shown in the figure, in the database 330 as new sample data.
  • Fig. 3 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in Fig. 3 does not constitute any limitation.
  • in FIG. 3, the data storage system 350 is a memory external to the execution device 310; in other cases, the data storage system 350 may also be placed inside the execution device 310.
  • the computing device obtains the neural network model to be optimized, searches for an equivalent subgraph of the first subgraph in the neural network model to be optimized in the subgraph set, and replaces the first subgraph in the neural network model to be optimized with an equivalent subgraph .
  • the equivalent subgraph and the first subgraph have the same output for the same input data, and the processing efficiency of the equivalent subgraph on the input data is greater than that of the first subgraph on the input data, and the subgraph set includes multiple subgraphs.
  • FIG. 4 is a schematic diagram of a method for generating a subgraph set provided by an embodiment of the present application.
  • the training device 320 in FIG. 3 is taken as an example for illustration.
  • the method includes the following steps.
  • Step 410: the training device 320 acquires an operator set according to neural network models of multiple application scenarios.
  • the training device 320 extracts operators from neural network models applied to different application scenarios, and removes repeated operators to form an operator set.
  • the operator set includes multiple operators, and each operator is used to implement different computing functions.
  • the operator set includes a linear rectification operator (Relu), a matrix transformation operator (Reshape), a convolution operator (Conv), a pooling operator (pool), a maximum pooling operator (Maxpool), a matrix transposition operator (Transpose), and the like.
  • Application scenarios include but are not limited to target recognition and automatic driving scenarios.
  • the neural network model described in the embodiment of the present application may be a mainstream computer vision (computer vision, CV) model.
  • CV models include, for example, YOLO, AlexNet, Residual Network (ResNet) and Dense Convolutional Network (DenseNet).
  • the neural network model shown in (a) in Figure 5 contains 9 operators, wherein the 9 operators include 3 linear rectification operators, 3 convolution operators, 2 matrix transformation operators and 1 matrix transpose operator.
  • the training device 320 removes repeated operators among the nine operators, and the obtained operator set includes linear rectification operators, convolution operators, matrix transformation operators and matrix transposition operators.
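  • a sketch of step 410, assuming each model exposes an iterable of operators with a type attribute (the attribute names are hypothetical):

```python
def build_operator_set(models):
    # Collect the operator types used by the neural network models of
    # several application scenarios; the set drops repeated operators.
    operator_set = set()
    for model in models:
        for op in model.operators:
            operator_set.add(op.type)
    return operator_set

# For the example of Figure 5, the 9 operators would reduce to the set
# {"Relu", "Conv", "Reshape", "Transpose"}.
```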
  • the training device 320 may also obtain the operator set according to the neural network model given by the user. Alternatively, the training device 320 acquires a given set of operators.
  • Step 420: the training device 320 searches for subgraphs with equivalence relations according to the operator set, so as to generate a subgraph set.
  • the training device 320 constructs multiple legal subgraphs by permuting and combining operators included in the operator set, or constructs multiple legal subgraphs by permuting and combining operators according to characteristics such as the number of operators, operator types, and operator parameters.
  • a legal subgraph can mean that the output of any operator in the subgraph conforms to the input of the operator connected to it.
  • the training device 320 searches the legal subgraphs for subgraphs with an equivalence relationship to generate a subgraph set, and the subgraph set includes multiple pairs of subgraphs with an equivalence relationship.
  • the training device 320 uses various methods, including but not limited to subgraph hashing, comparing outputs on random test cases, and mathematical equivalence analysis, to determine whether any two legal subgraphs are equivalent; if the two legal subgraphs are equivalent, a pair of mutually equivalent subgraphs is output. This step is repeated to search for mutually equivalent subgraphs among the multiple legal subgraphs to form the subgraph set.
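  • a simplified sketch of the random-test-case comparison mentioned above, assuming the legal subgraphs are callables on numpy arrays (subgraph hashing and mathematical equivalence analysis are omitted):

```python
import numpy as np
from itertools import combinations

def search_equivalent_subgraphs(legal_subgraphs, input_shape, trials=3):
    # Compare every pair of legal subgraphs on the same random inputs and
    # keep the pairs whose outputs always match (an equivalence relation).
    rng = np.random.default_rng(0)
    inputs = [rng.standard_normal(input_shape).astype(np.float32)
              for _ in range(trials)]
    subgraph_set = []
    for g1, g2 in combinations(legal_subgraphs, 2):
        if all(np.allclose(g1(x), g2(x), atol=1e-5) for x in inputs):
            subgraph_set.append((g1, g2))
    return subgraph_set
```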
  • the training device 320 can generate the first mapping relationship according to step 410 and step 420, and the training device 320 can determine the subgraph to be replaced in the model to be optimized according to the first mapping relationship.
  • a subgraph with an equivalence relationship is used to output the same result based on the same input data, that is, inputting the same input data to two subgraphs with an equivalence relationship can output the same output data.
  • two mutually equivalent subgraphs may contain different operator types and subgraph structures, but the operator parameters need to be the same.
  • the operator set includes a linear rectification operator, a convolution operator, a matrix transformation operator, a matrix transposition operator, and a string concatenation operator (Concat).
  • the training device 320 searches the operator set, and obtains that subgraph 1 is equivalent to subgraph 2, that is, the two convolution operators in subgraph 1 are combined into one convolution operator.
  • the operator parameters of the two convolution operators in subgraph 1 are the same, that is, the output dimension is 64, the input dimension is 16, and the convolution kernel is 3.
  • the operator parameters of the convolution operator in subgraph 2 are an output dimension of 128, an input dimension of 16, and a convolution kernel of 3.
  • since subgraph 1 contains more operators than subgraph 2, the execution device 310 takes more steps to compute subgraph 1 than to compute subgraph 2, and the time the execution device 310 takes to compute subgraph 1 is longer than the time it takes to compute subgraph 2.
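  • as an illustration of this kind of equivalence, the following PyTorch sketch builds a hypothetical reconstruction of subgraph 1 (two parallel 3x3 convolutions with input dimension 16 and output dimension 64 each, followed by a Concat) and subgraph 2 (a single 3x3 convolution with output dimension 128 whose weights are the concatenation of the two) and checks that they produce the same output; the parallel-plus-Concat structure is an assumption, not stated explicitly in the source:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical subgraph 1: two parallel 3x3 convolutions (in 16, out 64 each)
# whose outputs are concatenated along the channel dimension.
conv_a = nn.Conv2d(16, 64, kernel_size=3)
conv_b = nn.Conv2d(16, 64, kernel_size=3)

# Hypothetical subgraph 2: a single 3x3 convolution (in 16, out 128) whose
# weights and biases are the channel-wise concatenation of conv_a and conv_b.
conv_merged = nn.Conv2d(16, 128, kernel_size=3)
with torch.no_grad():
    conv_merged.weight.copy_(torch.cat([conv_a.weight, conv_b.weight], dim=0))
    conv_merged.bias.copy_(torch.cat([conv_a.bias, conv_b.bias], dim=0))

x = torch.randn(1, 16, 32, 32)
out_1 = torch.cat([conv_a(x), conv_b(x)], dim=1)   # subgraph 1: two convs + Concat
out_2 = conv_merged(x)                              # subgraph 2: one wider conv
print(torch.allclose(out_1, out_2, atol=1e-5))      # True: same output, fewer operators
```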
  • the training device 320 can also optimize multiple legal subgraphs, delete redundant paths in the legal subgraphs, improve the accuracy of calculating subgraphs and shorten the duration of calculating subgraphs.
  • the training device 320 optimizes multiple legal subgraphs based on a pruning algorithm.
  • the automatic search for mutually equivalent subgraphs according to the operator set provided by the embodiment of this application can effectively save manpower and cover all possible mutually equivalent subgraphs.
  • when the computing device replaces a subgraph in the neural network model to be optimized, it determines, in the subgraph set, the second subgraph corresponding to the first subgraph, where the second subgraph and the first subgraph produce the same output for the same input data. If the computing device determines that the computing resource used to execute the first subgraph achieves higher data processing efficiency when executing the second subgraph than when executing the first subgraph, the second subgraph is taken as the equivalent subgraph and the first subgraph is replaced with the second subgraph, thereby optimizing the neural network model to be optimized.
  • the neural network model optimization method will be described in detail below with reference to FIG. 7 to FIG. 10 .
  • FIG. 7 is a schematic diagram of a method for optimizing a neural network model provided by an embodiment of the present application.
  • the training device 320 in FIG. 3 is taken as an example for illustration. As shown in Fig. 7, the method includes the following steps.
  • Step 710: the training device 320 obtains the neural network model to be optimized.
  • the training device 320 can obtain the neural network model to be optimized from Internet open source models. Alternatively, the training device 320 uses the neural network model provided by the user as the neural network model to be optimized. Alternatively, the training device 320 uses the trained neural network model obtained through self-training as the neural network model to be optimized.
  • the neural network model to be optimized includes multiple operators, multiple operators form multiple subgraphs, and at least two operators form a subgraph. It can be understood that at least two continuous operators in the neural network model to be optimized form a subgraph. Different subgraphs in the neural network model to be optimized can be composed of different continuous operators.
  • Step 720: the training device 320 determines the first subgraph to be replaced in the neural network model to be optimized according to the subgraph set.
  • the training device 320 determines the first subgraph in the neural network model to be optimized that is the same as the subgraph included in the subgraph set according to the subgraph features.
  • Subgraph features include operator type, subgraph structure and operator parameters.
  • the operator type refers to the type of operator contained in the subgraph. For example, the operator type includes convolution, matrix transformation, matrix transpose, and linear rectification.
  • the subgraph structure refers to the connection mode of the operators contained in the subgraph.
  • Operator parameters refer to parameters such as weights of operators included in the subgraph.
  • the training device 320 matches the subgraphs contained in the subgraph set with the subgraphs contained in the neural network model to be optimized. If the operator type, subgraph structure and operator parameters of two subgraphs are the same, it is determined that the subgraph contained in the neural network model to be optimized is the same as the subgraph contained in the subgraph set; that is, the training device 320 has determined a subgraph to be replaced in the neural network model to be optimized. The training device 320 traverses the subgraphs in the subgraph set and determines all possible subgraphs to be replaced in the neural network model to be optimized.
  • the subgraphs with equivalence relations can be presented in the form of a table, as shown in Table 1.
  • sub-graph 1 is equivalent to sub-graph 2.
  • Subgraph 3 is equivalent to subgraph 4.
  • Table 1 only shows, in the form of a table, how the subgraphs with an equivalence relationship are stored in the storage device, and does not limit the storage form of this correspondence in the storage device; the correspondence may also be stored in other forms, which is not limited in this embodiment.
  • the training device 320 determines the first sub-graph to be replaced in the neural network model to be optimized according to the sub-graph set.
  • Step 730: the training device 320 replaces the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set to obtain an optimized neural network model.
  • computing resources (such as processors) have different affinities for operators of different types; that is, different computing resources are suited to computing operators of different operator types.
  • affinity refers to the degree to which a computing resource, when computing an operator, makes effective use of the hardware's operation capability (abbreviated: computing power).
  • Operation ability is one of the basic components of mathematics ability, which refers to the ability to use the knowledge about operations to perform operations and reason to obtain the results of operations.
  • for example, processor 1 is suited to computing a matrix transformation operator and processor 2 is suited to computing a matrix transpose operator. If processor 1 computes the matrix transpose operator, processor 1 cannot make effective use of its computing power; therefore, the computing power of processor 1 when computing the matrix transformation operator is higher than the computing power of processor 1 when computing the matrix transpose operator.
  • the computing power of a computing resource when computing an operator is related to the time the computing resource takes to compute the operator. If the computing power of the computing resource can be used effectively when computing an operator, the computation takes less time; if the computing power cannot be used effectively, the computation takes longer.
  • based on the computing power with which the execution device 310 computes the second subgraph and the computing power with which the execution device 310 computes the first subgraph, it is determined whether to replace the first subgraph with the second subgraph equivalent to the first subgraph in the subgraph set.
  • the execution device 310 may be a device that needs to deploy a neural network model to be optimized, and a resource that processes application data based on the neural network model to be optimized to implement application functions such as recognition.
  • the computing power of the computing resource to calculate the second sub-graph may refer to the data processing efficiency of the computing resource to calculate the second sub-graph.
  • the computing power of the computing resource to calculate the first sub-graph may refer to the data processing efficiency of the computing resource to calculate the first sub-graph.
  • the training device 320 determining the second subgraph corresponding to the first subgraph in the subgraph set includes: determining the second subgraph corresponding to the first subgraph according to the second mapping relationship. Determining that the computing resource used in the computing device to execute the first subgraph has higher data processing efficiency when executing the second subgraph than when executing the first subgraph includes: determining, according to the second mapping relationship, that the data processing efficiency when the computing resource executes the second subgraph is higher than the data processing efficiency when it executes the first subgraph.
  • the sub-graph set is used to indicate the computing power corresponding relationship of computing resources to calculate equivalent sub-graphs, that is, the second mapping relationship.
  • the training device 320 may determine whether to perform subgraph replacement based on the computing power correspondence.
  • the training device 320 executes step 731, that is, replacing the first subgraph with the second subgraph determined according to the computing power correspondence.
  • the computing power correspondence is used to represent the correspondence between computing resources, neural network models, sub-graphs, and computing power for running sub-graphs.
  • the computing power correspondence can be presented in the form of a table, as shown in Table 2.
  • The computing power with which computing resource 1 computes subgraph 1 based on neural network model 1 is lower than the computing power with which computing resource 1 computes subgraph 2 based on neural network model 1.
  • the computing power of computing resource 1 for calculating sub-graph 3 based on neural network model 1 is lower than the computing power of computing resource 1 for computing sub-graph 4 based on neural network model 1.
  • The training device 320 determines, according to the computing power correspondence, that the subgraph to be replaced in neural network model 1 is subgraph 1, and that subgraph 1 and subgraph 2 are a pair of mutually equivalent subgraphs. Since the computing power with which computing resource 1 computes subgraph 1 based on neural network model 1 is lower than the computing power with which computing resource 1 computes subgraph 2 based on neural network model 1, the training device 320 can replace subgraph 1 with subgraph 2 in neural network model 1.
  • Table 2 only shows the storage form of the computing power correspondence in the storage device in the form of a table, and does not limit the storage form of the computing power correspondence in the storage device.
  • The computing power correspondence may also be stored in the storage device in other forms, which is not limited in this embodiment.
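  • For illustration only, the following sketch shows one possible in-memory form of such a computing power correspondence and a query over it; the names (CorrespondenceEntry, lookup_replacement) and the numeric values are assumptions of this sketch and do not appear in the present application.

```python
# A minimal sketch (not from the patent) of a computing power correspondence table.
# Each entry records: computing resource, neural network model, the subgraph to be
# replaced, its equivalent subgraph, and the computing power measured for each.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CorrespondenceEntry:
    resource: str              # e.g. "computing resource 1"
    model: str                 # e.g. "neural network model 1"
    subgraph: str              # subgraph to be replaced, e.g. "subgraph 1"
    equivalent: str            # equivalent subgraph, e.g. "subgraph 2"
    power_subgraph: float      # computing power when running `subgraph`
    power_equivalent: float    # computing power when running `equivalent`

def lookup_replacement(table: List[CorrespondenceEntry], resource: str,
                       model: str, subgraph: str) -> Optional[str]:
    """Return the equivalent subgraph only if it uses the resource's power better."""
    for entry in table:
        if (entry.resource, entry.model, entry.subgraph) == (resource, model, subgraph):
            if entry.power_equivalent > entry.power_subgraph:
                return entry.equivalent
    return None

# Example in the spirit of Table 2: subgraph 1 -> subgraph 2 on computing resource 1.
table = [CorrespondenceEntry("computing resource 1", "neural network model 1",
                             "subgraph 1", "subgraph 2", 0.4, 0.9)]
print(lookup_replacement(table, "computing resource 1", "neural network model 1",
                         "subgraph 1"))   # prints: subgraph 2
```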
  • the computing power of computing resources to calculate sub-graphs is related to the duration of computing resources to calculate sub-graphs.
  • the training device 320 may determine whether to perform subgraph replacement based on the computing resource utilization cost function.
  • the cost function is used to calculate the duration of subgraphs with equivalence relations based on the same computing resource.
  • the input data of the cost function includes operator type, subgraph structure, operator parameters and input parameters.
  • the output data of the cost function includes the duration of computing subgraphs with equivalence relations based on the same computing resource.
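  • Purely as an illustration, a cost function with the inputs and outputs described above could look like the sketch below; the dictionary layout, field names and latency numbers are assumptions of this sketch, not part of the present application.

```python
# Minimal sketch (not from the patent) of a cost function that estimates how long a
# given computing resource takes to run a subgraph. Its inputs mirror the ones listed
# above: operator types, subgraph structure, operator parameters and input parameters.
def cost_function(resource, subgraph, input_params):
    """Return an estimated duration (in milliseconds) of running `subgraph` on `resource`."""
    total_ms = 0.0
    for op in subgraph["operators"]:                                  # subgraph structure
        per_op = resource["op_latency_ms"].get(op["type"], 1.0)       # operator type
        volume = input_params["batch"] * op.get("param_count", 1)     # operator / input parameters
        total_ms += per_op * volume
    return total_ms

# Toy usage with made-up latencies: transpose is assumed slow, convolution fast.
resource = {"op_latency_ms": {"transpose": 2.0, "conv": 0.5}}
subgraph = {"operators": [{"type": "transpose", "param_count": 3},
                          {"type": "conv", "param_count": 2}]}
print(cost_function(resource, subgraph, {"batch": 4}))   # prints: 28.0
```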
  • the subgraph set includes a first mapping relationship between the first subgraph and the second subgraph.
  • the training device 320 determining the second subgraph corresponding to the first subgraph in the subgraph set includes: determining the second subgraph corresponding to the first subgraph according to the first mapping relationship.
  • the training device 320 executes step 732 .
  • Step 732: based on the computing resource, use the cost function to determine the duration of the second subgraph and the duration of the first subgraph respectively.
  • Step 733: judge whether the duration of the second subgraph is greater than the duration of the first subgraph.
  • If the duration of the second subgraph is not greater than the duration of the first subgraph, step 734 is performed, that is, the first subgraph is replaced with the second subgraph.
  • If the duration of the second subgraph is greater than the duration of the first subgraph, step 735 is performed, that is, the first subgraph is not replaced with the second subgraph.
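  • The flow of steps 732 to 735 can be pictured with the short sketch below; the function names and the stand-in cost function are assumptions of this sketch and only illustrate the decision described above.

```python
# Minimal sketch (not from the patent) of the step 732-735 decision: replace the first
# subgraph only when the cost function says its equivalent runs in less time.
def decide_replacement(cost_fn, resource, first_subgraph, second_subgraph, input_params):
    # Step 732: determine the duration of each subgraph with the cost function.
    t_first = cost_fn(resource, first_subgraph, input_params)
    t_second = cost_fn(resource, second_subgraph, input_params)
    # Step 733: judge whether the second (equivalent) subgraph takes less time.
    if t_second < t_first:
        return "step 734: replace the first subgraph with the second subgraph"
    return "step 735: do not replace the first subgraph"

# Toy usage: a stand-in cost function that simply counts operators in a subgraph.
toy_cost = lambda res, sub, params: len(sub)
print(decide_replacement(toy_cost, "computing resource 1",
                         ["reshape", "transpose", "reshape"], ["conv1x1"], {}))
```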
  • The optimized neural network model includes the second subgraph, and when data is processed based on the computing resources, the processing duration of the optimized neural network model is shorter than the processing duration of the neural network model to be optimized.
  • The subgraph set contains multiple subgraphs.
  • The training device 320 can determine, in real time, the subgraphs equivalent to the subgraphs in the model to be optimized, that is, input the same input data into the subgraphs in the model to be optimized and into the subgraphs in the subgraph set, and determine the subgraphs that produce the same result as mutually equivalent subgraphs.
  • That the training device 320 determines the second subgraph corresponding to the first subgraph in the subgraph set may also include: Step 736, the training device 320 inputs the input data to the first subgraph, runs the first subgraph, and outputs the running result; the training device 320 then inputs the same input data to at least one subgraph in the subgraph set, and determines the subgraph whose output is the same as the running result as the second subgraph. Furthermore, the training device 320 judges, in the second manner described above (using the cost function), whether to replace the first subgraph with the second subgraph.
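  • The "run both and compare outputs" determination of step 736 can be sketched as follows; the use of NumPy arrays, the tolerance value and the helper names are assumptions of this illustration only.

```python
# Minimal sketch (not from the patent) of step 736: run the first subgraph on some input,
# then return a candidate from the subgraph set that produces the same result.
import numpy as np

def find_equivalent(first_subgraph_fn, candidate_fns, input_data, tol=1e-6):
    reference = first_subgraph_fn(input_data)
    for name, fn in candidate_fns.items():
        if np.allclose(fn(input_data), reference, atol=tol):
            return name          # this candidate plays the role of the second subgraph
    return None

# Toy example: "x*2 + x" and "3*x" behave as a pair of mutually equivalent subgraphs.
x = np.random.rand(4, 4)
print(find_equivalent(lambda a: a * 2 + a, {"triple": lambda a: 3 * a}, x))   # prints: triple
```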
  • The neural network model to be optimized includes a subgraph 3, and subgraph 3 includes 2 matrix transformation operators and 1 matrix transpose operator: one matrix transformation operator is connected to the matrix transpose operator, which is in turn connected to the other matrix transformation operator.
  • Subgraph 4 includes a convolution operator.
  • Subgraph 3 and subgraph 4 are a pair of mutually equivalent subgraphs. Replacing subgraph 3 with subgraph 4, that is, replacing the exchange (shuffle) operation used in distributed processing of big data with a convolution, yields the optimized neural network model.
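  • To make the subgraph 3 / subgraph 4 example more concrete, the sketch below checks numerically that a shuffle-style subgraph (reshape, transpose, reshape) produces the same output as a 1x1 convolution whose weight is a fixed permutation matrix; this construction is an illustration assumed for this text, not the exact operator arrangement of the present application.

```python
# Minimal sketch (not from the patent): a channel-shuffle subgraph expressed two ways.
import numpy as np

def channel_shuffle(x, groups):
    # Subgraph-3 style: matrix transformation (reshape), transpose, transformation (reshape).
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

def shuffle_as_conv1x1(x, groups):
    # Subgraph-4 style: the same channel permutation written as a 1x1 convolution
    # whose weight matrix is a permutation matrix (a per-pixel matrix multiply).
    n, c, h, w = x.shape
    src = np.arange(c).reshape(groups, c // groups).T.reshape(-1)  # output j reads input src[j]
    weight = np.zeros((c, c))
    weight[np.arange(c), src] = 1.0
    return np.einsum('oc,nchw->nohw', weight, x)

x = np.random.rand(2, 8, 4, 4)
assert np.allclose(channel_shuffle(x, 2), shuffle_as_conv1x1(x, 2))
print("the two subgraphs produce the same output for the same input")
```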
  • After the training device 320 determines, according to the computing power correspondence, that the first subgraph is to be replaced with the second subgraph, it can also determine the duration of the second subgraph and the duration of the first subgraph respectively by using the cost function. If the duration of the second subgraph is less than the duration of the first subgraph, the first subgraph is replaced with the second subgraph. As shown in FIG. 9, the training device 320 may first execute step 731, then execute step 732 and step 733, and then step 734 or step 735. In this way, the training device 320 improves the accuracy of subgraph replacement through two judgments.
  • The computing power correspondence may also be deployed to the execution device 310, and the execution device 310 performs the optimization operation of subgraph replacement on the neural network model to be optimized according to the computing power correspondence. For example, the execution device 310 determines, according to the computing power correspondence, the first subgraph to be replaced in the neural network model to be optimized and the second subgraph equivalent to the first subgraph, and then replaces the first subgraph with the second subgraph to obtain the optimized neural network model.
  • The training device 320 can automatically save the equivalent subgraphs that bring performance benefits for the computing resource, so as to form the second mapping relationship, that is, a knowledge base of hardware-affine equivalent subgraphs.
  • Any model that hits a subgraph in the knowledge base can obtain an inference performance improvement through optimization of the neural network model to be optimized, as shown in (a) of FIG. 10.
  • The training device 320 can replace subgraph 1 with subgraph 2 in neural network model 1, and generate correspondence 1, that is, computing resource 1, neural network model 1, subgraph 1 and subgraph 2.
  • The training device 320 can replace subgraph 3 with subgraph 4 in neural network model 2, and generate correspondence 2, that is, computing resource 2, neural network model 2, subgraph 3 and subgraph 4.
  • the training device 320 can call the underlying AI chip and the operator interface provided by the operator adaptation layer, optimize the neural network model, and collect relevant operator and sub-graph performance data required by the system.
  • Otherwise, subgraph replacement may instead prolong the duration for which the optimized neural network model processes data based on the computing resources.
  • the equivalent subgraph designed according to the application is not necessarily applicable to different computing resources. For different computing resources, the equivalent subgraph needs to be re-analyzed and designed, and the equivalent subgraph cannot be reused.
  • The embodiment of the present application decides whether to perform subgraph replacement according to the computing power correspondence or the cost function, so as to ensure that each subgraph replacement can effectively bring performance benefits even when it is based on different computing resources.
  • the neural network model optimization method provided by the embodiment of the present application is easy to use, and the process is fully automated. The user only needs to input the neural network model to be optimized, and the optimized neural network model can be obtained without any other operations. The optimization process is simple and efficient.
  • the training device 320 may replace the subgraph according to the methods of step 720 and step 730 above.
  • The training device 320 can continue to perform subgraph replacement on the updated neural network model according to steps 720 and 730, traversing all possible replacement subgraphs until the optimized neural network model is obtained. It is understandable that the training device 320 performs subgraph replacement on the neural network model to be optimized according to step 720 and step 730 so as to obtain multiple updated neural network models, and the finally obtained optimized neural network model is derived from these multiple updated neural network models.
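  • The repeated application of steps 720 and 730 until no further replacement is possible can be sketched as follows; `match_first_subgraph` and `apply_replacement` are hypothetical helpers standing in for the subgraph matching and replacement described above.

```python
# Minimal sketch (not from the patent) of the outer optimization loop.
def optimize_model(model, subgraph_set, match_first_subgraph, apply_replacement):
    """Repeat steps 720 and 730 on the updated model until no subgraph matches."""
    updated = model
    while True:
        # Step 720: look for a first subgraph in the (updated) model that matches the subgraph set.
        match = match_first_subgraph(updated, subgraph_set)
        if match is None:
            break                                  # all possible replacement subgraphs traversed
        first_subgraph, second_subgraph = match
        # Step 730: replace the first subgraph with its equivalent second subgraph.
        updated = apply_replacement(updated, first_subgraph, second_subgraph)
    return updated                                 # the optimized neural network model

# Trivial usage: a "model" with no matching subgraphs is returned unchanged.
print(optimize_model("model-with-no-matches", {}, lambda m, s: None, lambda m, a, b: m))
```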
  • the training device 320 obtains the optimized neural network model after performing subgraph replacement optimization on the neural network model to be optimized, and deploys the optimized neural network model to the execution device 310, and the execution device 310 processes application data based on the optimized neural network model.
  • the executing device 310 executes step 740 and step 750 .
  • Step 740 the training device 320 deploys the optimized neural network model to the executing device 310 .
  • step 750 the executing device 310 processes the application data based on the optimized neural network model to realize application functions such as recognition. Therefore, the duration of processing application data by the executing device 310 is reduced.
  • As for quantization techniques: modifying the weights of the neural network model based on quantization technology can reduce the underlying computation amount and achieve an acceleration effect.
  • However, quantization techniques generally require sample data for calibration; otherwise, a large loss of precision is caused. In scenarios without any sample data, quantization techniques are not applicable.
  • Weight pruning is also called unstructured pruning. After the pruning is completed, it will cause sparsification. Generally, specific hardware that supports sparse computing is required, otherwise there will be no acceleration effect; channel pruning is also called structured pruning. Pruning will bring obvious loss of precision. After pruning, training data is needed to train the neural network model to improve the precision, which is not suitable for scenarios without training data.
  • the inference acceleration algorithm based on equivalent subgraph replacement provided by the embodiment of the present application can automatically search for equivalent subgraphs compatible with the hardware platform, and automatically perform subgraph replacement on the neural network model to be optimized to achieve inference acceleration without loss of accuracy.
  • the embodiment of this application does not require any data, and can achieve inference acceleration without any data, and has a wide range of application scenarios.
  • the application scenarios described in the embodiments of this application may include target detection, monitoring, automatic driving, speech recognition, product recommendation, machine translation, AI product classification, industrial quality inspection, and so on.
  • Computer vision is an integral part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, and medical diagnosis.
  • It is the discipline of using cameras/video cameras and computers to obtain the data and information of a photographed subject. To put it figuratively, it is to install eyes (cameras/video cameras) and a brain (algorithms) on a computer to replace human eyes in identifying and measuring targets, so that the computer can perceive the environment.
  • perception can be thought of as extracting information from sensory signals
  • computer vision can also be thought of as the science of how to make artificial systems "perceive" from images or multidimensional data.
  • computer vision is to use various imaging systems to replace the visual organs to obtain input information, and then use the computer to replace the brain to complete the processing and interpretation of these input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • Object detection methods can be applied in scenarios such as face detection, vehicle detection, pedestrian counting, automatic driving, security systems, and medical fields.
  • an autonomous vehicle recognizes objects in the surrounding environment during driving to adjust the speed and direction of the autonomous vehicle so that the autonomous vehicle can drive safely and avoid traffic accidents.
  • Objects may be other vehicles, traffic control devices, or other types of objects.
  • In the security system, a large number of users are identified to assist the staff in determining the target person as soon as possible.
  • the input data (such as image or video) is input to the neural network with target detection function, the neural network performs feature extraction on the input data, and the target detection is performed based on the extracted features, and the detection result is obtained.
  • The execution device 310 may have stored the optimized neural network model before executing step 750, that is, before processing the application data based on the optimized neural network model; therefore, the execution device 310 may read the optimized neural network model from the memory and process the application data based on the optimized neural network model.
  • the execution device 310 does not store the optimized neural network model, and needs to download the optimized neural network model from the server or optimize the neural network model by itself.
  • the server may refer to a cloud server.
  • FIG. 11 is a schematic structural diagram of a system 1100 provided in the present application.
  • the system 1100 may be an entity that provides cloud services to users by using basic resources.
  • System 1100 includes cloud data center 1110 .
  • the cloud data center 1110 includes a device resource pool (including computing resources 1111 , storage resources 1112 and network resources 1113 ) and a cloud service platform 1120 .
  • the computing resource 1111 included in the cloud data center 1110 may be a computing device (such as a server).
  • An interaction means 1131 may be deployed on the execution device 1130 .
  • the interaction means 1131 may be a browser or an application capable of message interaction with the cloud service platform 1120 .
  • the user can access the cloud service platform 1120 through the interaction device 1131, upload a request to the cloud data center 1110, and request to optimize the neural network model used in the automatic driving scene.
  • the cloud data center 1110 optimizes the neural network model requested by the user, and feeds back the optimized neural network model 301 to the execution device 1130 .
  • the execution device 1130 may be a smart terminal or an edge station.
  • the edge station can process the application data of the self-driving car and transmit the processing results to the self-driving car.
  • the processing results are used to instruct the autonomous vehicle to operate.
  • the execution device 1130 may also be an automatic driving vehicle, and the edge station deploys the optimized neural network model 301 to the automatic driving vehicle, and the automatic driving vehicle processes application data according to the optimized neural network model, and instructs the automatic driving vehicle to operate.
  • the computing device includes corresponding hardware structures and/or software modules for performing various functions.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software with reference to the units and method steps of the examples described in the embodiments disclosed in the present application. Whether a certain function is executed by hardware or by computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.
  • the neural network model optimization method provided by this embodiment is described in detail above with reference to FIG. 1 to FIG. 11 , and the neural network model optimization device provided by this embodiment will be described below in conjunction with FIG. 12 .
  • FIG. 12 is a schematic structural diagram of a possible neural network model optimization device provided in this embodiment.
  • These neural network model optimization devices can be used to implement the functions of the training device 320 in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
  • the apparatus for optimizing the neural network model may be the training device 320 shown in FIG. 4 , FIG. 7 or FIG. 9 , or it may be a module (such as a chip) applied to a server.
  • the neural network model optimization apparatus 1200 includes a communication module 1210 , a module to be replaced 1220 , a replacement module 1230 and a storage module 1240 .
  • the neural network model optimization apparatus 1200 is used to realize the function of the training device 320 in the method embodiment shown in FIG. 4 , FIG. 7 or FIG. 9 above.
  • the communication module 1210 is used to obtain the neural network model to be optimized, and deploy the optimized neural network model to the execution device 310 .
  • the neural network model to be optimized includes a plurality of operators, the plurality of operators form a plurality of subgraphs, and at least two operators form a subgraph.
  • the communication module 1210 is used to execute step 710 and step 740 in FIG. 7 .
  • The module to be replaced 1220 is used to find, in the subgraph set, an equivalent subgraph of the first subgraph in the neural network model to be optimized; the equivalent subgraph and the first subgraph produce the same output for the same input data, the processing efficiency of the equivalent subgraph on the input data is greater than the processing efficiency of the first subgraph on the input data, and the subgraph set includes a plurality of subgraphs.
  • the module to be replaced 1220 is used to execute step 720 and step 730 in FIG. 7 .
  • the replacement module 1230 is used to replace the first subgraph in the neural network model to be optimized with the equivalent subgraph. For example, the replacement module 1230 is used to execute step 734 in FIG. 7 .
  • The to-be-replaced module 1220 is specifically configured to: determine a second subgraph corresponding to the first subgraph in the subgraph set, where the second subgraph and the first subgraph produce the same output for the same input data; determine that the computing resource used to execute the first subgraph in the computing device has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph; and use the second subgraph as the equivalent subgraph.
  • the storage module 1240 may correspond to storing information such as sub-graph sets and operator sets in the above method embodiments.
  • the neural network model optimization apparatus 1200 may also include a search module 1250 .
  • the search module 1250 is configured to obtain an operator set according to the neural network models of multiple application scenarios; and, according to the operator set, search for the sub-graph with an equivalence relationship, so as to generate the sub-graph set.
  • the search module 1250 is used to execute step 410 to step 420 in FIG. 4 .
  • the neural network model optimization apparatus 1200 may also include an updating module 1260 .
  • the update module 1260 updates the operator set and the sub-graph set with the newly added operator.
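  • Purely as an illustration of how the modules of the apparatus 1200 divide the work, a skeleton such as the one below can be imagined; the class and method names are assumptions of this sketch and do not appear in the present application.

```python
# Minimal sketch (not from the patent) mirroring the module layout of apparatus 1200.
class NeuralNetworkModelOptimizationApparatus:
    def __init__(self, subgraph_set=None, operator_set=None):
        self.subgraph_set = subgraph_set or {}        # storage module 1240
        self.operator_set = operator_set or set()     # storage module 1240

    def obtain_model(self, source):                   # communication module 1210 (step 710)
        return source()

    def find_equivalent_subgraph(self, model):        # to-be-replaced module 1220 (steps 720/730)
        raise NotImplementedError

    def replace(self, model, first, equivalent):      # replacement module 1230 (step 734)
        raise NotImplementedError

    def search_subgraph_set(self, models):            # search module 1250 (steps 410-420)
        raise NotImplementedError

    def add_operator(self, operator):                 # update module 1260
        self.operator_set.add(operator)

    def deploy(self, model, execution_device):        # communication module 1210 (step 740)
        execution_device(model)
```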
  • The neural network model optimization device 1200 in the embodiment of the present application may be implemented by a graphics processing unit (GPU), a neural network processing unit (NPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex program logic device (complex programmable logical device, CPLD), field-programmable gate array (field-programmable gate array, FPGA), generic array logic (GAL), or any combination thereof.
  • the neural network model optimization method shown in FIG. 4 , FIG. 7 or FIG. 9 can also be realized by software
  • the neural network model optimization device 1200 and its modules can also be software modules.
  • The neural network model optimization device 1200 may correspond to performing the methods described in the embodiments of the present application, and the above-mentioned and other operations and/or functions of the various units in the neural network model optimization device 1200 are respectively intended to implement the corresponding flows of the methods in FIG. 4, FIG. 7 or FIG. 9; for the sake of brevity, details are not repeated here.
  • FIG. 13 is a schematic structural diagram of a computing device 1300 provided in this embodiment.
  • the computing device 1300 includes a processor 1310, a bus 1320, a memory 1330, a memory unit 1350 (also referred to as a main memory unit), and a communication interface 1340.
  • the processor 1310 , memory 1330 , memory unit 1350 and communication interface 1340 are connected through a bus 1320 .
  • The processor 1310 may be a CPU, and the processor 1310 may also be another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the processor can also be a GPU, NPU, microprocessor, ASIC, or one or more integrated circuits used to control the program execution of the solution of this application.
  • the communication interface 1340 is used to realize communication between the computing device 1300 and external devices or devices. In this embodiment, the communication interface 1340 is used for data interaction with other computing devices.
  • Bus 1320 may include a path for communicating information between the components described above (eg, processor 1310, memory unit 1350, and storage 1330).
  • the bus 1320 may also include a power bus, a control bus, a status signal bus, and the like.
  • the various buses are labeled as bus 1320 in the figure.
  • The bus 1320 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like.
  • computing device 1300 may include multiple processors.
  • the processor may be a multi-CPU processor.
  • a processor herein may refer to one or more devices, circuits, and/or computing units for processing data (eg, computer program instructions).
  • The processor 1310 may call the subgraph set stored in the memory 1330, determine, according to the subgraph set, the first subgraph to be replaced in the neural network model to be optimized, and replace the first subgraph with the second subgraph that is equivalent to the first subgraph in the subgraph set, so as to obtain an optimized neural network model; when data is processed based on the computing resources, the processing duration of the optimized neural network model is shorter than the processing duration of the neural network model to be optimized.
  • the computing device 1300 includes only one processor 1310 and one memory 1330 as an example.
  • the processor 1310 and the memory 1330 are respectively used to indicate a type of device or device.
  • the quantity of each type of device or equipment can be determined according to business needs.
  • The memory unit 1350 may correspond to a storage medium for storing information such as the subgraph set in the foregoing method embodiments.
  • the memory unit 1350 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
  • the non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically programmable Erases programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus random access memory (direct rambus RAM).
  • the storage 1330 is used to store data, and may be a solid-state hard disk or a mechanical hard disk.
  • the above-mentioned computing device 1300 may be a general-purpose device or a special-purpose device.
  • the computing device 1300 may be a mobile phone terminal, a tablet computer, a notebook computer, a VR device, an AR device, a mixed reality (Mixed Reality, MR) device or an extended reality (Extended Reality, ER) device, a vehicle terminal, etc., and may also be an edge equipment (eg, a box carrying a chip with processing power), etc.
  • computing device 1300 may also be a server or other devices with computing capabilities.
  • The computing device 1300 may correspond to the neural network model optimization apparatus 1200 in this embodiment, and may correspond to the corresponding execution subject in FIG. 4, FIG. 7 or FIG. 9; the above and other operations and/or functions of each module in the neural network model optimization apparatus 1200 are respectively for realizing the corresponding processes in FIG. 4, FIG. 7 or FIG. 9, and for the sake of brevity, details are not repeated here.
  • the method steps in this embodiment may be implemented by means of hardware, and may also be implemented by means of a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (random access memory, RAM), flash memory, read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM) , PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), register, hard disk, mobile hard disk, CD-ROM or known in the art any other form of storage medium.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may also be a component of the processor.
  • the processor and storage medium can be located in the ASIC. Additionally, the ASIC may reside in a computing device. Certainly, the processor and the storage medium may also exist in the network device or the terminal device as discrete components.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product comprises one or more computer programs or instructions. When the computer program or instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are executed in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable devices.
  • The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program or instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired or wireless means.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium, for example, a floppy disk, a hard disk or a magnetic tape; an optical medium, for example, a digital video disc (DVD); or a semiconductor medium, for example, a solid state drive (SSD).


Abstract

Disclosed are a neural network model optimization method and apparatus, and a computing device, which relate to the field of artificial intelligence. The method comprises: a computing device acquiring a neural network model to be optimized, and matching a sub-graph included in a sub-graph set with a sub-graph possibly formed by an operator included in said neural network model; and if a first sub-graph in said neural network model is matched by means of the computing device, replacing the first sub-graph with a second sub-graph, which is equivalent to the first sub-graph, in the sub-graph set to obtain an optimized neural network model, wherein the optimized neural network model includes the second sub-graph, and the processing efficiency of the second sub-graph on input data is higher than the processing efficiency of the first sub-graph on the input data. In this way, insofar as it is ensured that the precision of processing data by the neural network model is lossless, the duration in which the neural network model processes the data is significantly shortened.

Description

神经网络模型优化方法、装置及计算设备Neural network model optimization method, device and computing equipment
本申请要求于2021年12月31日提交中国专利局、申请号为202111673491.2、发明名称为“神经网络模型优化方法、装置及计算设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202111673491.2 and the title of the invention "Neural Network Model Optimization Method, Device and Computing Equipment" filed with the China Patent Office on December 31, 2021, the entire contents of which are hereby incorporated by reference In this application.
技术领域technical field
本申请涉及人工智能领域,尤其涉及一种神经网络模型优化方法、装置及计算设备。The present application relates to the field of artificial intelligence, in particular to a neural network model optimization method, device and computing equipment.
背景技术Background technique
人工智能(Artificial Intelligence,AI)是利用计算机模拟和扩展人的智能,感知环境、获取知识并使用知识获得结果的理论、方法、技术及应用系统。人工智能技术广泛应用于机器学习(Machine Learning,ML)、自然语言处理、计算机视觉、决策与推理、人机交互、推荐与搜索和AI基础理论等领域。基于神经网络模型处理数据实现识别等应用功能是人工智能应用的关键技术。Artificial Intelligence (AI) is a theory, method, technology and application system that uses computers to simulate and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain results. Artificial intelligence technology is widely used in machine learning (Machine Learning, ML), natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory. Processing data based on neural network models to realize application functions such as recognition is a key technology for artificial intelligence applications.
通常,云侧设备可以采用训练集对神经网络模型进行训练,使神经网络模型具备识别等应用功能,并将神经网络模型部署到至少一个终端(如:智能手机、摄像头、自动驾驶汽车等)。终端利用配置的神经网络模型对获取到的应用数据(如:图像、语音等)进行处理实现识别等应用功能。为了提高神经网络模型处理数据的精度,神经网络模型逐渐呈现结构复杂化和参数量增多的趋势,导致神经网络模型处理数据所需的计算资源算力越来越高,以及处理数据时长越来越长。Usually, the cloud-side device can use the training set to train the neural network model, so that the neural network model has application functions such as recognition, and deploy the neural network model to at least one terminal (such as: smart phones, cameras, self-driving cars, etc.). The terminal uses the configured neural network model to process the acquired application data (such as: image, voice, etc.) to realize application functions such as recognition. In order to improve the accuracy of the neural network model processing data, the neural network model gradually presents a trend of complex structure and increasing parameters, which leads to higher and higher computing resources required by the neural network model to process data, and the processing time of data is getting higher and higher. long.
发明内容Contents of the invention
本申请提供了神经网络模型优化方法、装置及计算设备,由此在确保神经网络模型处理数据的精度的前提下,缩短神经网络模型处理数据的时长。The present application provides a neural network model optimization method, device and computing equipment, thereby shortening the time for the neural network model to process data on the premise of ensuring the accuracy of the neural network model processing data.
第一方面,提供了一种神经网络模型优化方法,方法由计算设备执行。方法包括:计算设备获取到待优化神经网络模型,在子图集中查找待优化神经网络模型中的第一子图的等价子图,将待优化神经网络模型中的第一子图替换为等价子图。等价子图与第一子图针对相同的输入数据,输出也相同,且等价子图对输入数据的处理效率大于第一子图对输入数据的处理效率,子图集中包括多个子图。In a first aspect, a method for optimizing a neural network model is provided, and the method is executed by a computing device. The method includes: the computing device obtains the neural network model to be optimized, searches for an equivalent subgraph of the first subgraph in the neural network model to be optimized in the subgraph set, and replaces the first subgraph in the neural network model to be optimized with an equivalent price submap. The equivalent subgraph and the first subgraph have the same output for the same input data, and the processing efficiency of the equivalent subgraph on the input data is greater than that of the first subgraph on the input data, and the subgraph set includes multiple subgraphs.
如此,对待优化神经网络模型中的子图进行子图替换后,由于等价子图对输入数据的处理效率大于第一子图对输入数据的处理效率,从而,在确保神经网络模型处理数据的精度无损的前提下,显著地缩短神经网络模型处理数据的时长。该方法可以实现自动优化神经网络模型,简单、直观、高效、可扩展性强,只需输入神经网络模型即可快速对神经网络模型完成优化,不需任何数据且精度无损,适用场景广泛。在一些时延敏感场景,如目标识别、自动驾驶、车牌识别、目标检测等场景,本申请实施例提供的神经网络模型优化方法尤其适用,能够有效地提升神经网络模型推理速度,缩短神经网络模型推理耗时,提升用户体验。In this way, after replacing the subgraph in the neural network model to be optimized, since the processing efficiency of the equivalent subgraph to the input data is greater than the processing efficiency of the first subgraph to the input data, thus ensuring that the neural network model processes the data Under the premise of no loss of accuracy, the time for the neural network model to process data is significantly shortened. This method can automatically optimize the neural network model. It is simple, intuitive, efficient, and highly scalable. It only needs to input the neural network model to quickly complete the optimization of the neural network model. It does not require any data and has no loss of accuracy. It is applicable to a wide range of scenarios. In some time-delay-sensitive scenarios, such as target recognition, automatic driving, license plate recognition, target detection and other scenarios, the neural network model optimization method provided by the embodiment of the present application is especially applicable, which can effectively improve the reasoning speed of the neural network model and shorten the time of the neural network model. Reasoning is time-consuming and improves user experience.
在一种可能的实现方式中,在子图集中查找待优化神经网络模型中的第一子图的等价子图包括:在子图集中确定与第一子图对应的第二子图,第二子图与第一子图针对相同的输入数据,输出也相同;确定计算设备中用于执行第一子图的计算资源执行第二子图时的数据处 理效率高于执行第一子图时数据处理效率;将第二子图作为等价子图。In a possible implementation manner, searching for an equivalent subgraph of the first subgraph in the neural network model to be optimized in the subgraph set includes: determining a second subgraph corresponding to the first subgraph in the subgraph set, the first The second subgraph and the first subgraph have the same input data, and the output is also the same; it is determined that the computing resources used to execute the first subgraph in the computing device perform the data processing efficiency of the second subgraph higher than that of executing the first subgraph. Data processing efficiency; treat the second subgraph as an equivalent subgraph.
在另一种可能的实现方式中,在子图集中确定与第一子图对应的第二子图包括:输入输入数据至第一子图,通过计算资源运行第一子图,输出运行结果;输入输入数据至子图集中的至少一个子图,确定与运行结果相同的子图为第二子图。In another possible implementation manner, determining the second subgraph corresponding to the first subgraph in the subgraph set includes: inputting input data into the first subgraph, using computing resources to run the first subgraph, and outputting an operation result; Inputting input data into at least one subgraph in the subgraph set, and determining the subgraph identical to the running result as the second subgraph.
在另一种可能的实现方式中,方法还包括:记录第一子图与第二子图的映射关系至子图集。In another possible implementation manner, the method further includes: recording a mapping relationship between the first subgraph and the second subgraph to the subgraph set.
在另一种可能的实现方式中,子图集中包括第一子图与第二子图的第一映射关系;在子图集中确定与第一子图对应的第二子图包括:根据第一映射关系确定与第一子图对应的第二子图。In another possible implementation manner, the sub-graph set includes a first mapping relationship between the first sub-graph and the second sub-graph; determining the second sub-graph corresponding to the first sub-graph in the sub-graph set includes: according to the first The mapping relationship determines the second subgraph corresponding to the first subgraph.
用子图集指示的具有等价关系的子图,即第一映射关系,与待优化神经网络模型中的子图进行匹配。待优化神经网络模型包含多个算子,多个算子构成多个子图,至少两个算子构成一个子图。若计算设备确定到待优化神经网络模型中待替换的第一子图,用子图集中与第一子图等价的第二子图替换第一子图,得到优化后神经网络模型。优化后神经网络模型包含第二子图。基于计算资源处理数据时,优化后神经网络模型的时长小于待优化神经网络模型的时长。其中,具有等价关系的子图用于根据同一输入数据输出相同结果。The subgraphs with equivalent relationships indicated by the subgraph set, that is, the first mapping relationship, are matched with the subgraphs in the neural network model to be optimized. The neural network model to be optimized includes multiple operators, multiple operators form multiple subgraphs, and at least two operators form a subgraph. If the computing device determines the first subgraph to be replaced in the neural network model to be optimized, replace the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set to obtain an optimized neural network model. The optimized neural network model includes a second subgraph. When processing data based on computing resources, the duration of the optimized neural network model is shorter than the duration of the neural network model to be optimized. Among them, subgraphs with equivalence relations are used to output the same result according to the same input data.
示例地,根据子图特征确定待优化神经网络模型中与子图集包含的子图相同的第一子图。子图特征包含算子类型、子图结构和算子参数。如此,根据子图特征进行子图匹配,从待优化神经网络模型中搜索出与子图集包含的子图相同的子图,能够有效地提高搜索的准确性。Exemplarily, the first subgraph in the neural network model to be optimized that is the same as the subgraph included in the subgraph set is determined according to the subgraph features. Subgraph features include operator type, subgraph structure and operator parameters. In this way, subgraph matching is performed according to subgraph features, and a subgraph identical to a subgraph included in the subgraph set is searched out from the neural network model to be optimized, which can effectively improve the accuracy of the search.
在另一种可能的实现方式中,在根据子图集确定待优化神经网络模型中待替换的第一子图之前,方法还包括:根据多个应用场景的神经网络模型获取算子集;根据算子集搜索具有等价关系的子图,以生成子图集。从而,提供了一种等价子图自动搜索方法,基于算子集自动搜索可能的等价子图,无遗漏,节省人力。In another possible implementation, before determining the first subgraph to be replaced in the neural network model to be optimized according to the subgraph set, the method further includes: obtaining the operator set according to the neural network model of multiple application scenarios; The set of operators searches subgraphs with equivalence relations to generate subgraph sets. Therefore, a method for automatically searching equivalent subgraphs is provided, which automatically searches possible equivalent subgraphs based on operator sets, without omission, and saves manpower.
如此,基于等价子图对待优化神经网络模型中的子图进行子图替换,由于优化后神经网络模型中替换后的子图亲和利用优化后神经网络模型处理数据的计算资源,即计算资源计算替换后的子图能够有效利用计算资源的运算能力,从而,在确保神经网络模型处理数据的精度无损的前提下,显著地缩短神经网络模型处理数据的时长。该方法可以实现自动优化神经网络模型,简单、直观、高效、可扩展性强,只需输入神经网络模型即可快速对神经网络模型完成优化,不需任何数据且精度无损,适用场景广泛。在一些时延敏感场景,如目标识别、自动驾驶、车牌识别、目标检测等场景,本申请实施例提供的神经网络模型优化方法尤其适用,能够有效地提升神经网络模型推理速度,缩短神经网络模型推理耗时,提升用户体验。In this way, the subgraph in the neural network model to be optimized is replaced based on the equivalent subgraph. Since the replaced subgraph in the optimized neural network model is compatible with the computing resources of the optimized neural network model to process data, that is, computing resources Calculating the replaced subgraph can effectively utilize the computing power of computing resources, thereby significantly shortening the processing time of the neural network model on the premise of ensuring that the accuracy of the data processed by the neural network model is not lost. This method can automatically optimize the neural network model. It is simple, intuitive, efficient, and highly scalable. It only needs to input the neural network model to quickly complete the optimization of the neural network model. It does not require any data and has no loss of accuracy. It is applicable to a wide range of scenarios. In some time-delay-sensitive scenarios, such as target recognition, automatic driving, license plate recognition, target detection and other scenarios, the neural network model optimization method provided by the embodiment of the present application is especially applicable, which can effectively improve the reasoning speed of the neural network model and shorten the time of the neural network model. Reasoning is time-consuming and improves user experience.
在计算设备得到优化后神经网络模型后,可以将优化后神经网络模型部署到至少一个终端,使终端根据优化后神经网络模型处理应用数据时,缩短数据处理时长,提升了终端数据处理性能。After the computing device obtains the optimized neural network model, the optimized neural network model can be deployed to at least one terminal, so that when the terminal processes application data according to the optimized neural network model, the data processing time is shortened and the terminal data processing performance is improved.
其中,用子图集中与第一子图等价的第二子图替换第一子图,包括:若计算资源计算第二子图的算力高于计算资源计算第一子图的算力,用第二子图替换第一子图。计算资源计算子图的算力与计算资源计算子图的时长有关。算力可以是计算资源计算第一子图的数据处理效率。Wherein, replacing the first sub-graph with the second sub-graph equivalent to the first sub-graph in the sub-graph set includes: if the computing power of the computing resources for computing the second sub-graph is higher than the computing power of computing resources for computing the first sub-graph, Replace the first subgraph with the second subgraph. The computing power of the computing resource computing subgraph is related to the duration of the computing resource computing subgraph. The computing power may be the data processing efficiency of computing resources for computing the first subgraph.
在另一种可能的实现方式中,确定计算设备中用于执行第一子图的计算资源执行第二子图时的数据的处理效率高于执行第一子图时的数据处理效率包括:计算资源调用代价函数运行第一子图,记录第一数据处理效率;计算资源调用代价函数运行第二子图,记录第二数据处理效率;通过比较第一数据处理效率与第二数据处理效率确定执行第二子图时的数据的处 理效率高于执行第一子图时的数据处理效率。In another possible implementation manner, determining that the computing resources used to execute the first subgraph in the computing device have higher data processing efficiency when executing the second subgraph than the data processing efficiency when executing the first subgraph includes: computing The resource call cost function runs the first subgraph to record the first data processing efficiency; calculates the resource call cost function to run the second subgraph to record the second data processing efficiency; determines the execution by comparing the first data processing efficiency with the second data processing efficiency The data processing efficiency of the second sub-graph is higher than the data processing efficiency of the execution of the first sub-graph.
示例地,用子图集中与第一子图等价的第二子图替换第一子图,包括:基于计算资源利用代价函数分别确定第二子图的时长和第一子图的时长,代价函数用于基于同一计算资源计算具有等价关系的子图的时长;若第二子图的时长小于第一子图的时长,用第二子图替换第一子图。Exemplarily, replacing the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set includes: respectively determining the duration of the second subgraph and the duration of the first subgraph based on the calculation resource utilization cost function, and the cost The function is used to calculate the duration of subgraphs with an equivalence relationship based on the same computing resource; if the duration of the second subgraph is less than the duration of the first subgraph, replace the first subgraph with the second subgraph.
如此,基于代价函数衡量子图替换前后对神经网络模型推理性能的影响,自动决策是否进行子图替换,提高子图替换的准确性,从而,能够保证基于当前硬件平台子图替换能够带来性能提升。应理解,当前硬件平台可以是指利用优化后神经网络模型处理数据的计算资源。In this way, based on the cost function, the influence of subgraph replacement on the inference performance of the neural network model can be measured, and whether to perform subgraph replacement can be automatically decided to improve the accuracy of subgraph replacement, thereby ensuring that subgraph replacement based on the current hardware platform can bring performance promote. It should be understood that the current hardware platform may refer to computing resources for processing data using an optimized neural network model.
在另一种可能的实现方式中,方法还包括:记录计算资源、第一子图、第二子图的映射关系至子图集。In another possible implementation manner, the method further includes: recording the mapping relationship of the computing resource, the first sub-graph, and the second sub-graph to the sub-graph set.
在另一种可能的实现方式中,子图集中包括计算资源、第一子图、第二子图的第二映射关系;在子图集中确定与第一子图对应的第二子图包括:根据第二映射关系确定第一子图对应的第二子图;确定计算设备中用于执行第一子图的计算资源执行第二子图时的数据处理效率高于执行第一子图时数据处理效率包括:根据第二映射关系确定用于执行第一子图的计算资源执行第二子图时的数据处理效率高于执行第一子图时数据处理效率。In another possible implementation manner, the sub-graph set includes a second mapping relationship between computing resources, the first sub-graph, and the second sub-graph; determining the second sub-graph corresponding to the first sub-graph in the sub-graph set includes: Determine the second subgraph corresponding to the first subgraph according to the second mapping relationship; determine that the computing resource used to execute the first subgraph in the computing device executes the data processing efficiency of the second subgraph is higher than the data processing efficiency when executing the first subgraph The processing efficiency includes: determining according to the second mapping relationship that the computing resource used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph.
示例地,子图集还用于指示计算资源计算第二子图和第一子图的算力对应关系。用子图集中与第一子图等价的第二子图替换第一子图,包括:根据算力对应关系所确定的第二子图替换第一子图。从而,由于算力对应关系已经指示了等价子图对计算资源亲和关系,基于算力对应关系对待优化神经网络模型进行子图替换,可以有效地提高子图替换的速度,节省子图替换的时长。Exemplarily, the sub-graph set is also used to instruct computing resources to calculate the computing power correspondence between the second sub-graph and the first sub-graph. Replacing the first subgraph with a second subgraph equivalent to the first subgraph in the subgraph set includes: replacing the first subgraph with the second subgraph determined according to the computing power correspondence. Therefore, since the computing power correspondence has already indicated the affinity between equivalent subgraphs and computing resources, the subgraph replacement of the neural network model to be optimized based on the computing power correspondence can effectively improve the speed of subgraph replacement and save subgraph replacement. duration.
在另一种可能的实现方式中,在对待优化神经网络模型进行优化过程中,计算设备还可自动导出当前硬件平台亲和的等价子图的算力对应关系,任何命中算力对应关系中等价子图的神经网络模型,根据本申请实施例提供的方法,在对应的硬件平台上均可获得推理性能提升,也可根据算力对应关系指导下一代硬件亲和的神经网络模型结构设计。In another possible implementation, during the optimization process of the neural network model to be optimized, the computing device can also automatically derive the computing power corresponding relationship of the equivalent subgraph that is compatible with the current hardware platform, and any hit computing power corresponding relationship is medium The neural network model of the valence subgraph, according to the method provided by the embodiment of the present application, can obtain reasoning performance improvement on the corresponding hardware platform, and can also guide the structural design of the next-generation hardware-friendly neural network model according to the corresponding relationship of computing power.
在另一种可能的实现方式中,计算设备更新算子集,即在算子集中增加新的算子,根据更新后算子集进行等价子图搜索,得到更新后等价子图。In another possible implementation manner, the computing device updates the operator set, that is, adds a new operator to the operator set, performs an equivalent subgraph search according to the updated operator set, and obtains an updated equivalent subgraph.
第二方面,提供了一种神经网络模型优化装置,所述装置包括用于执行第一方面或第一方面任一种可能设计中的神经网络模型优化方法的各个模块。In a second aspect, a neural network model optimization device is provided, and the device includes various modules for executing the neural network model optimization method in the first aspect or any possible design of the first aspect.
第三方面,提供了一种处理器,所述处理器用于执行第一方面或第一方面任一种可能设计中的神经网络模型优化方法的操作步骤。A third aspect provides a processor, the processor is configured to execute the operation steps of the neural network model optimization method in the first aspect or any possible design of the first aspect.
第四方面,提供一种计算设备,该计算设备包括至少一个处理器和存储器,存储器用于存储一组计算机指令;当处理器作为第一方面或第一方面任一种可能实现方式中的执行设备执行所述一组计算机指令时,执行第一方面或第一方面任一种可能实现方式中的神经网络模型优化方法的操作步骤。In a fourth aspect, there is provided a computing device, the computing device includes at least one processor and a memory, and the memory is used to store a set of computer instructions; When the device executes the set of computer instructions, it executes the first aspect or the operation steps of the neural network model optimization method in any possible implementation manner of the first aspect.
第五方面,提供一种计算机可读存储介质,包括:计算机软件指令;当计算机软件指令在计算设备中运行时,使得计算设备执行如第一方面或第一方面任意一种可能的实现方式中所述方法的操作步骤。In a fifth aspect, there is provided a computer-readable storage medium, including: computer software instructions; when the computer software instructions are run in the computing device, the computing device is made to execute the computer program described in the first aspect or any one of the possible implementation manners of the first aspect. Operational steps of the method.
第六方面,提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算设备执行如第一方面或第一方面任意一种可能的实现方式中所述方法的操作步骤。In a sixth aspect, a computer program product is provided. When the computer program product is run on a computer, the computing device executes the operation steps of the method described in the first aspect or any possible implementation manner of the first aspect.
第七方面,提供一种芯片系统,该芯片系统包括处理器,用于实现上述第一方面的方法中处理器的功能。在一种可能的设计中,所述芯片系统还包括存储器,用于保存程序指令和/ 或数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。In a seventh aspect, a chip system is provided, and the chip system includes a processor, configured to implement the functions of the processor in the method of the first aspect above. In a possible design, the chip system further includes a memory for storing program instructions and/or data. The system-on-a-chip may consist of chips, or may include chips and other discrete devices.
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。On the basis of the implementation manners provided in the foregoing aspects, the present application may further be combined to provide more implementation manners.
附图说明Description of drawings
图1为本申请提供的一种神经网络的结构示意图;Fig. 1 is the structural representation of a kind of neural network provided by the present application;
图2为本申请提供的一种卷积神经网络的结构示意图;Fig. 2 is a schematic structural diagram of a convolutional neural network provided by the present application;
图3为本申请提供的一种系统架构示意图;FIG. 3 is a schematic diagram of a system architecture provided by the present application;
图4为本申请提供的一种生成子图集的方法示意图;FIG. 4 is a schematic diagram of a method for generating a sub-atlas provided by the present application;
图5为本申请提供的一种生成算子集的示意图;FIG. 5 is a schematic diagram of a generation operator set provided by the present application;
图6为本申请提供的一种生成等价子图关系的示意图;FIG. 6 is a schematic diagram of generating an equivalent subgraph relationship provided by the present application;
图7为本申请实施例提供的一种神经网络模型优化方法的示意图;FIG. 7 is a schematic diagram of a neural network model optimization method provided in an embodiment of the present application;
图8为本申请实施例提供的一种子图替换的示意图;FIG. 8 is a schematic diagram of sub-graph replacement provided by the embodiment of the present application;
图9为本申请实施例提供的另一种神经网络模型优化方法的示意图;FIG. 9 is a schematic diagram of another neural network model optimization method provided in the embodiment of the present application;
图10为本申请提供的一种生成算力对应关系的示意图;FIG. 10 is a schematic diagram of a generated computing power correspondence provided by the present application;
图11为本申请提供的一种神经网络模型优化的场景示意图;FIG. 11 is a schematic diagram of a neural network model optimization scenario provided by the present application;
图12为本申请提供的一种神经网络模型优化装置的结构示意图;FIG. 12 is a schematic structural diagram of a neural network model optimization device provided by the present application;
图13为本申请提供的一种计算设备的结构示意图。FIG. 13 is a schematic structural diagram of a computing device provided in the present application.
具体实施方式Detailed ways
为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。For ease of understanding, the following first introduces related terms and neural network related concepts involved in the embodiments of the present application.
(1)神经网络(1) neural network
神经网络可以是由神经元组成的,神经元可以是指以x s和截距1为输入的运算单元。该运算单元的输出满足如下公式(1)。 A neural network can be made up of neurons, which can refer to operational units that take x s and intercept 1 as input. The output of the arithmetic unit satisfies the following formula (1).
$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)\qquad(1)$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经元的偏置。f为神经元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经元联结在一起形成的网络,即一个神经元的输出可以是另一个神经元的输入。每个神经元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经元组成的区域。权重表征不同神经元之间连接的强度。权重决定着输入对输出的影响力。权重近于0意味着改变输入不改变输出。负权重意味着增加输入降低输出。 Among them, s=1, 2, ... n, n is a natural number greater than 1, W s is the weight of x s , and b is the bias of the neuron. f is the activation function of the neuron, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neuron into an output signal. The output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple above-mentioned single neurons, that is, the output of one neuron can be the input of another neuron. The input of each neuron can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neurons. Weights characterize the strength of connections between different neurons. The weight determines the influence of the input on the output. A weight close to 0 means that changing the input does not change the output. Negative weights mean that increasing the input decreases the output.
FIG. 1 is a schematic structural diagram of a neural network provided by an embodiment of the present application. The neural network 100 includes N processing layers, where N is an integer greater than or equal to 3. The first layer of the neural network 100 is the input layer 110, which is responsible for receiving input signals, and the last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing results of the neural network. The layers other than the first and last layers are intermediate layers 140, and these intermediate layers 140 together form the hidden layer 120. Each intermediate layer 140 in the hidden layer 120 can both receive and output signals. The hidden layer 120 is responsible for processing the input signals. Each layer represents one logical level of signal processing, so that through multiple layers a data signal can be processed by multiple levels of logic.

In some feasible embodiments, the input signal of the neural network may be a signal in various forms, such as a video signal, a voice signal, a text signal, an image signal or a temperature signal. The image signal may be, for example, scenery captured by a camera (image sensor), an environment image captured by monitoring equipment, or a facial image acquired by an access control system. The input signal of the neural network also includes various other computer-processable engineering signals, which are not listed one by one here. If the neural network is used to perform deep learning on an image signal, the quality of the image processed by the neural network can be improved.
(2) Deep neural network

A deep neural network (Deep Neural Network, DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing a deep neural network according to the positions of its layers, the layers inside the deep neural network can be classified into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
Although a deep neural network looks complicated, the work of each layer is actually not complicated. Simply put, each layer computes the following linear relationship expression:

$\vec{y} = \alpha(W\vec{x} + \vec{b})$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, W is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$.

Because a deep neural network has many layers, there are also many coefficients W and bias vectors $\vec{b}$. These parameters are defined in the deep neural network as follows, taking the coefficient W as an example. Assume a three-layer deep neural network, in which the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.

In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W_{jk}^{L}$.
It should be noted that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices; the ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
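Purely as an illustration of the per-layer expression above (a sketch, not part of the claimed method), the following NumPy snippet runs an input vector through two fully connected layers; the layer sizes, the random weights and the use of ReLU as the activation are assumptions made for the example.

```python
import numpy as np

def dense_layer(x, W, b):
    # One layer of a deep neural network: y = alpha(W x + b), here with ReLU as alpha
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector (layer 1)
W2, b2 = rng.normal(size=(5, 4)), np.zeros(5)   # layer-2 weight matrix and bias vector
W3, b3 = rng.normal(size=(2, 5)), np.zeros(2)   # layer-3 weight matrix and bias vector
y = dense_layer(dense_layer(x, W2, b2), W3, b3)
# W3[1, 3] is the coefficient from the 4th neuron of layer 2 to the 2nd neuron of layer 3
print(y)
```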
(3) Convolutional neural network

A convolutional neural network (Convolutional Neuron Network, CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving a trainable filter with an input image or feature map. A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer can output several feature maps, and a feature map may refer to an intermediate result in the operation of the convolutional neural network. Neurons in the same feature map share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of position; that is, the statistics of one part of the image are the same as those of other parts, so image information learned in one part can also be used in another part, and the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.

A convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training of the convolutional neural network. In addition, a direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network while reducing the risk of overfitting.
For example, FIG. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application. The convolutional neural network 200 may include an input layer 210, convolutional layers/pooling layers 220 (where the pooling layers are optional) and a neural network layer 230.

The convolutional layers/pooling layers 220 may include, for example, layers 221 to 226. In one example, layer 221 may be a convolutional layer, layer 222 a pooling layer, layer 223 a convolutional layer, layer 224 a pooling layer, layer 225 a convolutional layer and layer 226 a pooling layer. In another example, layers 221 and 222 may be convolutional layers, layer 223 a pooling layer, layers 224 and 225 convolutional layers, and layer 226 a pooling layer. The output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

Taking the convolutional layer 221 as an example, the internal working principle of a convolutional layer is introduced below.
The convolutional layer 221 may include many convolution operators, and a convolution operator may also be called a kernel. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically moved over the input image along the horizontal direction one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix is related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases multiple weight matrices of the same size (rows × columns), i.e. multiple matrices of the same shape, are applied instead of a single one. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract a specific color of the image, and yet another to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. The weight matrices formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 makes correct predictions.

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (for example, layer 221) often extracts more general features, which may also be called low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example, layer 226) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more applicable to the problem to be solved.
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. For layers 221 to 226 shown as the convolutional layers/pooling layers 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of a pooling layer is to reduce the spatial size of the image. A pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator computes the average of the pixel values within a specific range of the image as the result of average pooling. The maximum pooling operator takes the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the size of the image, the operators in a pooling layer should also be related to the size of the image. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
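The following minimal NumPy sketch (for illustration only) shows non-overlapping 2x2 maximum pooling, in which each output pixel is the maximum of the corresponding 2x2 sub-region of the input.

```python
import numpy as np

def max_pool_2x2(img):
    # Split the image into 2x2 blocks and take the maximum of each block
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(img))   # the 4x4 input is reduced to a 2x2 output
```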
After processing by the convolutional layers/pooling layers 220, the convolutional neural network 200 is still not sufficient to output the required output information, because, as described above, the convolutional layers/pooling layers 220 only extract features and reduce the parameters brought by the input image. However, to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (layers 231, 232 to 23n shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained by pre-training based on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the neural network layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from layer 210 to layer 240 in FIG. 2) is completed, the back propagation (propagation in the direction from layer 240 to layer 210 in FIG. 2) starts to update the weight values and biases of the above-mentioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.

It should be noted that the convolutional neural network 200 shown in FIG. 2 is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.
(4) Loss function

In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value that is really to be predicted, the predicted value of the network can be compared with the really desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, which is an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(5) Back propagation algorithm

A convolutional neural network can use the error back propagation (back propagation, BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal up to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrices.

The above neural network may also be called a neural network model. The intermediate layers contained in a neural network may also be called operators. An operator is used to implement one unit of computation in the neural network. For example, an operator that implements the computation of a convolutional layer may be called a convolution operator (Conv); an operator that implements the computation of a pooling layer may be called a pooling operator (Pool); an operator that implements the computation of an activation layer may be called an activation operator (Relu), which may also be called a linear rectification operator. At least two operators can form a subgraph. A subgraph refers to a network structure composed of some of the intermediate layers in a neural network model.
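The following Python sketch (an assumption made for illustration; the application does not prescribe any particular data structure) shows one simple way to represent operators and a subgraph formed by at least two connected operators.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str                      # e.g. "conv1"
    op_type: str                   # e.g. "Conv", "Relu", "Reshape", "Transpose"
    params: dict                   # operator parameters such as kernel size or output channels
    inputs: list = field(default_factory=list)  # names of the operators feeding this operator

@dataclass
class Subgraph:
    operators: list                # a connected group of at least two operators of the model

# A small subgraph consisting of a convolution operator followed by an activation operator
conv = Operator("conv1", "Conv", {"in": 16, "out": 64, "kernel": 3})
relu = Operator("relu1", "Relu", {}, inputs=["conv1"])
sg = Subgraph([conv, relu])
print([op.op_type for op in sg.operators])
```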
The embodiments of the present application provide a neural network model optimization method, and in particular a technology for optimizing a neural network model based on mutually equivalent subgraphs: mathematically equivalent subgraphs are automatically searched for based on multiple operators, subgraphs that make full use of the computing power of the hardware are automatically discovered, and the equivalent subgraphs in the neural network model are replaced. On the premise of ensuring that the accuracy with which the neural network model processes data is not degraded, the time the neural network model takes to process data is significantly shortened.

The implementation of the embodiments of the present application is described in detail below with reference to the accompanying drawings.
FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application. As shown in FIG. 3, the system 300 includes an execution device 310, a training device 320, a database 330, a terminal device 340, a data storage system 350 and a data collection device 360.

The execution device 310 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a virtual reality (virtual reality, VR) device, an augmented reality (augmented reality, AR) device, a mixed reality (Mixed Reality, MR) device, an extended reality (Extended Reality, ER) device, a camera or a vehicle-mounted terminal, or may be an edge device (for example, a box carrying a chip with processing capability).

The training device 320 may be a server, a cloud device or the like. The training device 320 has strong computing capability and can run a neural network model and perform computations such as training the neural network model.

As a possible embodiment, the execution device 310 and the training device 320 are different processors deployed on different physical devices (such as servers or servers in a cluster). For example, the execution device 310 may be a neural network processing unit (neural network processing unit, NPU), a graphics processing unit (graphic processing unit, GPU), a central processing unit (central processing unit, CPU), another general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (application-specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The training device 320 may be a GPU, an NPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the solution of this application.
The data collection device 360 is used to collect training data and store the training data in the database 330. The training data may be data in at least one form among images, speech and text. For example, the training data includes training images and the targets in the training images.

The training device 320 is used to train the neural network model with the training data until the loss function in the neural network model converges; when the loss function value is smaller than a specific threshold, the training of the neural network model is completed, so that the neural network model reaches a certain accuracy. Alternatively, if all the training data in the database 330 have been used for training, the training of the neural network model is completed, so that the trained neural network model has functions such as recognition or classification. The training device 320 then configures the trained neural network model 301 on the execution device 310. The execution device 310 is used to process application data according to the trained neural network model 301 to implement functions such as recognition.

In some embodiments, the training device 320 may configure the trained neural network model 301 on multiple execution devices 310. Each execution device 310 uses the trained neural network model 301 to implement functions such as recognition or classification.
For example, in an autonomous driving scenario, while a self-driving car drives along a predetermined route, the neural network model is mainly used to recognize road signs, driving reference objects and obstacles on the road in the environment, so as to ensure that the self-driving car drives safely and accurately. Road signs may include graphic road signs or text road signs. Driving reference objects may be buildings or plants. Obstacles on the road may include dynamic objects (such as animals) or stationary objects (such as stationary vehicles).

As another example, in a monitoring scenario, the neural network model is mainly used to recognize targets (such as cars and users) in environments such as intersections or campuses.

As another example, in a natural language processing scenario, the neural network model is mainly used to recognize speech or text.
To improve the accuracy with which the neural network model processes data, the training device 320 may also iteratively train the neural network model based on the training data maintained by the database 330 and the application data provided by the execution device 310. Understandably, iterative training refers to any training performed after the first training of the neural network model. The training data maintained by the database 330 may be a full training set containing application data acquired in different application scenarios, and data of different application scenarios have different or dissimilar scenario characteristics (such as environmental characteristics and time characteristics). If the training set used by the training device 320 to iteratively train the neural network model contains data with different or dissimilar application scenario characteristics, it is actually difficult for the neural network model to achieve good accuracy when processing the application data of the different application scenarios.

According to the neural network model optimization method provided in the embodiments of the present application, after the training device 320 finishes training the neural network model and before it deploys the trained neural network model 301 to the execution device 310, the training device 320 optimizes the trained neural network model 301 based on the subgraph set: it determines a subgraph equivalent to a subgraph in the neural network model 301, replaces the subgraph in the neural network model 301 based on the data processing efficiency of the execution device 310 in computing the two equivalent subgraphs, and deploys the optimized neural network model 301 to the execution device 310. Thus, on the premise of ensuring that the accuracy with which the execution device 310 processes data based on the neural network model 301 is not degraded, the time the neural network model 301 takes to process data is significantly shortened.

It should be noted that, in practical applications, the training data maintained in the database 330 does not necessarily all come from the data collection device 360 and may also be received from other devices. In addition, the training device 320 does not necessarily train the neural network model entirely based on the training data maintained by the database 330; it may also obtain training data from the cloud or elsewhere to train the neural network model. The above description should not be taken as a limitation on the embodiments of the present application.
Further, according to the functions performed by the execution device 310, the execution device 310 can be further subdivided into the architecture shown in FIG. 3. As shown in the figure, the execution device 310 is configured with a computing module 311, an I/O interface 312 and a preprocessing module 313.

The I/O interface 312 is used for data interaction with external devices. A user may input data to the I/O interface 312 through the terminal device 340. The input data may include images or videos. In addition, the input data may also come from the database 330.

The preprocessing module 313 is configured to perform preprocessing on the input data received by the I/O interface 312.

When the execution device 310 preprocesses the input data, or when the computing module 311 of the execution device 310 performs computation or other related processing, the execution device 310 may call data, code and the like in the data storage system 350 for the corresponding processing, and may also store the data, instructions and the like obtained by the corresponding processing in the data storage system 350.
For example, the optimized neural network model stored by the execution device 310 may be applied on the execution device 310. After the execution device 310 obtains application data, the computing module 311 inputs the application data into the optimized neural network model to obtain a processing result. Since the optimized neural network model is a model optimized by the training device 320 according to the subgraph set, using the optimized neural network model to process the application data can meet the user's requirements on both the accuracy and the duration of data processing.

Finally, the I/O interface 312 returns the processing result to the terminal device 340 so as to provide it to the user, so that the user can view the processing result.

In the situation shown in FIG. 3, the user may manually specify the input data, and this manual specification may be operated through an interface provided by the I/O interface 312. In another case, the terminal device 340 may automatically send input data to the I/O interface 312; if the user's authorization is required for the terminal device 340 to automatically send the input data, the user may set the corresponding permission in the terminal device 340. The user may view, on the terminal device 340, the processing result output by the execution device 310, and the specific presentation form may be display, sound, action or another specific manner. The terminal device 340 may also serve as a data collection terminal, collecting the input data input to the I/O interface 312 and the processing result output by the I/O interface 312 as shown in the figure as new sample data, and storing them in the database 330. Of course, the collection may also be performed without the terminal device 340; instead, the I/O interface 312 stores the input data input to the I/O interface 312 and the processing result output by the I/O interface 312 as shown in the figure in the database 330 as new sample data.

FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules and the like shown in FIG. 3 do not constitute any limitation. For example, in FIG. 3 the data storage system 350 is an external memory relative to the execution device 310; in other cases, the data storage system 350 may also be placed in the execution device 310.
The computing device obtains the neural network model to be optimized, searches the subgraph set for an equivalent subgraph of a first subgraph in the neural network model to be optimized, and replaces the first subgraph in the neural network model to be optimized with the equivalent subgraph. For the same input data, the equivalent subgraph and the first subgraph produce the same output, and the processing efficiency of the equivalent subgraph on the input data is greater than that of the first subgraph. The subgraph set includes multiple subgraphs.
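The following Python sketch illustrates this overall flow in a drastically simplified form: the model is represented as a plain sequence of operator types rather than a graph, the equivalence relationships and the measured processing times are given as small illustrative dictionaries, and a first subgraph is replaced only when its equivalent second subgraph is faster. All names and numbers are assumptions made for the example.

```python
# Equivalent-subgraph pairs (first subgraph -> second subgraph) and illustrative
# processing times of each subgraph on the target computing resource.
equivalents = {("Conv", "Conv", "Concat"): ("Conv",)}
latency = {("Conv", "Conv", "Concat"): 2.1, ("Conv",): 1.2}

def optimize(ops):
    """Scan the operator sequence; replace a matched first subgraph with its
    equivalent second subgraph when the second subgraph is faster."""
    out, i = [], 0
    while i < len(ops):
        for first_sg, second_sg in equivalents.items():
            if tuple(ops[i:i + len(first_sg)]) == first_sg and latency[second_sg] < latency[first_sg]:
                out.extend(second_sg)
                i += len(first_sg)
                break
        else:
            out.append(ops[i])
            i += 1
    return out

print(optimize(["Relu", "Conv", "Conv", "Concat", "Reshape"]))
# -> ['Relu', 'Conv', 'Reshape']
```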
Next, the neural network model optimization provided by the embodiments of the present application is described in detail with reference to FIG. 4 to FIG. 10. FIG. 4 is a schematic diagram of a method for generating a subgraph set provided by an embodiment of the present application. The training device 320 in FIG. 3 is taken as an example for illustration. As shown in FIG. 4, the method includes the following steps.

Step 410: the training device 320 obtains an operator set according to neural network models of multiple application scenarios.
The training device 320 extracts operators from neural network models applied to different application scenarios and removes duplicate operators to form an operator set. The operator set contains multiple operators, and each operator is used to implement a different computing function. For example, the operator set includes a linear rectification operator (Relu), a matrix transformation operator (Reshape), a convolution operator (Conv), a pooling operator (Pool), a maximum pooling operator (Maxpool), a matrix transposition operator (Transpose) and the like. Application scenarios include, but are not limited to, target recognition and autonomous driving scenarios. The neural network model described in the embodiments of the present application may be a mainstream computer vision (computer vision, CV) model. CV models include, for example, YOLO, AlexNet, the residual network (Residual Network, ResNet) and the dense convolutional network (Dense Convolutional Network, DenseNet).

For example, the neural network model shown in (a) of FIG. 5 contains 9 operators, including 3 linear rectification operators, 3 convolution operators, 2 matrix transformation operators and 1 matrix transposition operator. As shown in (b) of FIG. 5, the training device 320 removes the duplicate operators among the 9 operators, and the obtained operator set includes a linear rectification operator, a convolution operator, a matrix transformation operator and a matrix transposition operator.
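A minimal Python sketch of this deduplication step (for illustration only; representing each model simply as a list of operator type names is an assumption of the example):

```python
def operator_set(models):
    # Collect the distinct operator types appearing in the given models, removing duplicates
    ops = set()
    for model in models:
        ops.update(model)
    return ops

# Operator types of a model like the one in (a) of FIG. 5, plus a second illustrative model
model_a = ["Relu", "Conv", "Reshape", "Relu", "Conv", "Transpose", "Reshape", "Conv", "Relu"]
model_b = ["Conv", "Maxpool", "Relu"]
print(operator_set([model_a, model_b]))
# e.g. {'Relu', 'Conv', 'Reshape', 'Transpose', 'Maxpool'}
```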
Optionally, the training device 320 may also obtain the operator set according to a neural network model given by the user, or the training device 320 obtains an operator set given by the user.

Step 420: the training device 320 searches for subgraphs having an equivalence relationship according to the operator set, so as to generate the subgraph set.

The training device 320 constructs multiple legal subgraphs by arranging and combining the operators contained in the operator set, or constructs multiple legal subgraphs by arranging and combining operators according to characteristics such as the number of operators, operator types and operator parameters. A legal subgraph may mean a subgraph in which the output of any operator conforms to the input of the operator connected to it.

The training device 320 then searches the legal subgraphs for subgraphs having an equivalence relationship, so as to generate the subgraph set, which includes multiple pairs of subgraphs having an equivalence relationship. For example, the training device 320 uses various methods, including but not limited to subgraph hashing, comparing outputs on random test cases, and mathematical equivalence analysis, to judge whether any two legal subgraphs are equivalent; if two legal subgraphs are equivalent, a pair of mutually equivalent subgraphs is output. This step is repeated to search the multiple legal subgraphs for mutually equivalent subgraphs, forming the subgraph set. Understandably, the training device 320 can generate a first mapping relationship according to step 410 and step 420, and the training device 320 can determine the subgraph to be replaced in the model to be optimized according to the first mapping relationship. Subgraphs having an equivalence relationship output the same result for the same input data, that is, inputting the same input data into two subgraphs having an equivalence relationship produces the same output data. It should be noted that two mutually equivalent subgraphs may contain different operator types and have different subgraph structures, but the operator parameters need to be the same.
For example, as shown in FIG. 6, assume that the operator set contains a linear rectification operator, a convolution operator, a matrix transformation operator, a matrix transposition operator and a string concatenation operator (Concat). The training device 320 searches this operator set and finds that subgraph 1 is equivalent to subgraph 2, that is, the two convolution operators in subgraph 1 can be merged into one convolution operator. The operator parameters of the two convolution operators in subgraph 1 are the same, namely an output dimension of 64, an input dimension of 16 and a convolution kernel of 3. The operator parameters of the convolution operator in subgraph 2 are an output dimension of 128, an input dimension of 16 and a convolution kernel of 3. However, since subgraph 1 contains more operators than subgraph 2, the execution device 310 needs more steps to compute subgraph 1 than to compute subgraph 2, and the time the execution device 310 takes to compute subgraph 1 is longer than the time it takes to compute subgraph 2.
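The merge shown in FIG. 6 can be checked with the random-test-case comparison mentioned in step 420. The following PyTorch sketch (for illustration only; the use of PyTorch and the tensor sizes are assumptions of the example) builds subgraph 1 as two 3x3 convolutions with 64 output channels each followed by a concatenation, builds subgraph 2 as a single 3x3 convolution with 128 output channels using the stacked weights, and verifies that both produce the same output for the same random input.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)        # random test input with 16 channels
w1 = torch.randn(64, 16, 3, 3)        # weights of the first convolution operator
w2 = torch.randn(64, 16, 3, 3)        # weights of the second convolution operator

# Subgraph 1: two convolution operators followed by a Concat operator
out_subgraph1 = torch.cat([F.conv2d(x, w1, padding=1),
                           F.conv2d(x, w2, padding=1)], dim=1)
# Subgraph 2: a single convolution operator whose weights are the stacked weights
out_subgraph2 = F.conv2d(x, torch.cat([w1, w2], dim=0), padding=1)

# Same input data, same output data: the two subgraphs are equivalent,
# but subgraph 2 requires fewer operator computations on the execution device
print(torch.allclose(out_subgraph1, out_subgraph2, atol=1e-4))
```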
Optionally, the training device 320 may also optimize the multiple legal subgraphs and delete redundant paths in the legal subgraphs, which improves the accuracy of computing the subgraphs and shortens the time for computing the subgraphs. For example, the training device 320 optimizes the multiple legal subgraphs based on a pruning algorithm.

Compared with mutually equivalent subgraphs designed according to the experience of domain experts, automatically searching for mutually equivalent subgraphs according to the operator set, as provided in the embodiments of the present application, effectively saves manpower and can cover all possible mutually equivalent subgraphs.

When the computing device performs subgraph replacement on a subgraph in the neural network model to be optimized, it determines, in the subgraph set, a second subgraph corresponding to the first subgraph; for the same input data, the second subgraph and the first subgraph produce the same output. If the computing device determines that the data processing efficiency of the computing resource in the computing device used to execute the first subgraph is higher when executing the second subgraph than when executing the first subgraph, the second subgraph is taken as the equivalent subgraph, and the first subgraph is replaced with the second subgraph, thereby realizing the optimization of the neural network model to be optimized. The neural network model optimization method is described in detail below with reference to FIG. 7 to FIG. 10.
FIG. 7 is a schematic diagram of a neural network model optimization method provided by an embodiment of the present application. The training device 320 in FIG. 3 is taken as an example for illustration. As shown in FIG. 7, the method includes the following steps.

Step 710: the training device 320 obtains the neural network model to be optimized.

The training device 320 may obtain the neural network model to be optimized from open-source models on the Internet. Alternatively, the training device 320 takes a neural network model provided by the user as the neural network model to be optimized. Alternatively, the training device 320 takes a trained neural network model obtained by its own training as the neural network model to be optimized.

The neural network model to be optimized contains multiple operators, the multiple operators form multiple subgraphs, and at least two operators form one subgraph. It can be understood that at least two consecutive operators in the neural network model to be optimized form one subgraph, and different subgraphs in the neural network model to be optimized may be formed by different consecutive operators.
Step 720: the training device 320 determines, according to the subgraph set, the first subgraph to be replaced in the neural network model to be optimized.

The training device 320 determines, according to subgraph features, a first subgraph in the neural network model to be optimized that is the same as a subgraph contained in the subgraph set. The subgraph features include the operator types, the subgraph structure and the operator parameters. The operator type refers to the kind of operator contained in the subgraph; for example, operator types include convolution, matrix transformation, matrix transposition and linear rectification. The subgraph structure refers to the way the operators contained in the subgraph are connected. The operator parameters refer to parameters such as the weights of the operators contained in the subgraph.

In some embodiments, the training device 320 matches the subgraphs contained in the subgraph set against the subgraphs contained in the neural network model to be optimized. If the operator types, subgraph structure and operator parameters of two subgraphs are all the same, it is determined that the subgraph contained in the neural network model to be optimized is the same as the subgraph contained in the subgraph set, that is, the training device 320 has determined one subgraph to be replaced in the neural network model to be optimized. The training device 320 traverses the subgraphs in the subgraph set and determines all possible subgraphs to be replaced in the neural network model to be optimized.
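A simple Python sketch of this matching (illustrative only; representing a subgraph as a list of (operator type, operator parameters, input indices) tuples is an assumption of the example):

```python
def same_subgraph(sg_a, sg_b):
    # Two subgraphs match when their operator types, connection structure (input indices)
    # and operator parameters are all the same
    if len(sg_a) != len(sg_b):
        return False
    return all(type_a == type_b and params_a == params_b and inputs_a == inputs_b
               for (type_a, params_a, inputs_a), (type_b, params_b, inputs_b) in zip(sg_a, sg_b))

subgraph_from_set = [("Conv", {"in": 16, "out": 64, "kernel": 3}, []),
                     ("Relu", {}, [0])]
subgraph_from_model = [("Conv", {"in": 16, "out": 64, "kernel": 3}, []),
                       ("Relu", {}, [0])]
print(same_subgraph(subgraph_from_set, subgraph_from_model))  # True -> a first subgraph to replace
```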
In one example, the subgraphs having an equivalence relationship may be presented in the form of a table, as shown in Table 1.
Table 1

Equivalent subgraph identifier | Equivalent subgraphs
Equivalent subgraph 1 | Subgraph 1 <-> Subgraph 2
Equivalent subgraph 2 | Subgraph 3 <-> Subgraph 4
It can be seen from Table 1 that subgraph 1 is equivalent to subgraph 2, and subgraph 3 is equivalent to subgraph 4.

It should be noted that Table 1 only illustrates, in the form of a table, a storage form of the subgraphs having an equivalence relationship in the storage device, and does not limit the storage form of this correspondence in the storage device; of course, the correspondence may also be stored in the storage device in other forms, which is not limited in this embodiment.

Assume that the training device 320 determines, according to the subgraph set, the first subgraph to be replaced in the neural network model to be optimized.
Step 730: the training device 320 replaces the first subgraph with a second subgraph in the subgraph set that is equivalent to the first subgraph, to obtain the optimized neural network model.

Depending on the characteristics of a computing resource (such as a processor), for example the number of processor cores and the hardware structure of the processor, the computing resource has different affinities for operators of different operator types, that is, different computing resources are suited to computing operators of different operator types. Affinity refers to the degree to which a computing resource, when computing an operator, effectively uses the hardware operation capability (computing power for short). Operation ability is one of the basic components of mathematical ability and refers to the ability to perform operations and reason using knowledge about operations to obtain the results of operations.

For example, processor 1 is suited to computing a matrix transformation operator, and processor 2 is suited to computing a matrix transposition operator. If processor 1 computes the matrix transposition operator, processor 1 cannot effectively use its computing power for that operator. Therefore, the computing power of processor 1 when computing the matrix transformation operator is higher than the computing power of processor 1 when computing the matrix transposition operator.
It should be understood that the computing power of a computing resource in computing an operator is related to the time the computing resource takes to compute the operator. If the computing power of the computing resource can be used effectively when computing the operator, the time for the computing resource to compute the operator is shorter; if the computing power of the computing resource cannot be used effectively when computing the operator, the time for the computing resource to compute the operator is longer.

In some embodiments, after the training device 320 determines, according to the subgraph set, the first subgraph to be replaced in the neural network model to be optimized, it judges, according to the computing power of the execution device 310 in computing the second subgraph and the computing power of the execution device 310 in computing the first subgraph, whether to replace the first subgraph with the second subgraph in the subgraph set that is equivalent to the first subgraph. The execution device 310 may be the device on which the neural network model to be optimized needs to be deployed, that is, the resource that processes application data based on the neural network model to be optimized to implement application functions such as recognition.

If the computing power of the computing resource in computing the second subgraph is higher than the computing power of the computing resource in computing the first subgraph, the first subgraph is replaced with the second subgraph; if the computing power of the computing resource in computing the second subgraph is lower than the computing power of the computing resource in computing the first subgraph, the first subgraph is not replaced with the second subgraph. The computing power of the computing resource in computing the second subgraph may refer to the data processing efficiency of the computing resource in computing the second subgraph, and the computing power of the computing resource in computing the first subgraph may refer to the data processing efficiency of the computing resource in computing the first subgraph.
Manner 1: the training device 320 determining, in the subgraph set, the second subgraph corresponding to the first subgraph includes: determining the second subgraph corresponding to the first subgraph according to a second mapping relationship. Determining that the data processing efficiency of the computing resource in the computing device used to execute the first subgraph is higher when executing the second subgraph than when executing the first subgraph includes: determining, according to the second mapping relationship, that the data processing efficiency of the computing resource used to execute the first subgraph is higher when executing the second subgraph than when executing the first subgraph.

The subgraph set is used to indicate the computing power correspondence of computing resources in computing mutually equivalent subgraphs, that is, the second mapping relationship. The training device 320 may judge, based on the computing power correspondence, whether to perform subgraph replacement. The training device 320 performs step 731, that is, replaces the first subgraph with the second subgraph determined according to the computing power correspondence.

For example, the computing power correspondence characterizes the correspondence among computing resources, neural network models, subgraphs and the computing power of running the subgraphs. The computing power correspondence may be presented in the form of a table, as shown in Table 2.
Table 2 (computing power correspondence among computing resources, neural network models, subgraphs and the computing power of running the subgraphs)
As can be seen from Table 2, computing resource 1, neural network model 1, subgraph 1, subgraph 2, subgraph 3, and subgraph 4 have a computing power correspondence: the computing power of computing resource 1 when computing subgraph 1 based on neural network model 1 is lower than its computing power when computing subgraph 2 based on neural network model 1, and the computing power of computing resource 1 when computing subgraph 3 based on neural network model 1 is lower than its computing power when computing subgraph 4 based on neural network model 1.
For example, assume that the neural network model to be optimized is neural network model 1. The training device 320 determines, according to the computing power correspondence, that the subgraph to be replaced in neural network model 1 is subgraph 1, and that subgraph 1 and subgraph 2 are a pair of mutually equivalent subgraphs. Because the computing power of computing resource 1 when computing subgraph 1 based on neural network model 1 is lower than its computing power when computing subgraph 2 based on neural network model 1, the training device 320 may replace subgraph 1 in neural network model 1 with subgraph 2.
It should be noted that Table 2 merely illustrates, in tabular form, how the computing power correspondence may be stored in a storage device and does not limit that storage form; the computing power correspondence may of course be stored in other forms, which is not limited in this embodiment.
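As a non-authoritative illustration of the first manner, the following Python sketch models the computing power correspondence of Table 2 as a small lookup structure; all names (CorrespondenceEntry, lookup_equivalent, the resource and subgraph labels) are hypothetical and are not defined by this application.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CorrespondenceEntry:
    # One row of a computing power correspondence such as Table 2.
    compute_resource: str   # e.g. "resource-1"
    model: str              # e.g. "model-1"
    slow_subgraph: str      # subgraph with the lower computing power
    fast_subgraph: str      # equivalent subgraph with the higher computing power

# Hypothetical contents matching the Table 2 description above.
CORRESPONDENCES = [
    CorrespondenceEntry("resource-1", "model-1", "subgraph-1", "subgraph-2"),
    CorrespondenceEntry("resource-1", "model-1", "subgraph-3", "subgraph-4"),
]

def lookup_equivalent(resource: str, model: str, first_subgraph: str) -> Optional[str]:
    # Return the recorded equivalent subgraph with higher computing power, if any.
    for entry in CORRESPONDENCES:
        if (entry.compute_resource == resource
                and entry.model == model
                and entry.slow_subgraph == first_subgraph):
            return entry.fast_subgraph
    return None  # no replacement recorded: keep the first subgraph

# Example: on resource-1 running model-1, subgraph-1 would be replaced by subgraph-2.
assert lookup_equivalent("resource-1", "model-1", "subgraph-1") == "subgraph-2"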
In the second manner, the computing power with which a computing resource computes a subgraph is related to the time the computing resource takes to compute the subgraph. The training device 320 may determine whether to perform subgraph replacement by having the computing resource use a cost function. The cost function is used to compute, on the same computing resource, the durations of subgraphs having an equivalence relationship. The input data of the cost function includes the operator types, the subgraph structure, the operator parameters, and the input parameters. The output data of the cost function includes the durations of the equivalent subgraphs when computed on the same computing resource.
For example, the subgraph set includes the first mapping relationship between the first subgraph and the second subgraph. The training device 320 determining, in the subgraph set, the second subgraph corresponding to the first subgraph includes: determining the second subgraph corresponding to the first subgraph according to the first mapping relationship. The training device 320 then executes step 732. In step 732, the duration of the second subgraph and the duration of the first subgraph are determined separately by having the computing resource use the cost function. In step 733, it is judged whether the duration of the second subgraph is greater than the duration of the first subgraph.
If the duration of the second subgraph is less than the duration of the first subgraph, step 734 is executed, that is, the first subgraph is replaced with the second subgraph.
If the duration of the second subgraph is greater than or equal to the duration of the first subgraph, step 735 is executed, that is, the first subgraph is not replaced with the second subgraph.
The optimized neural network model contains the second subgraph, and when data is processed based on the computing resource, the duration of the optimized neural network model is shorter than the duration of the neural network model to be optimized.
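A minimal sketch of steps 732 to 735 under stated assumptions: the Python fragment below compares the durations returned by a cost function and replaces only when the second subgraph is strictly faster. The cost function shown (naive_cost) and the dictionary-based subgraph descriptions are placeholders invented for illustration; the application only specifies that the cost function takes the operator types, subgraph structure, operator parameters, and input parameters and outputs durations.

from typing import Callable

CostFn = Callable[[dict, str], float]  # (subgraph description, resource) -> duration

def should_replace(first_subgraph: dict, second_subgraph: dict,
                   resource: str, cost: CostFn) -> bool:
    # Steps 732-735: replace only if the second subgraph has the shorter duration.
    duration_first = cost(first_subgraph, resource)    # step 732
    duration_second = cost(second_subgraph, resource)  # step 732
    return duration_second < duration_first            # steps 733-735

def naive_cost(subgraph: dict, resource: str) -> float:
    # Placeholder estimate: sum of fixed per-operator costs. A real cost function
    # would model or measure latency on the actual computing resource.
    per_op = {"reshape": 1.0, "transpose": 1.5, "conv": 2.0}
    return sum(per_op.get(op, 1.0) for op in subgraph["ops"])

first = {"ops": ["reshape", "transpose", "reshape"]}
second = {"ops": ["conv"]}
print(should_replace(first, second, "resource-1", naive_cost))  # True in this toy setting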
In the third manner, the subgraph set contains a plurality of subgraphs. The training device 320 may determine, in real time, the subgraph equivalent to a subgraph in the model to be optimized, that is, input the same input data separately into the subgraph in the model to be optimized and into the subgraphs in the subgraph set, and determine subgraphs that produce the same result as mutually equivalent subgraphs. For example, the training device 320 determining, in the subgraph set, the second subgraph corresponding to the first subgraph may further include: step 736, in which the training device 320 inputs the input data into the first subgraph, runs the first subgraph, and outputs a running result; the training device 320 then inputs the input data into at least one subgraph in the subgraph set and determines the subgraph whose result is the same as the running result as the second subgraph. The training device 320 then judges, according to the second manner, whether to replace the first subgraph with the second subgraph.
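A minimal sketch of step 736, assuming subgraphs can be executed as callables on a tensor: the reference result of the first subgraph is compared against each candidate in the subgraph set. Real subgraphs are operator graphs rather than Python functions, and matching on a single input does not by itself prove equivalence in general; the helper names below are hypothetical.

import numpy as np

def find_equivalent(first_subgraph, candidate_subgraphs, input_data, atol=1e-6):
    # Step 736 sketch: run the first subgraph, then each candidate on the same
    # input, and return the first candidate whose result matches.
    reference = first_subgraph(input_data)
    for candidate in candidate_subgraphs:
        if np.allclose(candidate(input_data), reference, atol=atol):
            return candidate
    return None

# Toy example with trivially equivalent "subgraphs" expressed as functions.
first = lambda x: (x * 2.0) + (x * 2.0)
candidates = [lambda x: x * 3.0, lambda x: x * 4.0]
x = np.random.rand(4, 4).astype(np.float32)
match = find_equivalent(first, candidates, x)
print(match is candidates[1])  # True: x * 4 matches 2x + 2x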
For example, as shown in Fig. 8, the neural network model to be optimized includes subgraph 3. Subgraph 3 includes two matrix transformation operators and one matrix transpose operator: one matrix transformation operator is connected to the matrix transpose operator, which is then connected to the other matrix transformation operator. Subgraph 4 includes a convolution operator. Subgraph 3 and subgraph 4 are a pair of mutually equivalent subgraphs. Replacing subgraph 3 with subgraph 4, that is, replacing the shuffle operation used in distributed big data processing with a convolution, yields the optimized neural network model.
In some other embodiments, after the training device 320 determines, according to the computing power correspondence, that the first subgraph is to be replaced with the second subgraph, it may further determine the duration of the second subgraph and the duration of the first subgraph separately by using the cost function. If the duration of the second subgraph is less than the duration of the first subgraph, the first subgraph is replaced with the second subgraph. As shown in Fig. 9, the training device 320 may first execute step 731, then execute steps 732 and 733, and finally step 734 or step 735. In this way, the training device 320 improves the accuracy of subgraph replacement through two judgments.
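A sketch of this two-stage flow (in the style of Fig. 9), reusing the hypothetical lookup and cost helpers sketched above; the wiring below is a toy stand-in, not the application's actual interfaces.

def two_stage_replace(resource, model, first_subgraph, lookup, cost):
    # Step 731: look up a candidate second subgraph from the correspondence.
    second_subgraph = lookup(resource, model, first_subgraph)
    if second_subgraph is None:
        return first_subgraph                       # nothing recorded to replace
    # Steps 732-733: confirm the candidate with the cost function.
    if cost(second_subgraph, resource) < cost(first_subgraph, resource):
        return second_subgraph                      # step 734: replace
    return first_subgraph                           # step 735: keep the original

# Toy wiring: subgraphs are opaque labels, durations are fixed numbers.
toy_lookup = lambda res, mod, sg: {"subgraph-1": "subgraph-2"}.get(sg)
toy_cost = lambda sg, res: {"subgraph-1": 3.0, "subgraph-2": 1.0}.get(sg, 0.0)
print(two_stage_replace("resource-1", "model-1", "subgraph-1", toy_lookup, toy_cost))  # subgraph-2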
In some other embodiments, the computing power correspondence may also be deployed to the execution device 310, and the execution device 310 performs the subgraph replacement optimization on the neural network model to be optimized according to the computing power correspondence. For example, the execution device 310 determines, according to the computing power correspondence, the first subgraph to be replaced in the neural network model to be optimized and a second subgraph equivalent to the first subgraph, and then replaces the first subgraph with the second subgraph to obtain the optimized neural network model.
In some other embodiments, during the optimization of the neural network model to be optimized, the training device 320 may automatically save equivalent subgraphs that bring performance gains on the computing resources, forming the second mapping relationship, that is, a hardware-affine equivalent-subgraph knowledge base. Any model that hits a subgraph in the knowledge base can obtain an inference performance improvement when optimized using the method provided in the embodiments of this application. As shown in (a) of Fig. 10, assuming that the computing power of computing resource 1 when computing subgraph 2 based on neural network model 1 is higher than its computing power when computing subgraph 1 based on neural network model 1, the training device 320 may replace subgraph 1 in neural network model 1 with subgraph 2 and generate correspondence 1, namely computing resource 1, neural network model 1, subgraph 1, and subgraph 2. As shown in (b) of Fig. 10, assuming that the computing power of computing resource 2 when computing subgraph 4 based on neural network model 2 is higher than its computing power when computing subgraph 3 based on neural network model 2, the training device 320 may replace subgraph 3 in neural network model 2 with subgraph 4 and generate correspondence 2, namely computing resource 2, neural network model 2, subgraph 3, and subgraph 4.
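One way such a knowledge base could be kept, sketched in Python under the assumption that subgraphs and resources are identified by labels; the class and method names are hypothetical and chosen only for this illustration.

class EquivalentSubgraphKnowledgeBase:
    # Hypothetical hardware-affine knowledge base: remembers which replacement
    # brought a performance gain on which computing resource, so that later
    # models hitting the same subgraph can reuse the replacement.

    def __init__(self):
        self._entries = []  # (resource, model, first_subgraph, second_subgraph)

    def record(self, resource, model, first_subgraph, second_subgraph):
        # Save a replacement that brought a measured performance gain.
        self._entries.append((resource, model, first_subgraph, second_subgraph))

    def hit(self, resource, first_subgraph):
        # Return a recorded faster equivalent for this resource, if any.
        for res, _model, slow, fast in self._entries:
            if res == resource and slow == first_subgraph:
                return fast
        return None

kb = EquivalentSubgraphKnowledgeBase()
kb.record("resource-1", "model-1", "subgraph-1", "subgraph-2")  # correspondence 1
kb.record("resource-2", "model-2", "subgraph-3", "subgraph-4")  # correspondence 2
print(kb.hit("resource-1", "subgraph-1"))  # subgraph-2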
The training device 320 may call the underlying AI chip and the operator interfaces it provides through an operator adaptation layer, optimize the neural network model, and collect the operator and subgraph performance data required by the system.
Subgraph replacement based on equivalent subgraphs designed from the experience of domain experts does not take the computing power characteristics of the computing resource into account, so the replacement may instead lengthen the duration of the optimized neural network model when processing data on that computing resource. Moreover, equivalent subgraphs designed for a particular application are not necessarily applicable to different computing resources; for different computing resources, the equivalent subgraphs have to be re-analyzed and re-designed and cannot be reused. In contrast, the embodiments of this application decide whether to perform subgraph replacement according to the computing power correspondence or the cost function, ensuring that each subgraph replacement effectively brings a performance gain even on different computing resources. The neural network model optimization method provided in the embodiments of this application is easy to use and the process is fully automated: the user only needs to input the neural network model to be optimized, and the optimized neural network model is obtained without any other operation; the optimization process is simple and efficient.
It should be noted that, for any subgraph in the neural network model to be optimized that may be replaced, the training device 320 may perform subgraph replacement according to the methods of steps 720 and 730 above. In addition, if the training device 320 has already performed subgraph replacement on the neural network model to be optimized and obtained an updated neural network model, the training device 320 may continue to perform subgraph replacement on the updated neural network model according to steps 720 and 730, traversing all subgraphs that may be replaced, until the optimized neural network model is obtained. It can be understood that, by performing subgraph replacement on the neural network model to be optimized according to steps 720 and 730, the training device 320 may obtain multiple updated neural network models, and the finally obtained optimized neural network model may be the optimal one among the multiple updated neural network models.
After the training device 320 performs subgraph replacement optimization on the neural network model to be optimized and obtains the optimized neural network model, the optimized neural network model may be deployed to the execution device 310, and the execution device 310 processes application data based on the optimized neural network model. For example, as shown in Fig. 7 or Fig. 9, after step 734, steps 740 and 750 are executed. In step 740, the training device 320 deploys the optimized neural network model to the execution device 310. In step 750, the execution device 310 processes application data based on the optimized neural network model to implement application functions such as recognition, thereby reducing the time the execution device 310 takes to process application data.
By contrast, quantization techniques modify the weights of a neural network model to reduce the amount of underlying computation and achieve acceleration. However, quantization generally requires sample data for calibration, otherwise it causes a large loss of accuracy; in scenarios without any sample data, quantization is not applicable.
By contrast, pruning techniques delete weights or channels of low importance from a neural network model to reduce the number of parameters and accelerate inference. Weight pruning, also called unstructured pruning, leads to sparsification after pruning and generally requires dedicated hardware that supports sparse computation, otherwise no acceleration is obtained. Channel pruning, also called structured pruning, brings an obvious loss of accuracy; after pruning, training data is needed to retrain the neural network model to recover accuracy, so it is not applicable to scenarios without training data.
The inference acceleration algorithm based on equivalent subgraph replacement provided in the embodiments of this application can automatically search for equivalent subgraphs affine to the hardware platform and automatically perform subgraph replacement on the neural network model to be optimized, achieving inference acceleration without loss of accuracy. The embodiments of this application do not require any data and can achieve inference acceleration even without any data, so the applicable scenarios are broad.
The application scenarios described in the embodiments of this application may include target detection, surveillance, autonomous driving, speech recognition, product recommendation, machine translation, AI product classification, industrial quality inspection, and so on.
Target detection is an important component of computer vision. Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, and medical diagnosis; it is the study of how to use cameras and computers to acquire the data and information about a photographed subject that a user needs. Figuratively speaking, it equips a computer with eyes (cameras) and a brain (algorithms) to identify and measure targets in place of human eyes, so that the computer can perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses computers in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
Target detection methods can be applied in scenarios such as face detection, vehicle detection, pedestrian counting, autonomous driving, security systems, and the medical field. For example, in an autonomous driving scenario, a self-driving car identifies objects in the surrounding environment while driving so as to adjust its speed and direction, enabling it to drive safely and avoid traffic accidents; the objects may be other vehicles, traffic control devices, or other types of objects. As another example, in a security system, a large number of users are identified to help staff locate a target person as quickly as possible. Usually, input data (such as an image or a video) is input into a neural network with a target detection function; the neural network extracts features from the input data, performs target detection based on the extracted features, and obtains a detection result.
In addition, the execution device 310 may already have stored the optimized neural network model before executing step 750, that is, before processing application data based on the optimized neural network model; in that case, the execution device 310 may read the optimized neural network model from memory and process application data based on it.
Optionally, if the execution device 310 does not store the optimized neural network model, it needs to download the optimized neural network model from a server or optimize the neural network model itself. The server may be a cloud server.
For example, Fig. 11 is a schematic structural diagram of a system 1100 provided in this application. As shown in Fig. 11, the system 1100 may be an entity that provides cloud services to users by using basic resources. The system 1100 includes a cloud data center 1110. The cloud data center 1110 includes a device resource pool (including computing resources 1111, storage resources 1112, and network resources 1113) and a cloud service platform 1120. The computing resources 1111 included in the cloud data center 1110 may be computing devices (for example, servers).
An interaction apparatus 1131 may be deployed on the execution device 1130. The interaction apparatus 1131 may be a browser or an application capable of exchanging messages with the cloud service platform 1120. A user may access the cloud service platform 1120 through the interaction apparatus 1131 and upload a request to the cloud data center 1110 to request optimization of a neural network model used in an autonomous driving scenario. After receiving the request uploaded by the execution device 1130, the cloud data center 1110 optimizes the neural network model requested by the user and feeds the optimized neural network model 301 back to the execution device 1130. The execution device 1130 may be a smart terminal or an edge station. The edge station can process the application data of a self-driving car and transmit the processing result to the self-driving car; the processing result is used to instruct the self-driving car's driving operations. Alternatively, the execution device 1130 may be the self-driving car itself; in that case, the edge station deploys the optimized neural network model 301 to the self-driving car, and the self-driving car processes application data according to the optimized neural network model to guide its driving operations.
It can be understood that, in order to implement the functions in the foregoing embodiments, the computing device includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art should readily appreciate that, in combination with the units and method steps of the examples described in the embodiments disclosed in this application, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.
The neural network model optimization method provided by this embodiment has been described in detail above with reference to Fig. 1 to Fig. 11; the neural network model optimization apparatus provided by this embodiment is described below with reference to Fig. 12.
Fig. 12 is a schematic structural diagram of a possible neural network model optimization apparatus provided by this embodiment. These neural network model optimization apparatuses can be used to implement the functions of the training device 320 in the foregoing method embodiments, and therefore can also achieve the beneficial effects of the foregoing method embodiments. In this embodiment, the neural network model optimization apparatus may be the training device 320 shown in Fig. 4, Fig. 7, or Fig. 9, or may be a module (such as a chip) applied to a server.
As shown in Fig. 12, the neural network model optimization apparatus 1200 includes a communication module 1210, a to-be-replaced module 1220, a replacement module 1230, and a storage module 1240. The neural network model optimization apparatus 1200 is used to implement the functions of the training device 320 in the method embodiments shown in Fig. 4, Fig. 7, or Fig. 9.
The communication module 1210 is configured to obtain the neural network model to be optimized and to deploy the optimized neural network model to the execution device 310. The neural network model to be optimized contains a plurality of operators, the plurality of operators form a plurality of subgraphs, and at least two operators form one subgraph. For example, the communication module 1210 is configured to execute steps 710 and 740 in Fig. 7.
The to-be-replaced module 1220 is configured to search the subgraph set for an equivalent subgraph of the first subgraph in the neural network model to be optimized. The equivalent subgraph produces the same output as the first subgraph for the same input data, the processing efficiency of the equivalent subgraph for the input data is greater than that of the first subgraph, and the subgraph set includes a plurality of subgraphs. For example, the to-be-replaced module 1220 is configured to execute steps 720 and 730 in Fig. 7.
The replacement module 1230 is configured to replace the first subgraph in the neural network model to be optimized with the equivalent subgraph. For example, the replacement module 1230 is configured to execute step 734 in Fig. 7.
The to-be-replaced module 1220 is specifically configured to: determine, in the subgraph set, the second subgraph corresponding to the first subgraph, where the second subgraph produces the same output as the first subgraph for the same input data; determine that the computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph; and use the second subgraph as the equivalent subgraph.
The storage module 1240 may correspond to the storage of information such as the subgraph set and the operator set in the foregoing method embodiments.
The neural network model optimization apparatus 1200 may further include a search module 1250. The search module 1250 is configured to obtain an operator set according to the neural network models of multiple application scenarios, and to search, according to the operator set, for subgraphs having an equivalence relationship so as to generate the subgraph set. For example, the search module 1250 is configured to execute steps 410 to 420 in Fig. 4.
Optionally, the neural network model optimization apparatus 1200 may further include an update module 1260. The update module 1260 updates the operator set and the subgraph set with newly added operators.
It should be understood that the neural network model optimization apparatus 1200 of the embodiments of this application may be implemented by a graphics processing unit (GPU), a neural network processing unit (NPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the neural network model optimization method shown in Fig. 4, Fig. 7, or Fig. 9 is implemented by software, the neural network model optimization apparatus 1200 and its modules may also be software modules.
The neural network model optimization apparatus 1200 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the units in the neural network model optimization apparatus 1200 are respectively intended to implement the corresponding processes of the methods in Fig. 4, Fig. 7, or Fig. 9; for brevity, details are not repeated here.
Fig. 13 is a schematic structural diagram of a computing device 1300 provided in this embodiment. As shown in the figure, the computing device 1300 includes a processor 1310, a bus 1320, a memory 1330, a memory unit 1350 (also called a main memory unit), and a communication interface 1340. The processor 1310, the memory 1330, the memory unit 1350, and the communication interface 1340 are connected through the bus 1320.
It should be understood that, in this embodiment, the processor 1310 may be a CPU, or may be another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The processor may also be a GPU, an NPU, a microprocessor, an ASIC, or one or more integrated circuits configured to control the execution of the programs of the solutions of this application.
The communication interface 1340 is used to implement communication between the computing device 1300 and external devices or components. In this embodiment, the communication interface 1340 is used for data exchange with other computing devices.
The bus 1320 may include a path for transferring information between the above components (such as the processor 1310, the memory unit 1350, and the memory 1330). In addition to a data bus, the bus 1320 may also include a power bus, a control bus, a status signal bus, and the like; for clarity, however, the various buses are all labeled as bus 1320 in the figure. The bus 1320 may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX) bus, or the like.
As an example, the computing device 1300 may include multiple processors. A processor may be a multi-core (multi-CPU) processor, and may refer to one or more devices, circuits, and/or computing units for processing data (for example, computer program instructions). The processor 1310 may call the subgraph set stored in the memory 1330, determine, according to the subgraph set, the first subgraph to be replaced in the neural network model to be optimized, and replace the first subgraph with the second subgraph in the subgraph set that is equivalent to the first subgraph, to obtain the optimized neural network model; when data is processed based on the computing resource, the duration of the optimized neural network model is shorter than the duration of the neural network model to be optimized.
It should be noted that Fig. 13 only takes the case in which the computing device 1300 includes one processor 1310 and one memory 1330 as an example. Here, the processor 1310 and the memory 1330 each indicate a class of devices or components; in a specific embodiment, the number of each type of device or component may be determined according to service requirements.
The memory unit 1350 may correspond to the storage medium used to store information such as the subgraph set in the foregoing method embodiments. The memory unit 1350 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The memory 1330 is used to store data and may be a solid-state drive or a mechanical hard disk.
The computing device 1300 may be a general-purpose device or a special-purpose device. For example, the computing device 1300 may be a mobile phone terminal, a tablet computer, a notebook computer, a VR device, an AR device, a mixed reality (MR) device or an extended reality (ER) device, a vehicle-mounted terminal, or the like, or may be an edge device (for example, a box carrying a chip with processing capability). Optionally, the computing device 1300 may also be a server or another device with computing capability.
It should be understood that the computing device 1300 according to this embodiment may correspond to the neural network model optimization apparatus 1200 in this embodiment and to the corresponding subject performing the methods in Fig. 4, Fig. 7, or Fig. 9, and the above and other operations and/or functions of the modules in the neural network model optimization apparatus 1200 are respectively intended to implement the corresponding processes in Fig. 4, Fig. 7, or Fig. 9; for brevity, details are not repeated here.
The method steps in this embodiment may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a computing device. The processor and the storage medium may also exist as discrete components in a network device or a terminal device.
In the foregoing embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are executed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state drive (SSD)).
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (17)

  1. A neural network model optimization method, wherein the method is executed by a computing device and comprises:
    obtaining a neural network model to be optimized, wherein the neural network model to be optimized comprises a plurality of operators, the plurality of operators form a plurality of subgraphs, and at least two operators form one subgraph;
    searching a subgraph set for an equivalent subgraph of a first subgraph in the neural network model to be optimized, wherein the equivalent subgraph produces the same output as the first subgraph for the same input data, the processing efficiency of the equivalent subgraph for the input data is greater than the processing efficiency of the first subgraph for the input data, and the subgraph set comprises a plurality of subgraphs; and
    replacing the first subgraph in the neural network model to be optimized with the equivalent subgraph.
  2. The method according to claim 1, wherein the searching the subgraph set for an equivalent subgraph of the first subgraph in the neural network model to be optimized comprises:
    determining, in the subgraph set, a second subgraph corresponding to the first subgraph, wherein the second subgraph produces the same output as the first subgraph for the same input data;
    determining that a computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph; and
    using the second subgraph as the equivalent subgraph.
  3. The method according to claim 2, wherein the determining, in the subgraph set, a second subgraph corresponding to the first subgraph comprises:
    inputting the input data into the first subgraph, running the first subgraph by the computing resource, and outputting a running result; and
    inputting the input data into at least one subgraph in the subgraph set, and determining a subgraph whose result is the same as the running result as the second subgraph.
  4. The method according to claim 3, wherein the method further comprises:
    recording a mapping relationship between the first subgraph and the second subgraph into the subgraph set.
  5. The method according to claim 2, wherein the subgraph set comprises a first mapping relationship between the first subgraph and the second subgraph; and
    the determining, in the subgraph set, a second subgraph corresponding to the first subgraph comprises:
    determining the second subgraph corresponding to the first subgraph according to the first mapping relationship.
  6. The method according to claim 3 or 4, wherein the determining that the computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph comprises:
    running, by the computing resource, the first subgraph by invoking a cost function, and recording a first data processing efficiency;
    running, by the computing resource, the second subgraph by invoking the cost function, and recording a second data processing efficiency; and
    determining, by comparing the first data processing efficiency with the second data processing efficiency, that the data processing efficiency when executing the second subgraph is higher than the data processing efficiency when executing the first subgraph.
  7. The method according to claim 6, wherein the method further comprises:
    recording a mapping relationship among the computing resource, the first subgraph, and the second subgraph into the subgraph set.
  8. The method according to claim 2, wherein the subgraph set comprises a second mapping relationship among a computing resource, the first subgraph, and the second subgraph;
    the determining, in the subgraph set, a second subgraph corresponding to the first subgraph comprises:
    determining the second subgraph corresponding to the first subgraph according to the second mapping relationship; and
    the determining that the computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph comprises:
    determining, according to the second mapping relationship, that the computing resource used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph.
  9. A neural network model optimization apparatus, comprising:
    a communication module, configured to obtain a neural network model to be optimized, wherein the neural network model to be optimized comprises a plurality of operators, the plurality of operators form a plurality of subgraphs, and at least two operators form one subgraph;
    a to-be-replaced module, configured to search a subgraph set for an equivalent subgraph of a first subgraph in the neural network model to be optimized, wherein the equivalent subgraph produces the same output as the first subgraph for the same input data, the processing efficiency of the equivalent subgraph for the input data is greater than the processing efficiency of the first subgraph for the input data, and the subgraph set comprises a plurality of subgraphs; and
    a replacement module, configured to replace the first subgraph in the neural network model to be optimized with the equivalent subgraph.
  10. The apparatus according to claim 9, wherein, when searching the subgraph set for an equivalent subgraph of the first subgraph in the neural network model to be optimized, the to-be-replaced module is specifically configured to:
    determine, in the subgraph set, a second subgraph corresponding to the first subgraph, wherein the second subgraph produces the same output as the first subgraph for the same input data;
    determine that a computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph; and
    use the second subgraph as the equivalent subgraph.
  11. The apparatus according to claim 10, wherein, when determining, in the subgraph set, the second subgraph corresponding to the first subgraph, the to-be-replaced module is specifically configured to:
    input the input data into the first subgraph, run the first subgraph by the computing resource, and output a running result; and
    input the input data into at least one subgraph in the subgraph set, and determine a subgraph whose result is the same as the running result as the second subgraph.
  12. The apparatus according to claim 11, wherein the apparatus further comprises:
    a storage module, configured to record a mapping relationship between the first subgraph and the second subgraph into the subgraph set.
  13. The apparatus according to claim 10, wherein the subgraph set comprises a first mapping relationship between the first subgraph and the second subgraph; and
    when determining, in the subgraph set, the second subgraph corresponding to the first subgraph, the to-be-replaced module is specifically configured to:
    determine the second subgraph corresponding to the first subgraph according to the first mapping relationship.
  14. The apparatus according to claim 11 or 12, wherein, when determining that the computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph, the to-be-replaced module is specifically configured to:
    run, by the computing resource, the first subgraph by invoking a cost function, and record a first data processing efficiency;
    run, by the computing resource, the second subgraph by invoking the cost function, and record a second data processing efficiency; and
    determine, by comparing the first data processing efficiency with the second data processing efficiency, that the data processing efficiency when executing the second subgraph is higher than the data processing efficiency when executing the first subgraph.
  15. The apparatus according to claim 14, wherein the apparatus further comprises:
    a storage module, configured to record a mapping relationship among the computing resource, the first subgraph, and the second subgraph into the subgraph set.
  16. The apparatus according to claim 10, wherein the subgraph set comprises a second mapping relationship among a computing resource, the first subgraph, and the second subgraph;
    when determining, in the subgraph set, the second subgraph corresponding to the first subgraph, the to-be-replaced module is specifically configured to:
    determine the second subgraph corresponding to the first subgraph according to the second mapping relationship; and
    the determining that the computing resource in the computing device used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph comprises:
    determining, according to the second mapping relationship, that the computing resource used to execute the first subgraph has a higher data processing efficiency when executing the second subgraph than when executing the first subgraph.
  17. A computing device, comprising a memory and a processor, wherein the memory is configured to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method according to any one of claims 1 to 8 are performed.
PCT/CN2022/142689 2021-12-31 2022-12-28 Neural network model optimization method and apparatus, and computing device WO2023125628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111673491.2 2021-12-31
CN202111673491.2A CN116432736A (en) 2021-12-31 2021-12-31 Neural network model optimization method and device and computing equipment

Publications (1)

Publication Number Publication Date
WO2023125628A1 true WO2023125628A1 (en) 2023-07-06

Family

ID=86997968

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142689 WO2023125628A1 (en) 2021-12-31 2022-12-28 Neural network model optimization method and apparatus, and computing device

Country Status (2)

Country Link
CN (1) CN116432736A (en)
WO (1) WO2023125628A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629339A (en) * 2023-07-21 2023-08-22 美智纵横科技有限责任公司 Model optimization method, data processing device, storage medium and chip

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114091B (en) * 2023-10-25 2024-03-05 深圳开鸿数字产业发展有限公司 Calculation graph processing method based on federal learning, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659728A (en) * 2019-09-24 2020-01-07 上海寒武纪信息科技有限公司 Neural network optimization method and device, computer equipment and storage medium
CN111723935A (en) * 2020-06-24 2020-09-29 湖北亿咖通科技有限公司 Neural network computation graph processing method, computer storage medium and electronic device
CN111860820A (en) * 2020-07-31 2020-10-30 北京灵汐科技有限公司 Neural network operator dividing method and device and dividing equipment
US20210319298A1 (en) * 2021-06-24 2021-10-14 Intel Corporation Compute-based subgraph partitioning of deep learning models for framework integration

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659728A (en) * 2019-09-24 2020-01-07 上海寒武纪信息科技有限公司 Neural network optimization method and device, computer equipment and storage medium
CN111723935A (en) * 2020-06-24 2020-09-29 湖北亿咖通科技有限公司 Neural network computation graph processing method, computer storage medium and electronic device
CN111860820A (en) * 2020-07-31 2020-10-30 北京灵汐科技有限公司 Neural network operator dividing method and device and dividing equipment
US20210319298A1 (en) * 2021-06-24 2021-10-14 Intel Corporation Compute-based subgraph partitioning of deep learning models for framework integration

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629339A (en) * 2023-07-21 2023-08-22 美智纵横科技有限责任公司 Model optimization method, data processing device, storage medium and chip
CN116629339B (en) * 2023-07-21 2023-10-03 美智纵横科技有限责任公司 Model optimization method, data processing device, storage medium and chip

Also Published As

Publication number Publication date
CN116432736A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN110175671B (en) Neural network construction method, image processing method and device
EP4064130A1 (en) Neural network model update method, and image processing method and device
WO2022083536A1 (en) Neural network construction method and apparatus
WO2020253416A1 (en) Object detection method and device, and computer storage medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
WO2021155792A1 (en) Processing apparatus, method and storage medium
WO2020192736A1 (en) Object recognition method and device
WO2021238366A1 (en) Neural network construction method and apparatus
CN112990211B (en) Training method, image processing method and device for neural network
WO2021218517A1 (en) Method for acquiring neural network model, and image processing method and apparatus
WO2021057056A1 (en) Neural architecture search method, image processing method and device, and storage medium
WO2021164750A1 (en) Method and apparatus for convolutional layer quantization
WO2022001805A1 (en) Neural network distillation method and device
US20230215159A1 (en) Neural network model training method, image processing method, and apparatus
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
US20230082597A1 (en) Neural Network Construction Method and System
WO2021008206A1 (en) Neural architecture search method, and image processing method and device
WO2021164751A1 (en) Perception network architecture search method and device
CN110222718B (en) Image processing method and device
WO2022007867A1 (en) Method and device for constructing neural network
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
WO2021129668A1 (en) Neural network training method and device
CN114492723A (en) Neural network model training method, image processing method and device
CN115018039A (en) Neural network distillation method, target detection method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914880

Country of ref document: EP

Kind code of ref document: A1