CN111667054B - Method, device, electronic equipment and storage medium for generating neural network model - Google Patents

Method, device, electronic equipment and storage medium for generating neural network model

Info

Publication number
CN111667054B
CN111667054B (application CN202010503073.8A)
Authority
CN
China
Prior art keywords
pruning
strategy
model
neural network
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010503073.8A
Other languages
Chinese (zh)
Other versions
CN111667054A (en)
Inventor
希滕
张刚
温圣召
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010503073.8A priority Critical patent/CN111667054B/en
Publication of CN111667054A publication Critical patent/CN111667054A/en
Application granted granted Critical
Publication of CN111667054B publication Critical patent/CN111667054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a method, a device, an electronic device, and a storage medium for generating a neural network model, and relate to the technical fields of artificial intelligence, deep learning, and image processing. A specific implementation scheme is as follows: performing a plurality of iterative search operations, each iterative search operation comprising the steps of: determining a target compression strategy of a preset neural network model in a search space of preset compression strategies by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model; acquiring the performance of the current compressed model; generating feedback information based on the performance of the compressed model; and determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches a preset convergence condition. The method can search out an optimal model compression strategy.

Description

Method, device, electronic equipment and storage medium for generating neural network model
Technical Field
Embodiments of the present application relate to the field of computer technology, in particular to the technical fields of artificial intelligence, deep learning, and image processing, and more particularly to a method, a device, an electronic device, and a storage medium for generating a neural network model.
Background
With the continuous development of artificial intelligence technology, the performance of deep neural networks has reached an unprecedented height. Complex models perform well, but the large storage space they occupy and their high consumption of computational resources make it difficult to deploy them efficiently on hardware platforms and to provide real-time services.
Disclosure of Invention
Provided are a method, apparatus, electronic device, and storage medium for generating a neural network model.
According to a first aspect, there is provided a method of generating a neural network model, the method comprising performing a plurality of iterative search operations; the iterative search operation includes the steps of: determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and obtaining the performance of the current compressed model; generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches the preset convergence condition.
According to a second aspect, there is provided an apparatus for generating a neural network model, the apparatus comprising: an execution unit configured to perform a plurality of iterative search operations; the execution unit includes: a search unit configured to perform the following steps in an iterative search operation: determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; a compression unit configured to perform the following steps in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and obtaining the performance of the current compressed model; a feedback unit configured to perform the following steps in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and a determining unit configured to perform the following steps in the iterative search operation: determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches the preset convergence condition.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described in the first aspect.
According to the technology provided by the application, the memory space occupied by the neural network model can be reduced under the condition that the accuracy of the neural network model is ensured.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of generating a neural network model, according to the present application;
FIG. 3 is a flow chart of one embodiment of constructing a search space for a preset compression policy;
FIG. 4 is a schematic structural view of one embodiment of an apparatus for generating a neural network model according to the present application;
fig. 5 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary architecture 100 of a method of generating a neural network model or an apparatus of generating a neural network model to which the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as an image processing type application, an information analysis type application, a voice assistant type application, a shopping type application, a financial type application, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting internet access, including but not limited to smartphones, tablets, notebooks, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a back-end server providing back-end support for applications installed on the terminal devices 101, 102, 103. For example, the server 105 may receive information to be processed transmitted by the terminal devices 101, 102, 103, process the information using a neural network model, and return the processing results to the terminal devices 101, 102, 103.
In the application scenario of the present disclosure, the server 105 may search for a suitable compression strategy for a pre-trained neural network model, so that the compressed neural network model is adapted to run on the terminal devices 101, 102, 103.
It should be noted that, the method for generating the neural network model provided by the embodiments of the present disclosure is generally performed by the server 105, and accordingly, the device for generating the neural network model is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating a neural network model in accordance with the present application is shown. The method of generating a neural network model includes performing a plurality of iterative search operations. Specifically, the iterative search operation includes the steps of:
Step 201, determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller.
In this embodiment, the execution body (e.g., the server 105 shown in fig. 1) of the method for generating a neural network model may determine, by using a preset compression policy controller, a target compression policy of the preset neural network model in a search space of preset compression policies. Here, the preset neural network model is a model constructed or trained based on a deep learning method and may be used to perform a deep learning task such as image processing. The compression policy includes a combined policy of pruning and quantization, which represents the pruning and quantization methods employed for the neural network model; pruning and quantization are two important methods of compressing a neural network model. Pruning is a method of cutting out unimportant network elements in a neural network model according to a certain rule, so as to reduce the operations corresponding to those unimportant network elements during model execution and to reduce the occupied memory space, for example by pruning the convolution kernels of a certain layer of the neural network model. Pruning may also be expressed as a pruning rate set for each layer of the neural network model, with common pruning rates of 5%, 10%, 15%, 20%, ..., 95%, and so on. Quantization is a method of storing floating point numbers originally expressed with high bit values in the form of lower bit values to reduce the occupied memory space, for example quantizing 64-bit or 32-bit values into 16-bit, 8-bit, 4-bit, or 2-bit values.
The compression policy controller may be used to control or generate a compression policy for the neural network model, and may be embodied as various machine learning algorithms, such as a recurrent neural network, a reinforcement learning algorithm, a genetic algorithm, and the like. The compression policy controller may select and combine pruning policies and quantization policies in a search space of a preset compression policy to generate a target compression policy of the preset neural network model, or select a combination of pruning and quantization policies in a search space of a preset compression policy to generate a target compression policy of the preset neural network model. Optionally, the executing body may generate the coding sequence through the compression policy controller, and then decode the coding sequence according to a predefined correspondence between coding and compression policies, to obtain a target compression policy of the preset neural network model.
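As a minimal sketch of the encoding idea described above (the concrete candidate values, the decoding table, and all identifiers below are illustrative assumptions, not the patent's implementation), a combined compression strategy can be represented as one pruning-rate index and one bit-width index per network structural unit and decoded through a predefined correspondence:

```python
# Illustrative sketch only: a possible encoding/decoding of a combined
# compression strategy; the candidate tables and names are hypothetical.
PRUNING_RATES = [round(0.05 * i, 2) for i in range(1, 20)]  # 5%, 10%, ..., 95%
BIT_WIDTHS = [16, 8, 4, 2]                                  # candidate quantization bit widths

def decode_policy(code_sequence):
    """Decode a coding sequence emitted by the controller into a per-unit
    combined strategy of (pruning rate, quantization bit width)."""
    return [
        {"pruning_rate": PRUNING_RATES[p], "bit_width": BIT_WIDTHS[b]}
        for p, b in code_sequence
    ]

# Example: a controller output for a model with three network structural units.
print(decode_policy([(1, 1), (3, 0), (9, 2)]))
```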
Step 202, pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and obtaining the performance of the current compressed model.
In this embodiment, the execution body may prune and quantize the preset neural network model according to the combined strategy of pruning and quantization in the target compression strategy, and test the performance of the compressed model. The performance of the neural network model may include, but is not limited to, at least one of: computational efficiency, accuracy, computational complexity, and the like.
Here, the order in which pruning and quantization are applied to the preset neural network model is not limited.
Optionally, the search space may include selectable orderings of the pruning operation and the quantization operation, in which case the order of pruning and quantization is specified in the target compression strategy searched out in each iterative search. The execution body may then perform the pruning and quantization operations in the order specified in the target compression strategy.
After compressing the neural network model, the performance of the compressed model may be tested on a test dataset according to the specified deep learning task. Here, the test dataset may be a collection of media data such as a text dataset, an image dataset, or an audio dataset. For example, if the specified deep learning task is to extract image features, the accuracy with which the compressed model extracts image features may be obtained as the performance of the current compressed model; if the specified deep learning task is to extract text features, the accuracy with which the compressed model extracts text features may be obtained as the performance of the current compressed model.
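The step above can be sketched as follows; the hooks prune_fn, quantize_fn, and eval_fn are hypothetical stand-ins for whichever framework hosts the model, and the sketch only illustrates applying the two operations in the specified order and measuring task accuracy:

```python
# Sketch only: prune_fn, quantize_fn and eval_fn are caller-supplied hooks,
# not the patent's API.
def compress_and_evaluate(model, policy, order, prune_fn, quantize_fn, eval_fn, test_data):
    """Prune and quantize `model` in the order given by the target compression
    strategy, then test the compressed model's accuracy on the test dataset
    (e.g. image-feature extraction accuracy for an image processing task)."""
    compressed = model
    for op in order:                       # e.g. ("prune", "quantize")
        if op == "prune":
            compressed = prune_fn(compressed, policy)
        elif op == "quantize":
            compressed = quantize_fn(compressed, policy)
    accuracy = eval_fn(compressed, test_data)
    return compressed, accuracy
```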
Step 203, generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach the preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller.
In this embodiment, the execution body may guide the preset compression policy controller to update its policy generation mode by using the performance of the compressed model as feedback information, and execute the next iterative search operation. When the feedback information is determined to not reach the preset convergence condition, the feedback information representing the performance of the compressed model is utilized to guide the updating of the preset compression strategy controller, and the updated compression strategy controller is utilized to execute iterative search operations such as determining the target compression strategy of the preset neural network model in the search space of the preset compression strategy.
As an example, when the compression policy controller is implemented as a recurrent neural network, the parameters of the recurrent neural network may be adjusted by using a gradient descent method based on the feedback information, so that the recurrent neural network model after adjusting the parameters searches for a new compression policy from the search space. When the compression policy controller is implemented as a reinforcement learning algorithm, the feedback information is used as a reward value (reward) to guide the reinforcement learning algorithm to update the action and state parameters, thereby generating a new compression policy.
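A hedged sketch of the feedback-as-reward idea is given below. It uses a simple score-based controller rather than the recurrent network, reinforcement learning algorithm, or genetic algorithm the patent mentions; the class and its update rule are illustrative assumptions only:

```python
import random

class ScoreBasedController:
    """Illustrative stand-in for the compression policy controller: it keeps a
    score per (unit, choice) and reinforces the last sampled choices in
    proportion to the reward, loosely mimicking a reinforcement-learning
    update. It is not the patent's concrete controller."""

    def __init__(self, num_units, num_choices):
        self.scores = [[0.0] * num_choices for _ in range(num_units)]
        self.last_choices = None

    def sample(self):
        # Choose, for each network structural unit, the option with the
        # highest noise-perturbed score; the noise keeps exploration alive.
        self.last_choices = [
            max(range(len(row)), key=lambda i: row[i] + random.random())
            for row in self.scores
        ]
        return self.last_choices

    def update(self, reward):
        # Feedback information is used as the reward value guiding the update.
        for unit, choice in enumerate(self.last_choices):
            self.scores[unit][choice] += reward
```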
Step 204, determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches the preset convergence condition.
In this embodiment, after generating the feedback information, the execution body may determine whether the feedback information reaches the preset convergence condition, for example, whether the rate of change of the feedback information over the last several consecutive iterative search operations is lower than a preset rate threshold. If so, the iterative search operations may be stopped, the current compression strategy is taken as the searched-out optimal target compression strategy, and the model obtained by pruning and quantizing the preset neural network model with this target compression strategy is taken as the target neural network model. Here, the target neural network model may be a neural network model for performing a deep learning task.
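One way to express the convergence test just described (rate of change of the feedback over the last several iterations below a threshold) is sketched below; the window size and threshold values are illustrative assumptions:

```python
def has_converged(feedback_history, window=5, rate_threshold=0.01):
    """Return True when the relative change of the feedback value over the
    last `window` iterative search operations is below the threshold."""
    if len(feedback_history) < window:
        return False
    recent = feedback_history[-window:]
    baseline = abs(recent[0]) or 1.0      # guard against division by zero
    return (max(recent) - min(recent)) / baseline < rate_threshold

# The last five feedback values barely change, so the search may stop here.
print(has_converged([0.71, 0.848, 0.849, 0.850, 0.850, 0.851]))  # True
```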
In some optional implementations of the present embodiment, the execution body may further obtain the performance of the current compressed model according to the following steps: training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
In this implementation, the execution body may acquire sample data according to the specified deep learning task executed by the preset neural network model, train the current compressed model based on the sample data, and test the performance of the trained compressed model using test data. The sample data may include input data for the model together with the data the compressed model is expected to output, which are used to train the compressed model.
Through this implementation, the compressed model is further trained after the neural network model is compressed, the parameters of the compressed model are optimized, and the performance of the compressed model is further improved. Because the structure and parameters of the compressed model have already been optimized, training the compressed model does not occupy excessive computing resources. Moreover, evaluating the performance of the trained compressed model allows the compression strategy to be assessed more accurately, so that, in combination with the above embodiment, a better compression strategy can be searched out and the performance of the generated target neural network model is improved.
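A short sketch of this optional implementation is given below: the compressed model is briefly fine-tuned on sample data before its performance is measured. The training and evaluation hooks, and the epoch count, are hypothetical assumptions:

```python
# Sketch only: train_step_fn and eval_fn are hypothetical hooks for whatever
# training framework hosts the compressed model.
def finetune_then_evaluate(compressed_model, sample_data, test_data,
                           train_step_fn, eval_fn, num_epochs=1):
    """Briefly fine-tune the compressed model on sample data (model inputs
    paired with the expected outputs), then test the trained model."""
    for _ in range(num_epochs):
        for inputs, expected in sample_data:
            compressed_model = train_step_fn(compressed_model, inputs, expected)
    return eval_fn(compressed_model, test_data)
```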
In some optional implementations of the present embodiment, the method for generating a neural network model further includes: sending the target neural network model to a task execution end of media data processing, so that the media data are processed at the task execution end by using the target neural network model. The media data may include image data, and the target neural network model is a compressed neural network model for processing the image data.
In this implementation, the execution body may send the generated target neural network model to the task execution end of the media data processing, so that the media data are processed at the task execution end by using the target neural network model. Here, the task execution end of the media data processing may be a mobile client. After the execution body sends the target neural network model to the task execution end, the operation of applications on the mobile client can be supported by the target neural network model.
Because the target neural network model is a compressed model, its operation efficiency is high, its consumption of operation resources such as memory is reduced, and the requirements on the hardware environment for running the model are lowered, so the model is suitable for task-execution-end devices with lower hardware configurations while still producing accurate processing results. In addition, because the amount of computation needed for the deep learning task is reduced after compression, the hardware delay incurred by running the model can be effectively reduced, meeting the real-time requirements of the task execution end. Through this implementation, applying the compressed target neural network model in an application program can improve the feedback efficiency of the application.
The above target neural network model may be used to perform image processing tasks. In neural network processing, image data are typically converted into matrix data, which involves a large number of matrix operations. By searching for an optimized combined compression strategy of pruning and quantization through the above flow and compressing the neural network model used for image processing, the amount of matrix computation in the image processing process can be effectively reduced, thereby improving image processing efficiency.
In some embodiments, the method may further include the step of constructing a search space of a preset compression policy. Referring to fig. 3, fig. 3 is a flow chart 300 of one embodiment of constructing a search space for a preset compression policy, comprising the steps of:
step 301, determining first sensitivity information of each network structural unit in the preset neural network model to pruning methods in the preset pruning method set, and respectively screening candidate pruning methods meeting sensitivity screening conditions corresponding to each network structural unit from the preset pruning method set based on the first sensitivity information.
In this embodiment, the first sensitivity information characterizes the degradation rate of the performance of the corresponding neural network model when the pruning strategy of a network structural unit is changed from another pruning strategy to the candidate pruning strategy. Here, the pruning method set may include a plurality of candidate pruning methods; for example, a candidate pruning method may be a pruning rate for a network structural unit of the neural network, and the pruning rate may be, for example, 5%, 10%, 15%, 20%, ..., 95%. A network structural unit may be a single layer, or a module formed by stacking a plurality of layers, for example a convolution module formed by stacking several convolution layers, or a residual module (residual block) in a ResNet network.
The sensitivity screening condition may be that the first sensitivity information does not exceed a preset threshold, or that the sensitivity is not the highest of all candidate pruning methods. The sensitivity of different network structure units to the same pruning mode (pruning rate) is different, and the sensitivity screening conditions corresponding to different network structure units may be different.
For example, suppose that for the preset neural network model the pruning strategy of a certain layer is changed from another pruning strategy (pruning rate 95%) to a candidate pruning strategy (pruning rate 85%) while the pruning strategies of the other layers remain unchanged. If the performance degradation rate of the preset neural network model exceeds a preset threshold (e.g., 20%), the sensitivity of this layer to the candidate pruning strategy is high and the candidate pruning strategy is not suitable as the pruning strategy for this layer, so the candidate pruning strategy, or any compression strategy containing it, can be deleted from the compression strategy search space of this layer or of the neural network model. If, instead, the performance degradation rate of the preset neural network model does not exceed 2%, it may be determined that the sensitivity of this layer to the candidate pruning strategy is low, and the candidate pruning strategy may be retained in the compression strategy search space.
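The screening in step 301 can be illustrated as below; the baseline pruning rate, the threshold value, and the measurement hook measure_perf_fn (which returns the model's performance when only the given unit's pruning rate is changed) are illustrative assumptions:

```python
# Sketch only: measure_perf_fn is a hypothetical hook that returns the
# performance of the preset model when the given unit uses the given pruning
# rate while all other units keep their current strategy.
def screen_pruning_candidates(unit, candidate_rates, baseline_rate,
                              measure_perf_fn, threshold=0.2):
    """Keep the candidate pruning rates whose first sensitivity (performance
    degradation rate when this unit switches from the baseline pruning
    strategy to the candidate) does not exceed the threshold."""
    baseline_perf = measure_perf_fn(unit, baseline_rate)
    kept = []
    for rate in candidate_rates:
        degradation = (baseline_perf - measure_perf_fn(unit, rate)) / baseline_perf
        if degradation <= threshold:
            kept.append(rate)
    return kept
```

The same screening can be applied to candidate quantization bit widths in step 302 by swapping the pruning-rate hook for a quantization hook.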
Step 302, determining second sensitivity information of each network structure unit in the preset neural network model to the quantization methods in the preset quantization method set, and respectively screening candidate quantization methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset quantization method set based on the second sensitivity information.
In this embodiment, the second sensitivity information characterizes a degradation rate of performance of the corresponding neural network model when the quantization strategy of the network structural unit is changed from the other quantization strategy to the candidate quantization strategy. Here, the candidate quantization method set may include a plurality of candidate quantization methods, for example, the candidate quantization methods may be quantization bit widths (bit values) for respective network configuration units of the neural network, and the quantization bit widths (bit values) may be, for example, 16 bits, 8 bits, 4 bits, 2 bits. The network structural unit may be a single layer or a module formed by stacking a plurality of layers.
The sensitivity screening condition may be that the second sensitivity information does not exceed a preset threshold, or that the sensitivity is not the highest of all candidate quantization methods. The sensitivity of different network structure units to the same quantization mode (quantization bit width) is different, and the sensitivity screening conditions corresponding to different network structure units may be different.
For example, suppose that for the preset neural network model the quantization strategy of a certain layer is changed from another quantization strategy (32 bit) to a candidate quantization strategy (16 bit) while the quantization strategies of the other layers remain unchanged. If the performance degradation rate of the preset neural network model exceeds a preset threshold (e.g., 20%), the sensitivity of this layer to the candidate quantization strategy is high and the candidate quantization strategy is not suitable as the quantization strategy for this layer, so the candidate quantization strategy, or any compression strategy containing it, can be deleted from the compression strategy search space of this layer or of the neural network model. If, instead, the performance degradation rate of the preset neural network model does not exceed 2%, it may be determined that the sensitivity of this layer to the candidate quantization strategy is low, and the candidate quantization strategy may be retained in the compression strategy search space.
Step 303, combining the candidate pruning method and the candidate quantization method corresponding to each network structural unit to construct a search space of the compression strategy corresponding to each network structural unit.
In this embodiment, the execution body may combine the candidate pruning methods and the candidate quantization methods corresponding to each network structural unit determined in step 301 and step 302. For example, if the candidate pruning methods of a certain layer of the neural network are pruning method A and pruning method B, and the candidate quantization methods of that layer are quantization method a and quantization method b, the execution body may combine them and retain the combination of pruning method A and quantization method a, the combination of pruning method B and quantization method a, the combination of pruning method A and quantization method b, and the combination of pruning method B and quantization method b, thereby constructing the search space of the compression strategy corresponding to that layer of the neural network.
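A compact sketch of step 303 follows: the per-unit search space is the Cartesian product of that unit's surviving pruning and quantization candidates. The dictionary layout and names are illustrative:

```python
from itertools import product

def build_search_space(candidate_pruning, candidate_quantization):
    """Combine, for every network structural unit, its surviving candidate
    pruning methods with its surviving candidate quantization methods to form
    that unit's search space of combined compression strategies."""
    return {
        unit: list(product(candidate_pruning[unit], candidate_quantization[unit]))
        for unit in candidate_pruning
    }

# Example: a layer whose candidates are pruning methods {A, B} and
# quantization methods {a, b}; its search space is the four combinations.
print(build_search_space({"layer1": ["prune_A", "prune_B"]},
                         {"layer1": ["quant_a", "quant_b"]}))
```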
In some optional implementations of the present embodiment, the performance of the model includes at least the accuracy of the model and/or the running delay of the model. The first sensitivity information includes: the decreasing rate of the accuracy, or the increasing rate of the running delay, of the corresponding neural network model when the pruning strategy of a network structural unit is changed from another pruning strategy to the candidate pruning strategy; or the decreasing rate of a comprehensive performance index of the corresponding neural network model when the pruning strategy of a network structural unit is changed from another pruning strategy to the candidate pruning strategy, where the comprehensive performance index is calculated based on the accuracy and the running delay of the model. The second sensitivity information includes: the decreasing rate of the accuracy, or the increasing rate of the running delay, of the corresponding neural network model when the quantization strategy of a network structural unit is changed from another quantization strategy to the candidate quantization strategy; or the decreasing rate of the comprehensive performance index of the corresponding neural network model when the quantization strategy of a network structural unit is changed from another quantization strategy to the candidate quantization strategy.
In this implementation manner, the execution body may use the accuracy of the model as a performance index of the model, or may use the running delay of the model as a performance index of the model, or may comprehensively evaluate the accuracy of the model and the running delay of the model to obtain the performance index of the model.
The execution body may change the pruning method of a specified network structural unit from a first candidate pruning method to a second candidate pruning method, and obtain, as the first sensitivity of the specified network structural unit to the second candidate pruning method, the decreasing rate of the accuracy of the changed neural network model, or the increasing rate of its running delay, or the decreasing rate of the comprehensive performance index calculated based on the accuracy and the running delay of the model.
Similarly, the execution body may change the quantization method of a specified network structural unit from a first candidate quantization method to a second candidate quantization method, and obtain, as the second sensitivity of the specified network structural unit to the second candidate quantization method, the decreasing rate of the accuracy of the changed neural network model, or the increasing rate of its running delay, or the decreasing rate of the comprehensive performance index calculated based on the accuracy and the running delay of the model.
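One possible way to fold accuracy and running delay into a single comprehensive performance index, and to express sensitivity as its drop rate, is sketched here; the weighting scheme is an assumption, as the patent does not fix a concrete formula:

```python
# Sketch only: the weighting below is an assumed way to combine accuracy and
# run-time delay into one comprehensive performance index.
def composite_performance(accuracy, latency_ms, latency_weight=0.5):
    """Higher accuracy raises the index, higher run-time delay lowers it."""
    return accuracy - latency_weight * (latency_ms / 1000.0)

def sensitivity(index_before, index_after):
    """Degradation rate of the comprehensive performance index when a unit's
    strategy is switched to the candidate strategy."""
    return (index_before - index_after) / index_before

before = composite_performance(accuracy=0.95, latency_ms=20)
after = composite_performance(accuracy=0.93, latency_ms=12)
print(sensitivity(before, after))   # small value indicates low sensitivity
```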
By the implementation mode, the comprehensive performance index calculated by the precision and the operation delay of the model can be used as sensitivity, the influence of the compression strategy on the operation delay and the precision of the model can be estimated more accurately, the sensitivity of the model to the compression strategy can be determined more accurately, and the search space of the compression strategy is optimized.
In the process 300 of constructing the search space of the preset compression strategy in this embodiment, the sensitivities of the candidate quantization method and the candidate pruning method corresponding to each network structural unit can be analyzed, and the combination of the candidate quantization method and the candidate pruning method conforming to the sensitivity screening condition is reserved in the corresponding search space, so that the search efficiency of the optimal compression strategy is further improved, and the consumption of operation resources is reduced.
With further reference to fig. 4, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for generating a neural network model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 4, the apparatus 400 for generating a neural network model provided in the present embodiment includes an execution unit 401 configured to perform a plurality of iterative search operations; the execution unit 401 includes: the search unit 4011 is configured to perform the following steps in the iterative search operation: determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; the compression unit 4012 is configured to perform the following steps in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and obtaining the performance of the current compressed model; the feedback unit 4013 is configured to perform the following steps in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; the determination unit 4014 is configured to perform the following steps in the iterative search operation: determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches the preset convergence condition.
In this embodiment, in the apparatus 400 for generating a neural network model: specific processes of the search unit 4011, the compression unit 4012, the feedback unit 4013 and the determination unit 4014 and technical effects thereof can refer to the relevant descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, and are not repeated here.
In some embodiments, the apparatus further comprises: a construction unit configured to construct a search space of a preset compression policy; the construction unit comprises: the first screening unit is configured to determine first sensitivity information of each network structure unit in the preset neural network model to pruning methods in the preset pruning method set, and respectively screen candidate pruning methods which accord with sensitivity screening conditions corresponding to each network structure unit from the preset pruning method set based on the first sensitivity information, wherein the first sensitivity information characterizes the degradation rate of the performance of the corresponding neural network model when the pruning strategy of the network structure unit is changed from other pruning strategies to candidate pruning strategies; the second screening unit is configured to determine second sensitivity information of each network structure unit in the preset neural network model to the quantization methods in the preset quantization method set, and respectively screen candidate quantization methods which accord with sensitivity screening conditions corresponding to each network structure unit from the preset quantization method set based on the second sensitivity information, wherein the second sensitivity information characterizes the degradation rate of the performance of the corresponding neural network model when the quantization strategy of the network structure unit is changed from other quantization strategies to candidate quantization strategies; and the combination unit is configured to combine the candidate pruning method and the candidate quantization method corresponding to each network structure unit so as to construct a search space of the compression strategy corresponding to each corresponding network structure unit.
In some embodiments, the performance of the model includes at least the accuracy of the model and/or the running delay of the model; the first sensitivity information includes: the decreasing rate of the accuracy, or the increasing rate of the running delay, of the corresponding neural network model when the pruning strategy of a network structural unit is changed from another pruning strategy to the candidate pruning strategy, or the decreasing rate of a comprehensive performance index of the corresponding neural network model when the pruning strategy of a network structural unit is changed from another pruning strategy to the candidate pruning strategy, where the comprehensive performance index is calculated based on the accuracy and the running delay of the model; the second sensitivity information includes: the decreasing rate of the accuracy, or the increasing rate of the running delay, of the corresponding neural network model when the quantization strategy of a network structural unit is changed from another quantization strategy to the candidate quantization strategy, or the decreasing rate of the comprehensive performance index of the corresponding neural network model when the quantization strategy of a network structural unit is changed from another quantization strategy to the candidate quantization strategy.
In some embodiments, the compression unit is further configured to acquire the performance of the compressed model as follows: training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
In some embodiments, the apparatus further comprises: a processing unit configured to send the target neural network model to a task execution end of media data processing, so as to process the media data at the task execution end by using the target neural network model.
The apparatus provided in the above embodiment of the present application performs, by the execution unit 401, a plurality of iterative search operations including: determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and obtaining the performance of the current compressed model; generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches the preset convergence condition. The apparatus can reduce the memory space occupied by the neural network model while ensuring the accuracy of the neural network model.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
As shown in fig. 5, is a block diagram of an electronic device of a method of generating a neural network model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple electronic devices may be connected, each providing some of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 501 is illustrated in fig. 5.
Memory 502 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of generating a neural network model provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of generating a neural network model provided by the present application.
The memory 502 is a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules (e.g., the execution unit 401, the search unit 4011, the compression unit 4012, the feedback unit 4013, and the determination unit 4014 shown in fig. 4) corresponding to the method for generating the neural network model in the embodiment of the present application. The processor 501 executes various functional applications of the server and data processing, that is, implements the method of generating a neural network model in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 502.
Memory 502 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created from the use of the electronic device that generated the neural network model, and the like. In addition, memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 502 may optionally include memory remotely located relative to processor 501, which may be connected via a network to the electronic device generating the neural network model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of generating a neural network model may further include: an input device 503 and an output device 504. The processor 501, memory 502, input devices 503 and output devices 504 may be connected by a bus 505 or otherwise, in fig. 5 by way of example by bus 505.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device generating the neural network model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, and the like. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution provided by the embodiments of the present application, the memory space occupied by the neural network model is reduced while the accuracy of the neural network model is ensured.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (12)

1. A method of generating a neural network model for image processing, comprising performing a plurality of iterative search operations; the iterative search operation includes the steps of:
determining a target compression strategy of a preset neural network model in a search space of the preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization, and the preset neural network model is used for executing an image processing task;
pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model based on a test dataset of the image processing task, wherein the performance comprises hardware delay generated by model operation and the accuracy of the compressed model in extracting image features for the image processing task, and the test dataset comprises an image dataset;
generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller;
determining, in response to determining that the feedback information reaches a preset convergence condition, that the current compressed model is the generated target neural network model;
the method further comprises the steps of: the construction of the search space of the preset compression strategy comprises the following steps: determining first sensitivity information of each network structure unit in a preset neural network model to pruning methods in a preset pruning method set, and respectively screening candidate pruning methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset pruning method set based on the first sensitivity information, wherein the first sensitivity information represents the performance degradation rate of the corresponding neural network model when the pruning strategy of the network structure unit is changed from other pruning strategies to candidate pruning strategies; screening candidate quantization methods which meet sensitivity screening conditions corresponding to each network structural unit from a preset quantization method set based on the second sensitivity information; and combining the candidate pruning method and the candidate quantization method corresponding to each network structural unit to construct a search space of a compression strategy corresponding to each corresponding network structural unit.
2. The method of claim 1, wherein the screening, from the preset quantization method set based on the second sensitivity information, of candidate quantization methods that satisfy the sensitivity screening condition corresponding to each network structural unit comprises:
determining second sensitivity information of each network structural unit in the preset neural network model with respect to quantization methods in the preset quantization method set, and screening, from the preset quantization method set based on the second sensitivity information, candidate quantization methods that satisfy the sensitivity screening condition corresponding to each network structural unit, wherein the second sensitivity information characterizes the rate at which the performance of the corresponding neural network model degrades when the quantization strategy of the network structural unit is changed from another quantization strategy to a candidate quantization strategy.
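The sensitivity-based screening of claims 1 and 2 can be illustrated with a minimal sketch such as the following. The helper callables and the 5% degradation threshold are assumptions, since the claims do not fix a particular screening condition; the same routine applies to pruning methods and to quantization methods.

    def screen_candidate_methods(unit, method_set, apply_fn, measure_fn,
                                 baseline_performance, max_degradation_rate=0.05):
        """Keep only the methods whose performance-degradation rate for this network
        structural unit stays within the screening condition (threshold assumed)."""
        candidates = []
        for method in method_set:
            performance = measure_fn(apply_fn(unit, method))
            degradation_rate = (baseline_performance - performance) / baseline_performance
            if degradation_rate <= max_degradation_rate:   # sensitivity screening condition
                candidates.append(method)
        return candidates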
3. The method according to claim 2, wherein the performance of the model comprises at least the accuracy of the model and/or the running latency of the model; and
the first sensitivity information includes: the rate of decrease in accuracy or the rate of increase in running latency of the neural network model when the pruning strategy of the network structural unit is changed from another pruning strategy to the candidate pruning strategy, or the rate of decrease in a comprehensive performance index of the neural network model when the pruning strategy of the network structural unit is changed from another pruning strategy to the candidate pruning strategy, wherein the comprehensive performance index is calculated based on the accuracy and the running latency of the model;
the second sensitivity information includes: the rate of decrease in accuracy or the rate of increase in running latency of the neural network model when the quantization strategy of the network structural unit is changed from another quantization strategy to the candidate quantization strategy, or the rate of decrease in the comprehensive performance index of the neural network model when the quantization strategy of the network structural unit is changed from another quantization strategy to the candidate quantization strategy.
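The three sensitivity variants of claim 3 can be written as simple rate formulas, for example as below. The weighted form of the comprehensive performance index is an assumption made for this sketch; the claims only state that the index is calculated from the accuracy and the running latency of the model.

    def accuracy_drop_rate(accuracy_before, accuracy_after):
        # Rate of decrease in accuracy when the unit's strategy is switched.
        return (accuracy_before - accuracy_after) / accuracy_before

    def latency_increase_rate(latency_before, latency_after):
        # Rate of increase in running latency when the unit's strategy is switched.
        return (latency_after - latency_before) / latency_before

    def comprehensive_index(accuracy, latency, alpha=0.7):
        # Assumed index form: reward accuracy, penalize latency.
        return alpha * accuracy - (1.0 - alpha) * latency

    def index_drop_rate(acc_before, lat_before, acc_after, lat_after, alpha=0.7):
        # Rate of decrease in the comprehensive performance index.
        before = comprehensive_index(acc_before, lat_before, alpha)
        after = comprehensive_index(acc_after, lat_after, alpha)
        return (before - after) / before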
4. The method of claim 1, wherein the acquiring the performance of the current compressed model comprises:
training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
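A minimal sketch of this train-then-test evaluation follows, with hypothetical train_fn and test_fn callables; the number of fine-tuning epochs is an arbitrary assumption.

    def acquire_performance(compressed_model, train_fn, test_fn,
                            sample_data, test_data, epochs=3):
        # Briefly train (fine-tune) the compressed model on the sample data, then
        # test the trained model on held-out test data, e.g. returning accuracy
        # and running latency.
        for _ in range(epochs):
            train_fn(compressed_model, sample_data)
        return test_fn(compressed_model, test_data)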
5. The method of any of claims 1-4, wherein the method further comprises:
sending the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end by using the target neural network model.
6. An apparatus for generating a neural network model for image processing, comprising:
an execution unit configured to perform a plurality of iterative search operations;
the execution unit includes:
a search unit configured to perform the following step in the iterative search operation: determining, by using a preset compression strategy controller, a target compression strategy for a preset neural network model within a search space of preset compression strategies, wherein the compression strategy comprises a combined pruning and quantization strategy, and the preset neural network model is used for performing an image processing task;
a compression unit configured to perform the following step in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model based on a test data set of the image processing task, wherein the performance comprises the hardware latency incurred when the model runs and the accuracy with which the compressed model extracts image features for the image processing task, and the test data set comprises an image data set;
a feedback unit configured to perform the following step in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not satisfy a preset convergence condition, and performing the next iterative search operation based on the updated compression strategy controller;
a determining unit configured to perform the following step in the iterative search operation: in response to determining that the feedback information satisfies the preset convergence condition, determining the current compressed model as the generated target neural network model;
wherein the apparatus further comprises: a construction unit configured to construct the search space of the preset compression strategy;
the construction unit includes: a first screening unit configured to determine first sensitivity information of each network structural unit in the preset neural network model with respect to pruning methods in a preset pruning method set, and to screen, from the preset pruning method set based on the first sensitivity information, candidate pruning methods that satisfy a sensitivity screening condition corresponding to each network structural unit, wherein the first sensitivity information characterizes the rate at which the performance of the corresponding neural network model degrades when the pruning strategy of the network structural unit is changed from another pruning strategy to a candidate pruning strategy; a second screening unit configured to screen, from a preset quantization method set based on second sensitivity information, candidate quantization methods that satisfy the sensitivity screening condition corresponding to each network structural unit; and a combination unit configured to combine the candidate pruning methods and the candidate quantization methods corresponding to each network structural unit to construct a search space of compression strategies corresponding to that network structural unit.
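The combination step performed by the combination unit can be illustrated as below. Treating the per-unit search space as the Cartesian product of that unit's candidate pruning and quantization methods, and the example method names, are assumptions made for this sketch.

    from itertools import product

    def build_search_space(candidate_pruning, candidate_quantization):
        """Per-unit search space as the Cartesian product (an assumption) of that
        unit's candidate pruning methods and candidate quantization methods."""
        return {
            unit: list(product(candidate_pruning[unit], candidate_quantization[unit]))
            for unit in candidate_pruning
        }

    # Example: {"conv1": ["prune_30", "prune_50"]} pruning candidates combined with
    # {"conv1": ["int8", "int4"]} quantization candidates yield four combined
    # compression strategies for the "conv1" unit.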
7. The apparatus of claim 6, wherein
the second screening unit is further configured to determine second sensitivity information of each network structural unit in the preset neural network model with respect to quantization methods in the preset quantization method set, and to screen, from the preset quantization method set based on the second sensitivity information, candidate quantization methods that satisfy the sensitivity screening condition corresponding to each network structural unit, wherein the second sensitivity information characterizes the rate at which the performance of the corresponding neural network model degrades when the quantization strategy of the network structural unit is changed from another quantization strategy to a candidate quantization strategy.
8. The apparatus of claim 7, wherein the performance of the model comprises at least the accuracy of the model and/or the running latency of the model; and
the first sensitivity information includes: the rate of decrease in accuracy or the rate of increase in running latency of the neural network model when the pruning strategy of the network structural unit is changed from another pruning strategy to the candidate pruning strategy, or the rate of decrease in a comprehensive performance index of the neural network model when the pruning strategy of the network structural unit is changed from another pruning strategy to the candidate pruning strategy, wherein the comprehensive performance index is calculated based on the accuracy and the running latency of the model;
the second sensitivity information includes: the rate of decrease in accuracy or the rate of increase in running latency of the neural network model when the quantization strategy of the network structural unit is changed from another quantization strategy to the candidate quantization strategy, or the rate of decrease in the comprehensive performance index of the neural network model when the quantization strategy of the network structural unit is changed from another quantization strategy to the candidate quantization strategy.
9. The apparatus of claim 6, wherein the compression unit is further configured to acquire the performance of the compressed model as follows:
training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
10. The apparatus according to any one of claims 6-9, wherein the apparatus further comprises:
a processing unit configured to send the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end by using the target neural network model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
CN202010503073.8A 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model Active CN111667054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010503073.8A CN111667054B (en) 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010503073.8A CN111667054B (en) 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model

Publications (2)

Publication Number Publication Date
CN111667054A CN111667054A (en) 2020-09-15
CN111667054B true CN111667054B (en) 2023-09-01

Family

ID=72386527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010503073.8A Active CN111667054B (en) 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model

Country Status (1)

Country Link
CN (1) CN111667054B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149266A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining network model quantization strategy
CN112307207B (en) * 2020-10-31 2024-05-14 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN112529189A (en) * 2020-11-10 2021-03-19 北京百度网讯科技有限公司 Model compression method and device, electronic equipment and storage medium
CN112379869A (en) * 2020-11-13 2021-02-19 广东电科院能源技术有限责任公司 Standardized development training platform
CN112580804B (en) * 2020-12-23 2024-04-05 中国科学院上海微系统与信息技术研究所 Determination method and device for target image processing model and storage medium
CN112686382B (en) * 2020-12-30 2022-05-17 中山大学 Convolution model lightweight method and system
CN112836801A (en) * 2021-02-03 2021-05-25 上海商汤智能科技有限公司 Deep learning network determination method and device, electronic equipment and storage medium
CN112580639B (en) * 2021-03-01 2021-08-13 四川大学 Early gastric cancer image identification method based on evolutionary neural network model compression
CN115037608A (en) * 2021-03-04 2022-09-09 维沃移动通信有限公司 Quantization method, device, equipment and readable storage medium
CN115205170A (en) * 2021-04-09 2022-10-18 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN113361707A (en) * 2021-05-25 2021-09-07 同济大学 Model compression method, system and computer readable medium
CN113269312B (en) * 2021-06-03 2021-11-09 华南理工大学 Model compression method and system combining quantization and pruning search
CN113469277A (en) * 2021-07-21 2021-10-01 浙江大华技术股份有限公司 Image recognition method and device
CN114239792B (en) * 2021-11-01 2023-10-24 荣耀终端有限公司 System, apparatus and storage medium for image processing using quantization model
CN114925821B (en) * 2022-01-05 2023-06-27 华为技术有限公司 Compression method and related system of neural network model
CN114817473A (en) * 2022-05-09 2022-07-29 北京百度网讯科技有限公司 Methods, apparatus, devices, media and products for compressing semantic understanding models
CN115271043B (en) * 2022-07-28 2023-10-20 小米汽车科技有限公司 Model tuning method, device and storage medium
CN116185307B (en) * 2023-04-24 2023-07-04 之江实验室 Storage method and device of model data, storage medium and electronic equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993298B (en) * 2017-12-29 2023-08-08 百度在线网络技术(北京)有限公司 Method and apparatus for compressing neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119745A (en) * 2019-04-03 2019-08-13 平安科技(深圳)有限公司 Compression method, device, computer equipment and the storage medium of deep learning model
CN110619385A (en) * 2019-08-31 2019-12-27 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector
CN110852421A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Model generation method and device
CN110852438A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Model generation method and device
CN110969251A (en) * 2019-11-28 2020-04-07 中国科学院自动化研究所 Neural network model quantification method and device based on label-free data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Neural Network Model Compression and Implementation Based on YOLOv3; Zhang Yaping et al.; Micro-Nano Electronics and Intelligent Manufacturing; Vol. 2, No. 1; pp. 79-84 *

Also Published As

Publication number Publication date
CN111667054A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111667054B (en) Method, device, electronic equipment and storage medium for generating neural network model
CN110852421B (en) Model generation method and device
CN111539514B (en) Method and apparatus for generating a structure of a neural network
CN111582453B (en) Method and device for generating neural network model
CN111539479B (en) Method and device for generating sample data
CN111582479B (en) Distillation method and device for neural network model
CN111582454B (en) Method and device for generating neural network model
CN111667056B (en) Method and apparatus for searching model structures
CN111563593B (en) Training method and device for neural network model
CN111738419B (en) Quantification method and device for neural network model
CN111311321B (en) User consumption behavior prediction model training method, device, equipment and storage medium
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN112559870B (en) Multi-model fusion method, device, electronic equipment and storage medium
CN111460384B (en) Policy evaluation method, device and equipment
CN110704509A (en) Data classification method, device, equipment and storage medium
EP3876166A2 (en) Method and apparatus for determining network model pruning strategy, device and storage medium
CN111563592B (en) Neural network model generation method and device based on super network
CN111639753A (en) Method, apparatus, device and storage medium for training a hyper-network
CN111582477A (en) Training method and device of neural network model
CN111767477B (en) Retrieval method, retrieval device, electronic equipment and storage medium
CN111652354B (en) Method, apparatus, device and storage medium for training super network
CN111767833A (en) Model generation method and device, electronic equipment and storage medium
CN111738418A (en) Training method and device for hyper network
CN112580723B (en) Multi-model fusion method, device, electronic equipment and storage medium
CN113204614A (en) Model training method, method and device for optimizing training data set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant