CN111667054A - Method and device for generating neural network model, electronic equipment and storage medium - Google Patents

Method and device for generating neural network model, electronic equipment and storage medium

Info

Publication number
CN111667054A
Authority
CN
China
Prior art keywords
neural network
strategy
model
pruning
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010503073.8A
Other languages
Chinese (zh)
Other versions
CN111667054B (en)
Inventor
希滕
张刚
温圣召
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010503073.8A priority Critical patent/CN111667054B/en
Publication of CN111667054A publication Critical patent/CN111667054A/en
Application granted granted Critical
Publication of CN111667054B publication Critical patent/CN111667054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a method, an apparatus, an electronic device, and a storage medium for generating a neural network model, and relate to the technical fields of artificial intelligence, deep learning, and image processing. The specific implementation scheme is as follows: performing a plurality of iterative search operations, each iterative search operation comprising the steps of: determining a target compression strategy of a preset neural network model in a search space of preset compression strategies by using a preset compression strategy controller, wherein the compression strategy comprises a combined strategy of pruning and quantization; pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model; acquiring the performance of the current compressed model; generating feedback information based on the performance of the compressed model; and determining the current compressed model as the generated target neural network model in response to determining that the feedback information reaches a preset convergence condition. The method can search out an optimal model compression strategy.

Description

Method and device for generating neural network model, electronic equipment and storage medium
Technical Field
Embodiments of the present application relate to the field of computer technologies, and further relate to the field of artificial intelligence, deep learning, and image processing technologies, and in particular, to a method and an apparatus for generating a neural network model, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence technology, the performance of deep neural networks has reached an unprecedented level. Complex models perform well, but their huge storage footprint and high consumption of computational resources make it difficult to deploy them effectively on various hardware platforms and to provide real-time services.
Disclosure of Invention
A method, an apparatus, an electronic device, and a storage medium for generating a neural network model are provided.
According to a first aspect, there is provided a method of generating a neural network model, the method comprising performing a plurality of iterative search operations; the iterative search operation includes the following steps: determining a target compression strategy of a preset neural network model in a search space of preset compression strategies by using a preset compression strategy controller, wherein the compression strategy comprises a combination strategy of pruning and quantization; pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model; generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and in response to determining that the feedback information reaches the preset convergence condition, determining the current compressed model as the generated target neural network model.
According to a second aspect, there is provided an apparatus for generating a neural network model, the apparatus comprising: an execution unit configured to perform a plurality of iterative search operations; the execution unit includes: a search unit configured to perform the following step in the iterative search operation: determining a target compression strategy of a preset neural network model in a search space of preset compression strategies by using a preset compression strategy controller, wherein the compression strategy comprises a combination strategy of pruning and quantization; a compression unit configured to perform the following step in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model; a feedback unit configured to perform the following step in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and a determination unit configured to perform the following step in the iterative search operation: in response to determining that the feedback information reaches the preset convergence condition, determining the current compressed model as the generated target neural network model.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
According to the technology of the present application, the memory space occupied by the neural network model can be reduced while the accuracy of the neural network model is maintained.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of generating a neural network model according to the present application;
FIG. 3 is a flow diagram of one embodiment of constructing a search space for a preset compression strategy;
FIG. 4 is a schematic diagram of an embodiment of an apparatus for generating a neural network model according to the present application;
FIG. 5 is a schematic block diagram of an electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary architecture 100 to which the method of generating a neural network model or the apparatus for generating a neural network model of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as an image processing application, an information analysis application, a voice assistant application, a shopping application, a financial application, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting internet access, including but not limited to smart phones, tablet computers, notebook computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server that provides various services, such as a backend server that provides backend support for applications installed on the terminal devices 101, 102, 103. For example, the server 105 may receive information to be processed sent by the terminal devices 101, 102, 103, process the information using the neural network model, and return the processing results to the terminal devices 101, 102, 103.
In an application scenario of the present disclosure, the server 105 may search for an appropriate compression strategy based on a pre-trained neural network model, so that the compressed neural network model is suitable for running on the terminal devices 101, 102, 103.
It should be noted that the method for generating the neural network model provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for generating the neural network model is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating a neural network model in accordance with the present application is shown. The method of generating a neural network model includes performing a plurality of iterative search operations. Specifically, the iterative search operation includes the steps of:
Step 201, determining a target compression strategy of the preset neural network model in a search space of preset compression strategies by using a preset compression strategy controller.
In this embodiment, an executing entity (for example, the server 105 shown in fig. 1) of the method for generating a neural network model may determine a target compression policy of the preset neural network model in a search space of preset compression policies by using a preset compression policy controller. Here, the preset neural network model is a model constructed or trained based on a deep learning method, and may be used to perform a deep learning task, such as image processing. The compression strategy comprises a combination strategy of pruning and quantization; the combination strategy of pruning and quantization may represent the quantization and pruning methods adopted for the neural network model, and pruning and quantization are two important methods for compressing a neural network model. Pruning is a method of removing unimportant network units in the neural network model according to a certain rule, so as to reduce the operations corresponding to these unimportant network units during the running of the model and to reduce the occupied memory space, for example, pruning the convolution kernels of a certain layer of the neural network model. A pruning policy may also specify a pruning rate for each layer of the neural network model; common pruning rates are 5%, 10%, 15%, 20%, ..., 95%, and so on. Quantization is a method of storing floating point numbers expressed with high bit widths in a low bit-width form to reduce the occupied memory space, for example, quantizing 64-bit and 32-bit values into 16 bits, 8 bits, 4 bits, 2 bits, and the like.
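As a hedged illustration (not the patented procedure itself), the sketch below shows one common way a single layer's pruning rate and quantization bit width could be applied to its weight matrix; the magnitude-based pruning rule, the uniform quantization scheme, and the function names are assumptions made only for this example.

```python
import numpy as np

def prune_by_rate(weights: np.ndarray, prune_rate: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that roughly `prune_rate` of them are removed."""
    threshold = np.quantile(np.abs(weights), prune_rate)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize_uniform(weights: np.ndarray, num_bits: int) -> np.ndarray:
    """Simulate uniform quantization: snap values to a grid with 2**num_bits levels."""
    levels = 2 ** num_bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / levels if levels > 0 else 1.0
    return np.round((weights - w_min) / scale) * scale + w_min

# Example: compress one layer with a 20% pruning rate and 8-bit quantization.
layer_weights = np.random.randn(64, 64).astype(np.float32)
compressed = quantize_uniform(prune_by_rate(layer_weights, 0.20), num_bits=8)
```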
The compression strategy controller may be used to control or generate compression strategies for neural network models, and may be implemented as various machine learning algorithms, such as recurrent neural networks, reinforcement learning algorithms, genetic algorithms, and so forth. The compression policy controller may select and combine a pruning policy and a quantization policy in the search space of preset compression policies to generate the target compression policy of the preset neural network model. Optionally, the executing entity may generate a coding sequence through the compression policy controller, and then decode the coding sequence according to a predefined correspondence between codes and compression policies to obtain the target compression policy of the preset neural network model.
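To make the optional encoding/decoding step concrete, here is a minimal sketch assuming a hypothetical correspondence table between code values and per-layer (pruning rate, bit width) choices; the actual codebook used by a controller is implementation-specific.

```python
# Hypothetical correspondence between codes and per-layer compression choices.
PRUNE_CODEBOOK = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.50}  # pruning rates
QUANT_CODEBOOK = {0: 16, 1: 8, 2: 4, 3: 2}              # quantization bit widths

def decode_strategy(coding_sequence):
    """Decode a flat code sequence [p0, q0, p1, q1, ...] into per-layer strategies."""
    strategy = []
    for prune_code, quant_code in zip(coding_sequence[0::2], coding_sequence[1::2]):
        strategy.append({"prune_rate": PRUNE_CODEBOOK[prune_code],
                         "num_bits": QUANT_CODEBOOK[quant_code]})
    return strategy

# A controller emitting [1, 1, 3, 0] would mean: layer 0 -> 10% pruning + 8-bit,
# layer 1 -> 50% pruning + 16-bit.
print(decode_strategy([1, 1, 3, 0]))
```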
Step 202, pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model.
In this embodiment, the executing entity may prune and quantize the preset neural network model according to a combination strategy of pruning and quantization in the target compression strategy, and test the performance of the compressed model. The performance of the neural network model may include, but is not limited to, at least one of: computational efficiency, accuracy, computational complexity, etc.
Here, the order of pruning and quantizing the preset neural network model may not be limited.
Alternatively, the search space may include the optional order of the pruning and quantization operations, and the target compression policy found in each iterative search may specify an order of pruning and quantization. The executing entity may then perform the pruning and quantization operations in the order specified in the target compression policy.
After the neural network model is compressed, the performance of the compressed model can be tested using the test data set based on a specified deep learning task. Here, the test data set may be a collection of media data such as a text data set, an image data set, an audio data set, and the like. For example, the specified deep learning task may be to extract image features, and then the accuracy of extracting image features by the compressed model may be obtained as the performance of the current compressed model, and for example, the specified deep learning task may be to extract text features, and then the accuracy of extracting text features by the compressed model may be obtained as the performance of the current compressed model.
Step 203, generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach the preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller.
In this embodiment, the executing entity may use the performance of the compressed model as feedback information to guide the preset compression policy controller to update its policy generation manner, and then execute the next iterative search operation. When it is determined that the feedback information does not reach the preset convergence condition, the feedback information representing the performance of the compressed model is used to guide the update of the preset compression strategy controller, and the updated compression strategy controller is used to execute the next iterative search operation, such as determining a new target compression strategy of the preset neural network model in the search space of preset compression strategies.
As an example, when the compression strategy controller is implemented as a recurrent neural network, parameters of the recurrent neural network may be adjusted by using a gradient descent method based on the feedback information, so that the recurrent neural network model after adjusting the parameters searches a new compression strategy from the search space. When the compression strategy controller described above is implemented as a reinforcement learning algorithm, this feedback information acts as a reward value (reward) to direct the reinforcement learning algorithm to update the action (action) and state (state) parameters, thereby generating a new compression strategy.
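The following is a minimal, hedged sketch of the feedback-driven update idea, using a simple preference/softmax controller in place of the recurrent network or reinforcement-learning agent mentioned above; the class name, learning rate, and reward value are placeholders.

```python
import numpy as np

class SimpleController:
    """Toy stand-in for a compression strategy controller: keeps a preference score
    per candidate strategy and samples strategies from softmax(preferences)."""

    def __init__(self, num_strategies: int, lr: float = 0.1):
        self.prefs = np.zeros(num_strategies)
        self.lr = lr

    def sample(self) -> int:
        probs = np.exp(self.prefs - self.prefs.max())
        probs /= probs.sum()
        return int(np.random.choice(len(self.prefs), p=probs))

    def update(self, strategy_idx: int, reward: float) -> None:
        # Feedback (e.g. the compressed model's accuracy) raises the preference
        # for strategies that performed well, so they are sampled more often.
        self.prefs[strategy_idx] += self.lr * reward

# One iteration of the search loop: sample, evaluate, feed the reward back.
controller = SimpleController(num_strategies=16)
idx = controller.sample()
reward = 0.91  # placeholder for the measured performance of the compressed model
controller.update(idx, reward)
```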
Step 204, in response to determining that the feedback information reaches the preset convergence condition, determining the current compressed model as the generated target neural network model.
In this embodiment, after generating the feedback information, the executing entity may determine whether the feedback information reaches a preset convergence condition, for example, whether the change rate of the feedback information over the last several consecutive iterative search operations is lower than a preset change-rate threshold. If so, the iterative search operations may be stopped, the current compression policy may be taken as the optimal target compression policy found by the search, and the model obtained by pruning and quantizing the preset neural network model with this target compression policy may be taken as the target neural network model. Here, the target neural network model may be a neural network model for performing a deep learning task.
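As a concrete but hypothetical example of the convergence test just described, the sketch below checks whether the relative change of the feedback value over a sliding window of recent iterations stays below a threshold; the window size and threshold values are assumptions.

```python
def has_converged(feedback_history, window=5, max_change_rate=0.01):
    """Return True if the relative change of the feedback value over the last
    `window` steps stays below `max_change_rate`."""
    if len(feedback_history) < window + 1:
        return False
    recent = feedback_history[-(window + 1):]
    changes = [abs(b - a) / (abs(a) + 1e-12) for a, b in zip(recent, recent[1:])]
    return max(changes) < max_change_rate

# Example: the feedback has stabilized around 0.90, so the search can stop.
history = [0.70, 0.85, 0.899, 0.900, 0.9005, 0.9008, 0.9007, 0.9009]
print(has_converged(history))  # True
```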
In some optional implementations of this embodiment, the executing entity may further obtain the performance of the current compressed model as follows: training the current compressed model based on sample data, and testing the performance of the trained compressed model using test data.
In this implementation, the executing entity may obtain sample data according to the specified deep learning task executed by the preset neural network model, train the current compressed model based on the sample data, and test the performance of the trained compressed model using test data. The sample data may include input data for the model and the data the model is expected to output, which are used to train the compressed model.
Through this implementation, the compressed model is further trained after the neural network model is compressed, the parameters of the compressed model are optimized, and the performance of the compressed model is further improved. Because the structure and parameters of the model have already been compressed, training the compressed model does not occupy excessive computing resources. Moreover, performing the performance evaluation on the trained compressed model allows the compression strategy to be evaluated more accurately, which, combined with the above embodiment, improves the optimal compression strategy found by the search and the performance of the generated target neural network model.
In some optional implementations of this embodiment, the method for generating a neural network model further includes: sending the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end using the target neural network model. The media data may include image data, and the target neural network model is a compressed neural network model for processing the image data.
In this implementation manner, the executing body may send the generated target neural network model to a task executing end for media data processing, so as to implement processing of media data by using the target neural network model at the task executing end. Here, the task performing end of the media data processing may be a mobile client. The execution main body can support the application program on the mobile client to run by using the target neural network model after sending the target neural network model to the task execution end of the media data processing.
Because the target neural network model is a compressed model, it runs efficiently and consumes fewer computing resources such as memory, which lowers the requirements on the hardware environment in which the model runs; the model can therefore be adapted to task-execution-end devices with lower hardware configurations while still producing accurate processing results. In addition, the amount of computation needed for the deep learning task is reduced after compression, and the hardware latency incurred by running the model can be effectively reduced, so that the real-time requirements of the task execution end are met. Through this implementation, applying the compressed target neural network model in an application program can improve the responsiveness of the application.
The above-described target neural network model may be used to perform image processing tasks, where image data is typically converted to matrix data in the processing of the neural network model, involving a large number of matrix operations. By searching and optimizing a combined compression strategy of pruning and quantization and compressing the neural network model for executing the image processing through the process, the operation amount of matrix operation in the image processing process can be effectively reduced, and the image processing efficiency is improved.
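As an illustrative back-of-the-envelope check (assuming a sparse execution engine that skips zero-valued weights), the sketch below counts how many multiply-accumulate operations a pruned weight matrix needs compared with its dense counterpart.

```python
import numpy as np

def dense_mac_count(weight_shape, input_cols):
    """Multiply-accumulate operations for a dense product W (rows x cols) @ X (cols x input_cols)."""
    rows, cols = weight_shape
    return rows * cols * input_cols

def pruned_mac_count(weights, input_cols):
    """Only non-zero weights contribute MACs when zeros are skipped."""
    return int(np.count_nonzero(weights)) * input_cols

w = np.random.randn(128, 128)
w[np.abs(w) < np.quantile(np.abs(w), 0.5)] = 0.0  # 50% magnitude pruning
print(dense_mac_count(w.shape, 32), pruned_mac_count(w, 32))  # pruned is roughly half
```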
In some embodiments, the method may further include the step of constructing a search space of the preset compression strategy. Referring to fig. 3, fig. 3 is a flowchart 300 of an embodiment of constructing a search space of a preset compression strategy, including the following steps:
step 301, determining first sensitivity information of each network structure unit in the preset neural network model to a pruning method in the preset pruning method set, and respectively screening out candidate pruning methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset pruning method set based on the first sensitivity information.
In this embodiment, the first sensitivity information characterizes the degradation rate of the performance of the corresponding neural network model when the pruning strategy of a network structure unit is changed from another pruning strategy to a candidate pruning strategy. Here, the set of pruning methods may include a plurality of candidate pruning methods; for example, a candidate pruning method may be a pruning rate for each network structure unit of the neural network, and the pruning rates may be, for example, 5%, 10%, 15%, 20%, ..., 95%. A network structure unit may be a single layer, or a module formed by stacking a plurality of layers, for example, a convolution module formed by stacking a plurality of convolutional layers, or a residual block in a Resnet network.
The sensitivity screening condition may be that the first sensitivity information does not exceed a preset threshold, or may be that the sensitivity is not the highest of all candidate pruning methods. The sensitivity of different network structure units to the same pruning mode (pruning rate) is different, and the sensitivity screening conditions corresponding to different network structure units can be different.
For example, suppose that when the pruning strategy of a certain layer of the preset neural network model is changed from another pruning strategy (a pruning rate of 95%) to a candidate pruning strategy (a pruning rate of 85%), with the pruning strategies of the other layers unchanged, the performance degradation rate of the preset neural network model exceeds a preset threshold of 20%. Then the layer is highly sensitive to the candidate pruning strategy, and the candidate pruning strategy is not suitable as the pruning strategy for that layer; in this case, the candidate pruning strategy, or any compression strategy containing it, can be deleted from the compression strategy search space of that layer or of the neural network model. Conversely, if the performance degradation rate of the preset neural network model does not exceed 2%, it may be determined that the layer has low sensitivity to the candidate pruning strategy, and the candidate pruning strategy may be retained in the compression strategy search space.
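A minimal sketch of this sensitivity measurement and screening is given below; `evaluate_model`, `apply_strategy`, and the 2% screening threshold are placeholders standing in for whatever evaluation pipeline and threshold an implementation actually uses.

```python
def pruning_sensitivity(evaluate_model, apply_strategy, base_strategy,
                        unit_name, candidate_rate):
    """First sensitivity: relative performance drop when only `unit_name`'s
    pruning rate is switched to `candidate_rate`, all other units unchanged."""
    base_perf = evaluate_model(apply_strategy(base_strategy))
    trial_strategy = dict(base_strategy)
    trial_strategy[unit_name] = candidate_rate
    trial_perf = evaluate_model(apply_strategy(trial_strategy))
    return (base_perf - trial_perf) / base_perf

def screen_candidates(sensitivity_by_rate, threshold=0.02):
    """Keep only the candidate pruning rates whose sensitivity is at most `threshold`."""
    return [rate for rate, s in sensitivity_by_rate.items() if s <= threshold]
```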
Step 302, determining second sensitivity information of each network structure unit in the preset neural network model to a quantization method in the preset quantization method set, and respectively screening out candidate quantization methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset quantization method set based on the second sensitivity information.
In this embodiment, the second sensitivity information represents a degradation rate of the performance of the corresponding neural network model when the quantization strategy of the network structure unit is changed from another quantization strategy to a candidate quantization strategy. Here, the candidate quantization method set may include a plurality of candidate quantization methods, for example, the candidate quantization methods may be quantization bit widths (bit values) for each network structure unit of the neural network, and the quantization bit widths (bit values) may be, for example, 16 bits, 8 bits, 4 bits, and 2 bits. The network structure unit may be a single layer, or may be a module formed by stacking a plurality of layers.
The sensitivity screening condition may be that the second sensitivity information does not exceed a preset threshold, or may be that the sensitivity is not the highest of all candidate quantification methods. The sensitivity of different network structure units to the same quantization mode (quantization bit width) is different, and the sensitivity screening conditions corresponding to different network structure units can be different.
For example, suppose that when the quantization strategy of a certain layer of the preset neural network model is changed from another quantization strategy (32 bits) to a candidate quantization strategy (16 bits), with the quantization strategies of the other layers unchanged, the performance degradation rate of the preset neural network model exceeds a preset threshold of 20%. Then the layer is highly sensitive to the candidate quantization strategy, and the candidate quantization strategy is not suitable as the quantization strategy for that layer; in this case, the candidate quantization strategy, or any compression strategy containing it, can be deleted from the compression strategy search space of that layer or of the neural network model. Conversely, if the performance degradation rate of the preset neural network model does not exceed 2%, it may be determined that the layer has low sensitivity to the candidate quantization strategy, and the candidate quantization strategy may be retained in the compression strategy search space.

Step 303, combining the candidate pruning method and the candidate quantization method corresponding to each network structure unit to construct a search space of the compression strategy corresponding to each network structure unit.
In this embodiment, the executing entity may combine the candidate pruning methods and the candidate quantization methods corresponding to each network structure unit determined in steps 301 and 302. For example, if the candidate pruning methods of a certain layer are pruning method A and pruning method B, and the candidate quantization methods of that layer are quantization method A and quantization method B, the executing entity may combine them into the pairs (pruning method A, quantization method A), (pruning method B, quantization method A), (pruning method A, quantization method B), and (pruning method B, quantization method B), and construct the search space of the compression strategy corresponding to that layer from these combinations.
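The combination step can be sketched as a simple cross product of the surviving candidates per network structure unit; the unit names, candidate values, and helper function below are illustrative assumptions.

```python
from itertools import product

# Candidate methods that survived the sensitivity screening (illustrative values).
candidate_prune_rates = {"conv1": [0.05, 0.10], "conv2": [0.10, 0.20, 0.50]}
candidate_bit_widths = {"conv1": [8, 16], "conv2": [4, 8]}

def build_search_space(prune_candidates, quant_candidates):
    """Per-unit search space: the cross product of its surviving pruning and
    quantization candidates."""
    return {
        unit: list(product(prune_candidates[unit], quant_candidates[unit]))
        for unit in prune_candidates
    }

space = build_search_space(candidate_prune_rates, candidate_bit_widths)
print(space["conv1"])  # [(0.05, 8), (0.05, 16), (0.1, 8), (0.1, 16)]
```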
In some optional implementations of this embodiment, the performance of the model includes at least the accuracy of the model and/or the running delay of the model. The first sensitivity information includes: the reduction rate of the accuracy, or the increase rate of the running delay, of the corresponding neural network model when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy; or the reduction rate of a comprehensive performance index of the corresponding neural network model when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy, wherein the comprehensive performance index is calculated based on the accuracy and the running delay of the model. The second sensitivity information includes: the reduction rate of the accuracy, or the increase rate of the running delay, of the corresponding neural network model when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy; or the reduction rate of the comprehensive performance index of the corresponding neural network model when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy.
In this implementation manner, the execution subject may use the precision of the model as a performance index of the model, may use the operation delay of the model as a performance index of the model, or may perform comprehensive evaluation on the precision of the model and the operation delay of the model to obtain the performance index of the model.
The executing body may change the pruning method of the designated network structure unit from the first candidate pruning method to the second candidate pruning method, and obtain a reduction rate of the accuracy of the neural network model after the change, or an increase rate of the operation delay, or a comprehensive performance index calculated based on the accuracy of the model and the operation delay, as the first sensitivity of the designated network structure unit to the second candidate pruning method.
The execution subject may change the quantization method of the designated network structure unit from the first candidate quantization method to the second candidate quantization method, and obtain a decrease rate of the accuracy of the neural network model after the change, or an increase rate of the operation delay, or a comprehensive performance index calculated based on the accuracy of the model and the operation delay, as a second sensitivity of the designated network structure unit to the second candidate quantization method.
Through this implementation, a comprehensive performance index calculated from the accuracy and the running delay of the model can be used as the sensitivity, so that the influence of a compression strategy on the running delay and the accuracy of the model can be evaluated more accurately, the sensitivity of the model to the compression strategy can be determined more accurately, and the search space of compression strategies is further optimized.
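As an illustration of how accuracy and running delay could be folded into one comprehensive performance index, the sketch below uses a hypothetical weighted trade-off; the weighting `alpha`, the speedup cap, and the measurement helper are assumptions, not the patented formula.

```python
import time
import numpy as np

def composite_score(accuracy, latency_ms, baseline_latency_ms, alpha=0.7):
    """Hypothetical comprehensive performance index: a weighted trade-off between
    accuracy and the speedup over the uncompressed baseline (higher is better)."""
    speedup = baseline_latency_ms / max(latency_ms, 1e-6)
    return alpha * accuracy + (1.0 - alpha) * min(speedup, 2.0) / 2.0

def measure_latency_ms(run_model, n_runs=20):
    """Average wall-clock latency of `run_model()` in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_runs):
        run_model()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Example: a compressed model that is slightly less accurate but twice as fast
# as the baseline gets credit for the latency improvement as well.
latency = measure_latency_ms(lambda: np.ones((256, 256)) @ np.ones((256, 256)))
print(composite_score(accuracy=0.91, latency_ms=latency, baseline_latency_ms=2 * latency))
```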
In the process 300 for constructing the search space of preset compression strategies in this embodiment, the sensitivity of each network structure unit to the candidate quantization methods and candidate pruning methods may be analyzed, and only combinations of candidate quantization methods and candidate pruning methods that meet the sensitivity screening conditions are retained in the corresponding search space, which further improves the search efficiency of the optimal compression strategy and reduces the consumption of computing resources.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a neural network model, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 4, the apparatus 400 for generating a neural network model provided in this embodiment includes an execution unit 401 configured to perform a plurality of iterative search operations; the execution unit 401 includes: a search unit 4011 configured to perform the following step in the iterative search operation: determining a target compression strategy of a preset neural network model in a search space of preset compression strategies by using a preset compression strategy controller, wherein the compression strategy comprises a combination strategy of pruning and quantization; a compression unit 4012 configured to perform the following step in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy to obtain a current compressed model, and acquiring the performance of the current compressed model; a feedback unit 4013 configured to perform the following step in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller; and a determination unit 4014 configured to perform the following step in the iterative search operation: in response to determining that the feedback information reaches the preset convergence condition, determining the current compressed model as the generated target neural network model.
In the present embodiment, in the apparatus 400 for generating a neural network model: the specific processing and the technical effects of the search unit 4011, the compression unit 4012, the feedback unit 4013 and the determination unit 4014 can refer to the related descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some embodiments, the above apparatus further comprises: a construction unit configured to construct a search space of a preset compression policy; the construction unit comprises: the first screening unit is configured to determine first sensitivity information of each network structure unit in a preset neural network model to a pruning method in a preset pruning method set, and screen out candidate pruning methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset pruning method set based on the first sensitivity information, wherein the first sensitivity information represents a reduction rate of performance of the corresponding neural network model when a pruning strategy of the network structure unit is changed from other pruning strategies to the candidate pruning strategies; the second screening unit is configured to determine second sensitivity information of each network structure unit in the preset neural network model to a quantization method in the preset quantization method set, and screen out candidate quantization methods meeting sensitivity screening conditions corresponding to each network structure unit from the preset quantization method set based on the second sensitivity information, wherein the second sensitivity information represents a reduction rate of performance of the corresponding neural network model when a quantization strategy of the network structure unit is changed from other quantization strategies to the candidate quantization strategies; and the combination unit is configured to combine the candidate pruning method and the candidate quantization method corresponding to each network structure unit so as to construct a search space of the compression strategy corresponding to each corresponding network structure unit.
In some embodiments, the performance of the model includes at least the accuracy of the model and/or the operational delay of the model; and the first sensitivity information includes: when the pruning strategy of the network structure unit is changed into a candidate pruning strategy from other pruning strategies, the precision reduction rate or the running delay increase rate of the corresponding neural network model, or when the pruning strategy of the network structure unit is changed into the candidate pruning strategy from other pruning strategies, the comprehensive performance index is calculated based on the precision and the running delay of the model; the second sensitivity information includes: when the quantization strategy of the network structure unit is changed from other quantization strategies to the candidate quantization strategy, the precision reduction rate or the running delay increase rate of the corresponding neural network model, or when the quantization strategy of the network structure unit is changed from other quantization strategies to the candidate quantization strategy, the reduction rate of the comprehensive performance index of the corresponding neural network model is obtained.
In some embodiments, the above apparatus further comprises: a compression unit configured to obtain the performance of the compressed model as follows: training the current compressed model based on sample data, and testing the performance of the trained compressed model using test data. In some embodiments, the apparatus further comprises: a processing unit configured to send the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end using the target neural network model.
The apparatus provided by the foregoing embodiment of the present application, through the execution unit 401, executes a plurality of iterative search operations, where the iterative search operations include: determining a target compression strategy of a preset neural network model in a search space of a preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combination strategy of pruning and quantization; according to a target compression strategy, pruning and quantifying a preset neural network model to obtain a current compressed model, and acquiring the performance of the current compressed model; generating feedback information based on the performance of the compressed model, updating a preset compression strategy controller based on the feedback information in response to the fact that the feedback information does not reach a preset convergence condition, and executing next iterative search operation based on the updated compression strategy controller; and in response to determining that the feedback information reaches a preset convergence condition, determining the current compressed model as the generated target neural network model. The device can reduce the memory space occupied by the neural network model under the condition of ensuring the precision of the neural network model.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a block diagram of an electronic device for the method of generating a neural network model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic apparatus includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses 505 and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses 505 may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.
Memory 502 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of generating a neural network model provided herein. A non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of generating a neural network model provided herein.
The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method of generating a neural network model in the embodiments of the present application (for example, the execution unit 401, the search unit 4011, the compression unit 4012, the feedback unit 4013, and the determination unit 4014 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 502, that is, implements the method of generating a neural network model in the above method embodiments.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device that generates the neural network model, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 502 optionally includes memory located remotely from processor 501, which may be connected over a network to an electronic device that generates the neural network model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of generating a neural network model may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus 505 or other means, and fig. 5 illustrates an example in which these are connected by the bus 505.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus generating the neural network model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, the memory space occupied by the neural network model is reduced while the accuracy of the neural network model is guaranteed.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of generating a neural network model, comprising performing a plurality of iterative search operations; the iterative search operation comprises the steps of:
determining a target compression strategy of a preset neural network model in a search space of a preset compression strategy by adopting a preset compression strategy controller, wherein the compression strategy comprises a combination strategy of pruning and quantization;
according to the target compression strategy, pruning and quantizing the preset neural network model to obtain a current compressed model and obtain the performance of the current compressed model;
generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing next iterative search operation based on the updated compression strategy controller;
and in response to determining that the feedback information reaches a preset convergence condition, determining the current compressed model as the generated target neural network model.
2. The method of claim 1, wherein the method further comprises:
constructing a search space of the preset compression strategy, including:
determining first sensitivity information of each network structure unit in the preset neural network model with respect to the pruning methods in a preset pruning method set, and screening out, from the preset pruning method set and based on the first sensitivity information, candidate pruning methods that meet a sensitivity screening condition corresponding to each network structure unit, wherein the first sensitivity information characterizes the rate at which the performance of the neural network model degrades when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy;
determining second sensitivity information of each network structure unit in the preset neural network model with respect to the quantization methods in a preset quantization method set, and screening out, from the preset quantization method set and based on the second sensitivity information, candidate quantization methods that meet a sensitivity screening condition corresponding to each network structure unit, wherein the second sensitivity information characterizes the rate at which the performance of the neural network model degrades when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy; and
combining the candidate pruning methods and the candidate quantization methods corresponding to each network structure unit, to construct a search space of the compression strategy corresponding to each network structure unit.
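As an illustration of the search-space construction in claim 2, the Python sketch below assumes that sensitivity is expressed as a scalar degradation rate screened against a simple threshold; the 5% threshold, the callable names and the dictionary layout are assumptions, not details taken from the patent.

from itertools import product

def build_search_space(unit_ids, pruning_methods, quantization_methods,
                       prune_sensitivity, quant_sensitivity,
                       max_degradation=0.05):
    """For each network structure unit, keep only the pruning and
    quantization methods whose performance degradation rate meets the
    screening condition, then combine them into that unit's sub-space."""
    search_space = {}
    for unit in unit_ids:
        # First sensitivity information: degradation rate when this unit's
        # pruning strategy is switched to each candidate pruning method.
        prune_candidates = [m for m in pruning_methods
                            if prune_sensitivity(unit, m) <= max_degradation]
        # Second sensitivity information: the same screening for quantization.
        quant_candidates = [m for m in quantization_methods
                            if quant_sensitivity(unit, m) <= max_degradation]
        # Candidate pruning and quantization methods are combined per unit.
        search_space[unit] = list(product(prune_candidates, quant_candidates))
    return search_space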
3. The method of claim 2, wherein the performance of the model comprises at least one of the accuracy of the model and the running delay of the model; and
the first sensitivity information comprises: the corresponding rate of decrease in the accuracy of the neural network model, or rate of increase in the running delay, when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy; or the corresponding rate of decrease in a comprehensive performance index of the neural network model when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy, wherein the comprehensive performance index is calculated based on the accuracy and the running delay of the model; and
the second sensitivity information comprises: the corresponding rate of decrease in the accuracy of the neural network model, or rate of increase in the running delay, when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy; or the corresponding rate of decrease in the comprehensive performance index of the neural network model when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy.
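The degradation rates and the comprehensive performance index of claim 3 can be written out concretely; the helper functions below are one hedged reading of those quantities, and the linear form and weighting of the comprehensive index are assumptions rather than values given in the patent.

def accuracy_drop_rate(base_accuracy, new_accuracy):
    """Rate at which model accuracy decreases after changing one unit's
    pruning or quantization strategy to a candidate strategy."""
    return (base_accuracy - new_accuracy) / base_accuracy

def delay_increase_rate(base_delay, new_delay):
    """Rate at which the model's running delay increases after the change."""
    return (new_delay - base_delay) / base_delay

def comprehensive_index(accuracy, running_delay, delay_weight=0.5):
    """One possible comprehensive performance index combining accuracy and
    running delay; the weighting factor is a hypothetical choice."""
    return accuracy - delay_weight * running_delay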
4. The method of claim 1, wherein obtaining the performance of the current compressed model comprises:
training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
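Claim 4 amounts to a brief train-then-test evaluation of the compressed model; a minimal sketch, assuming placeholder training and measurement callables, is:

def obtain_performance(compressed_model, sample_data, test_data,
                       train_fn, accuracy_fn, latency_fn):
    """Train the current compressed model on sample data, then test the
    trained model's performance on held-out test data."""
    trained = train_fn(compressed_model, sample_data)
    return {
        "accuracy": accuracy_fn(trained, test_data),
        "running_delay": latency_fn(trained, test_data),
    }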
5. The method of any of claims 1-4, wherein the method further comprises:
sending the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end by using the target neural network model.
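Claim 5 concerns shipping the generated target model to a task execution end for media data processing. A toy sketch, assuming a generic transport callable and using pickle serialization purely for illustration (neither is named in the patent), might look like:

import pickle

def deploy_target_model(target_model, send_fn):
    """Ship the generated target neural network model to a task execution
    end; send_fn stands in for any transport (RPC, file copy, message queue)."""
    send_fn(pickle.dumps(target_model))

def process_media_at_executor(serialized_model, media_items):
    """At the task execution end, restore the model and process media data.
    The restored model is assumed to be callable on a single media item."""
    model = pickle.loads(serialized_model)
    return [model(item) for item in media_items]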
6. An apparatus for generating a neural network model, comprising:
an execution unit configured to perform a plurality of iterative search operations;
the execution unit includes:
a search unit configured to perform the following steps in the iterative search operation: determining, by using a preset compression strategy controller, a target compression strategy for a preset neural network model in a search space of a preset compression strategy, wherein the compression strategy comprises a combined strategy of pruning and quantization;
a compression unit configured to perform the following steps in the iterative search operation: pruning and quantizing the preset neural network model according to the target compression strategy, to obtain a current compressed model and acquire the performance of the current compressed model;
a feedback unit configured to perform the following steps in the iterative search operation: generating feedback information based on the performance of the compressed model, updating the preset compression strategy controller based on the feedback information in response to determining that the feedback information does not reach a preset convergence condition, and executing the next iterative search operation based on the updated compression strategy controller;
a determination unit configured to perform the following steps in the iterative search operation: and in response to determining that the feedback information reaches a preset convergence condition, determining the current compressed model as the generated target neural network model.
7. The apparatus of claim 6, wherein the apparatus further comprises:
a construction unit configured to construct a search space of the preset compression strategy;
the construction unit includes:
a first screening unit configured to determine first sensitivity information of each network structure unit in the preset neural network model with respect to the pruning methods in a preset pruning method set, and to screen out, from the preset pruning method set and based on the first sensitivity information, candidate pruning methods that meet a sensitivity screening condition corresponding to each network structure unit, wherein the first sensitivity information characterizes the rate at which the performance of the neural network model degrades when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy;
a second screening unit configured to determine second sensitivity information of each network structure unit in the preset neural network model with respect to the quantization methods in a preset quantization method set, and to screen out, from the preset quantization method set and based on the second sensitivity information, candidate quantization methods that meet a sensitivity screening condition corresponding to each network structure unit, wherein the second sensitivity information characterizes the rate at which the performance of the neural network model degrades when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy; and
a combining unit configured to combine the candidate pruning methods and the candidate quantization methods corresponding to each network structure unit, to construct a search space of the compression strategy corresponding to each network structure unit.
8. The apparatus of claim 7, wherein the performance of the model comprises at least one of the accuracy of the model and the running delay of the model; and
the first sensitivity information comprises: the corresponding rate of decrease in the accuracy of the neural network model, or rate of increase in the running delay, when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy; or the corresponding rate of decrease in a comprehensive performance index of the neural network model when the pruning strategy of the network structure unit is changed from another pruning strategy to the candidate pruning strategy, wherein the comprehensive performance index is calculated based on the accuracy and the running delay of the model; and
the second sensitivity information comprises: the corresponding rate of decrease in the accuracy of the neural network model, or rate of increase in the running delay, when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy; or the corresponding rate of decrease in the comprehensive performance index of the neural network model when the quantization strategy of the network structure unit is changed from another quantization strategy to the candidate quantization strategy.
9. The apparatus of claim 6, wherein the compression unit is further configured to obtain the performance of the compressed model as follows:
training the current compressed model based on sample data, and testing the performance of the trained compressed model by using test data.
10. The apparatus of any of claims 6-9, wherein the apparatus further comprises:
a processing unit configured to send the target neural network model to a task execution end for media data processing, so that the media data is processed at the task execution end by using the target neural network model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202010503073.8A 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model Active CN111667054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010503073.8A CN111667054B (en) 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model

Publications (2)

Publication Number Publication Date
CN111667054A true CN111667054A (en) 2020-09-15
CN111667054B CN111667054B (en) 2023-09-01

Family

ID=72386527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010503073.8A Active CN111667054B (en) 2020-06-05 2020-06-05 Method, device, electronic equipment and storage medium for generating neural network model

Country Status (1)

Country Link
CN (1) CN111667054B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190205767A1 (en) * 2017-12-29 2019-07-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for compressing neural network
CN110119745A (en) * 2019-04-03 2019-08-13 平安科技(深圳)有限公司 Compression method, device, computer equipment and the storage medium of deep learning model
CN110619385A (en) * 2019-08-31 2019-12-27 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector
CN110852421A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Model generation method and device
CN110852438A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Model generation method and device
CN110969251A (en) * 2019-11-28 2020-04-07 中国科学院自动化研究所 Neural network model quantification method and device based on label-free data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SONG HAN et al.: "DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING", arXiv:1510.00149v5, pages 1-14 *
ZHANG Yaping et al.: "Neural network model compression and implementation based on YOLOv3", Micro-Nano Electronics and Intelligent Manufacturing (微纳电子与智能制造), vol. 2, no. 1, pages 79-84 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149266A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining network model quantization strategy
CN112307207A (en) * 2020-10-31 2021-02-02 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN112307207B (en) * 2020-10-31 2024-05-14 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN112529189A (en) * 2020-11-10 2021-03-19 北京百度网讯科技有限公司 Model compression method and device, electronic equipment and storage medium
CN112379869A (en) * 2020-11-13 2021-02-19 广东电科院能源技术有限责任公司 Standardized development training platform
CN112580804A (en) * 2020-12-23 2021-03-30 中国科学院上海微系统与信息技术研究所 Method and device for determining target image processing model and storage medium
CN112580804B (en) * 2020-12-23 2024-04-05 中国科学院上海微系统与信息技术研究所 Determination method and device for target image processing model and storage medium
CN112686382B (en) * 2020-12-30 2022-05-17 中山大学 Convolution model lightweight method and system
CN112686382A (en) * 2020-12-30 2021-04-20 中山大学 Convolution model lightweight method and system
WO2022166069A1 (en) * 2021-02-03 2022-08-11 上海商汤智能科技有限公司 Deep learning network determination method and apparatus, and electronic device and storage medium
CN112580639A (en) * 2021-03-01 2021-03-30 四川大学 Early gastric cancer image identification method based on evolutionary neural network model compression
CN115037608A (en) * 2021-03-04 2022-09-09 维沃移动通信有限公司 Quantization method, device, equipment and readable storage medium
WO2022213741A1 (en) * 2021-04-09 2022-10-13 Oppo广东移动通信有限公司 Image processing method and apparatus, storage medium, and electronic device
CN113361707A (en) * 2021-05-25 2021-09-07 同济大学 Model compression method, system and computer readable medium
CN113269312A (en) * 2021-06-03 2021-08-17 华南理工大学 Model compression method and system combining quantization and pruning search
CN113469277A (en) * 2021-07-21 2021-10-01 浙江大华技术股份有限公司 Image recognition method and device
CN114239792A (en) * 2021-11-01 2022-03-25 荣耀终端有限公司 Model quantization method, device and storage medium
CN114239792B (en) * 2021-11-01 2023-10-24 荣耀终端有限公司 System, apparatus and storage medium for image processing using quantization model
CN114925821A (en) * 2022-01-05 2022-08-19 华为技术有限公司 Compression method of neural network model and related system
CN114817473A (en) * 2022-05-09 2022-07-29 北京百度网讯科技有限公司 Methods, apparatus, devices, media and products for compressing semantic understanding models
CN115271043A (en) * 2022-07-28 2022-11-01 小米汽车科技有限公司 Model tuning method, model tuning device and storage medium
CN115271043B (en) * 2022-07-28 2023-10-20 小米汽车科技有限公司 Model tuning method, device and storage medium
CN116185307A (en) * 2023-04-24 2023-05-30 之江实验室 Storage method and device of model data, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111667054B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN111667054B (en) Method, device, electronic equipment and storage medium for generating neural network model
CN111667056B (en) Method and apparatus for searching model structures
CN111563593B (en) Training method and device for neural network model
CN111783451A (en) Method and apparatus for enhancing text samples
CN111582479B (en) Distillation method and device for neural network model
KR20210132578A (en) Method, apparatus, device and storage medium for constructing knowledge graph
CN112559870B (en) Multi-model fusion method, device, electronic equipment and storage medium
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN110704509A (en) Data classification method, device, equipment and storage medium
CN111737954A (en) Text similarity determination method, device, equipment and medium
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN111738419A (en) Quantification method and device of neural network model
CN111862987B (en) Speech recognition method and device
CN111738418A (en) Training method and device for hyper network
CN111563592A (en) Neural network model generation method and device based on hyper-network
CN111539224B (en) Pruning method and device of semantic understanding model, electronic equipment and storage medium
CN112149829A (en) Method, device, equipment and storage medium for determining network model pruning strategy
CN111582477A (en) Training method and device of neural network model
CN111639753A (en) Method, apparatus, device and storage medium for training a hyper-network
CN114861886B (en) Quantification method and device of neural network model
CN111460384A (en) Policy evaluation method, device and equipment
CN111652354B (en) Method, apparatus, device and storage medium for training super network
CN111325000B (en) Language generation method and device and electronic equipment
CN112329453A (en) Sample chapter generation method, device, equipment and storage medium
CN112580723B (en) Multi-model fusion method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant