CN117195999A - Differentiable model scaling method and system based on differentiable topk
- Publication number: CN117195999A
- Application number: CN202311175699.0A
- Authority: CN (China)
- Legal status: Pending
Abstract
The present application relates to the technical field of neural networks, and in particular to a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result. By searching the depth and width of the model through the differentiable topk operator, the method reduces the parameter-tuning cost of model scaling and improves the performance of the scaled model.
Description
Technical Field
The embodiments of the present application relate to the technical field of neural networks, and in particular to a differentiable model scaling method and system based on differentiable topk.
Background
As large models demonstrate ever greater capabilities, scaling models has become an important way to enhance performance, and network structure search is an important method for scaling models automatically.
Existing network structure search algorithms fall into two major categories: random search algorithms and gradient-based search algorithms. Random search algorithms can handle a wide variety of search spaces, but their search efficiency is low. Gradient-based search algorithms improve search efficiency to some extent, yet challenges remain in modeling structural hyperparameters differentiably: existing gradient-based algorithms either cannot model structural hyperparameters directly, or the direct model is not differentiable and the gradient must be estimated. Both limitations degrade performance.
Disclosure of Invention
The embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk, which address the low optimization efficiency of network structure search algorithms.
To solve the above technical problems, in a first aspect, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
In some exemplary embodiments, constructing the differentiable topk operator includes: evaluating the importance of the elements; normalizing the importance of the elements; and generating a soft mask based on the learnable pruning ratio and the normalized element importance.
In some exemplary embodiments, the importance of an element is evaluated using Taylor importance; the importance of an element is calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element.
In some exemplary embodiments, normalizing the importance of the elements changes the importance into a uniform distribution between 0 and 1; the normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements. The value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements.
In some exemplary embodiments, the soft mask is generated according to the following formula:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function, a is the learnable pruning ratio, and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
In some exemplary embodiments, modeling and searching the width and depth of a model using learnable parameters includes: based on the differentiable topk operator, modeling networks of different widths and different depths respectively; for width, multiplying the soft mask with the corresponding features, so that a pruned model can be simulated; for depth, adopting a neural network with residual connections and multiplying the soft mask with the residual blocks, so that when a soft mask approaches 0, the corresponding residual block is cut off and the depth is reduced.
In some exemplary embodiments, the resource consumption of the final network is constrained as follows:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
In a second aspect, an embodiment of the present application further provides a differentiable model scaling system based on differentiable topk, comprising: a differentiable topk operator module, a modeling and searching module, and a calculation module connected in sequence; the differentiable topk operator module is used for constructing a differentiable topk operator; the modeling and searching module is used for modeling and searching the width and depth of a neural network with learnable parameters according to the differentiable topk operator to obtain a search result; and the calculation module is used for constraining the resource consumption of the final network according to the resource loss function and the search result.
In some exemplary embodiments, the differentiable topk operator module includes an evaluation unit, a normalization unit, and a soft mask generation unit; the evaluation unit is used for evaluating the importance of the elements; the normalization unit is used for normalizing the importance of the elements; and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
In some exemplary embodiments, the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1; and the modeling and searching module, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and thereby models and searches the width and depth of the neural network.
The technical solutions provided by the embodiments of the present application have at least the following advantages:
The embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
The method and system use differentiable topk to model and search the depth and width of a model. The proposed differentiable topk operator needs only a single learnable parameter to model a structural hyperparameter directly. Moreover, the operator is fully differentiable, so the learnable parameter receives gradients from the task loss. Using differentiable topk, a soft mask is generated from the learnable pruning ratio and the element importance, networks of different depths and widths are simulated, the width and depth of the neural network are modeled and searched, and the resource consumption of the final network is constrained by the resource loss. In addition, the element importance is normalized, which gives it practical meaning.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, which are not to be construed as limiting the embodiments unless specifically indicated otherwise.
FIG. 1 is a flow chart of a differentiable model scaling method based on differentiable topk according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a differentiable model scaling system based on differentiable topk according to an embodiment of the present application;
FIG. 3 is a structural diagram of the differentiable topk according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the forward-inference function image and gradient image of the differentiable topk according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As noted in the background, existing network structure search algorithms suffer from low search efficiency.
Existing structure search algorithms can be divided into two main categories. (1) Random search. Random search typically has two parts, sampling and evaluation: different network structures are first sampled and then compared, so as to obtain a better structure. Such algorithms search inefficiently; they either require a huge search cost or the searched model structures have low accuracy. (2) Gradient-based search. Gradient-based search algorithms improve search efficiency and can quickly find the optimization direction of the learnable structural parameters.
Existing structure search algorithms mainly have the following problems. (1) Indirect modeling. Because network structural hyperparameters do not directly participate in the network's computation, they cannot be modeled directly, so some methods model them indirectly; for example, the width of a layer is modeled by the selection of channels. This causes the search space to expand rapidly into a combinatorial problem, reducing optimization efficiency. (2) Gradient estimation. Other approaches attempt to model the structural hyperparameters of the model directly, but the result is not differentiable, so the learnable hyperparameters can only be optimized by gradient estimation.
Specifically, existing methods that model the width and depth of a model with a single learnable parameter employ non-differentiable operators and update the learnable parameter by gradient estimation, as follows: (1) sample structural parameters in the neighborhood of the learnable structural parameter; (2) compute the loss under the sampled structural parameters; (3) update the learnable parameter according to the loss. The non-differentiable nature of this approach leads to higher search overhead and lower accuracy.
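For concreteness, the three-step gradient-estimation loop can be sketched as follows; this is a paraphrase with illustrative names (eval_loss, sigma, n_samples), not code from any cited method:

```python
import random

def gradient_estimation_step(alpha: float, eval_loss, lr: float = 0.01,
                             sigma: float = 0.05, n_samples: int = 4) -> float:
    """Prior-art style update of a structural parameter alpha:
    (1) sample around alpha, (2) evaluate the loss of each sample,
    (3) move alpha against the estimated gradient. No true gradient is used."""
    base = eval_loss(alpha)
    grad_est = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, sigma)  # (1) sample in a neighborhood of alpha
        grad_est += (eval_loss(alpha + eps) - base) * eps / sigma ** 2  # (2) loss-based estimate
    return alpha - lr * grad_est / n_samples  # (3) update alpha
```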
To solve the above technical problems, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result. The embodiments realize a fully differentiable topk that can model structural hyperparameters directly and optimize them by gradient descent, greatly improving optimization efficiency.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous specific details are set forth in the various embodiments in order to provide a thorough understanding of the present application; the claimed technical solution can nevertheless be realized without these details, and various changes and modifications can be made based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the steps of:
and S1, constructing a differentiable topk operator.
And step S2, modeling and searching the width and the depth of the neural network by adopting a learnable parameter based on a differentiable topk operator to obtain a search result.
And step S3, restraining the resource consumption of the final network based on the resource loss function and the search result.
The present application provides a differentiable model scaling method based on a proposed differentiable topk operator, where topk refers to selecting the K largest elements from a large set. The topk operator has two properties: (1) it models a structural hyperparameter directly using a single learnable parameter; (2) it is fully differentiable. Based on this operator, the application provides a differentiable model scaling method for searching the depth and width structural hyperparameters of a model and improving its performance. Searching the depth and width of the model through the differentiable topk operator automates model scaling, which helps reduce the parameter-tuning cost of the scaling process and improves the performance of the scaled model. The method realizes a fully differentiable topk, models structural hyperparameters directly, and optimizes them by gradient descent, greatly improving optimization efficiency.
Referring to FIG. 2, an embodiment of the present application also provides a differentiable model scaling system based on differentiable topk, comprising: the system comprises a differentiable topk operator module 101, a modeling and searching module 102 and a calculating module 103 which are connected in sequence; wherein the differentiable topk operator module 101 is configured to construct a differentiable topk operator; the modeling and searching module 102 is configured to model and search the width and depth of the neural network by using a learnable parameter according to the differentiable topk operator, so as to obtain a search result; the calculation module 103 is configured to constrain the resource consumption of the final network according to the resource loss function and the search result.
In some embodiments, the differentiable topk operator module 101 includes an evaluation unit, a normalization unit, and a soft mask generation unit; the evaluation unit is used for evaluating the importance of the elements; the normalization unit is used for normalizing the importance of the elements; and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
In some embodiments, the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1; and the modeling and search module 102, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and thereby models and searches the width and depth of the neural network.
The embodiment of the present application provides a differentiable model scaling method based on differentiable topk, which first constructs a differentiable topk operator. The differentiable topk comprises element importance evaluation, importance normalization, and soft mask generation. The topk operator requires only one learnable parameter to model the structural parameter directly; it is fully differentiable, and the learnable parameter receives gradients from the task loss. The differentiable model scaling method uses differentiable topk to model and search the depth and width structural hyperparameters of the model, and constrains the resource consumption of the final network through the resource loss function.
The present application first proposes a differentiable topk operator, whose architecture is shown in FIG. 3. A parameter a is used to model a structural hyperparameter, where a represents the pruning ratio. For example, if a convolution layer retains k channels out of a maximum of N, then k = round(N × (1 − a)).
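As a small illustration (the function name and values are ours, not from the patent), this mapping from the pruning ratio a to the retained channel count can be written in Python as:

```python
import torch

def kept_channels(a: torch.Tensor, n_max: int) -> int:
    """k = round(N * (1 - a)): number of channels retained out of at most
    n_max for a pruning ratio a in [0, 1]."""
    return int(torch.round(n_max * (1.0 - a.clamp(0.0, 1.0))).item())

a = torch.tensor(0.25)       # prune a quarter of the channels
print(kept_channels(a, 64))  # -> 48
```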
The topk operator comprises three components: 1. element importance evaluation; 2. importance normalization; 3. soft mask generation.
For element importance evaluation: element importance can be measured in many ways, such as the L1 norm; the present application uses Taylor importance to evaluate the importance of an element. A vector element c_i represents the importance of the i-th element, calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element. With this index, the importance of an element can be evaluated easily.
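A minimal PyTorch-style sketch of this moving-average update is given below; it assumes a backward pass has already populated the soft mask's gradient, and the decay value is an illustrative assumption:

```python
import torch

@torch.no_grad()
def update_importance(c: torch.Tensor, m: torch.Tensor,
                      decay: float = 0.99) -> torch.Tensor:
    """One step of formula (1): c <- decay * c + (1 - decay) * m * g,
    where g = m.grad is the gradient of the task loss w.r.t. the soft mask."""
    assert m.grad is not None, "call loss.backward() before updating importance"
    return decay * c + (1.0 - decay) * m.detach() * m.grad
```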
To make soft mask generation easier, the present application first normalizes the element importance so that it becomes a uniform distribution between 0 and 1. The normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements.
After normalization, the absolute value of the element importance acquires practical meaning: the value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements. For example, c'_i = 0.5 means the importance of the element exceeds that of 50% of the elements.
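One way to realize this rank-based normalization (a sketch consistent with formula (2) as reconstructed above):

```python
import torch

def normalize_importance(c: torch.Tensor) -> torch.Tensor:
    """Formula (2): c'_i = Rank(c_i) / N. A double argsort yields each
    element's ascending rank, so the output is (nearly) uniform on [0, 1)
    and a value of 0.5 means the element beats 50% of the others."""
    ranks = torch.argsort(torch.argsort(c))
    return ranks.float() / c.numel()
```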
The present application uses the pruning ratio parameter a and the normalized element importance to jointly generate the soft mask. The formula is as follows:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
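Formula (3) translates directly into code; the sharpness value lam below is an assumption, since the text only states that a larger λ pushes m_i toward 0 or 1:

```python
import torch

def soft_topk_mask(c_norm: torch.Tensor, a: torch.Tensor,
                   lam: float = 30.0) -> torch.Tensor:
    """Formula (3): m_i = Sigmoid(lam * (c'_i - a)). Elements whose normalized
    importance exceeds the pruning ratio a get masks near 1 (kept); the rest
    fall toward 0 (pruned). Gradients flow back to the learnable ratio a."""
    return torch.sigmoid(lam * (c_norm - a))
```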
The forward-inference function image and gradient image of the differentiable topk of the present application are shown in FIG. 4. As can be seen, the differentiable topk models the structural hyperparameter directly using the pruning ratio parameter a, and a obtains its gradient from the soft mask m_i. The soft masks in the fuzzy region (0.05 < m_i < 0.95) are particularly sensitive to gradients.
Based on the differentiable topk operator, the present application can model both the depth and the width of the model. For width, the soft mask is multiplied with the corresponding features, so that a pruned model can be simulated; for depth, a neural network with residual connections is adopted and the soft mask is multiplied with the residual blocks, so that when a soft mask approaches 0, the corresponding residual block is cut off and the depth is reduced.
It should be noted that a layer with N neurons outputs N features and the soft mask likewise has N entries, so the soft mask and the output features correspond one to one. When modeling the width of the model, multiplying the soft mask with the corresponding features simulates the pruned model.
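A sketch of how the two kinds of masks might be applied in a residual network is shown below, assuming NCHW feature maps; the class and argument names are illustrative, and the masks would come from the differentiable topk operator described above:

```python
import torch
import torch.nn as nn

class MaskedResidualBlock(nn.Module):
    """Simulates width and depth scaling with soft masks (a sketch, not the
    patent's reference implementation)."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor, width_mask: torch.Tensor,
                depth_mask: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # Width: one mask entry per output feature map, multiplied in
        # one-to-one correspondence to emulate channel pruning.
        out = out * width_mask.view(1, -1, 1, 1)
        # Depth: a scalar mask on the residual branch; as it approaches 0,
        # the whole block is effectively cut off, reducing depth.
        return x + depth_mask * out
```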
In addition, in order to precisely control the resource consumption of the searched model, the present application introduces an additional resource loss, as shown in the following formula:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
The resource loss can be calculated directly from the pruning ratio parameter a. For example, the parameter count of a fully connected layer with IN input features and OUT output features can be expressed as IN × (1 − a_in) × OUT × (1 − a_out).
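A hedged sketch of the resource term, combining the pruned parameter count above with formula (4); the hinge on a resource budget is one plausible choice of loss_resource, since the text does not pin down its exact form:

```python
import torch

def fc_params(in_features: int, out_features: int,
              a_in: torch.Tensor, a_out: torch.Tensor) -> torch.Tensor:
    """Differentiable parameter count of a pruned fully connected layer:
    IN * (1 - a_in) * OUT * (1 - a_out), following the example above."""
    return in_features * (1.0 - a_in) * out_features * (1.0 - a_out)

def total_loss(loss_task: torch.Tensor, resource: torch.Tensor,
               budget: float, lam_resource: float = 1.0) -> torch.Tensor:
    """Formula (4): loss = loss_task + lambda_resource * loss_resource."""
    loss_resource = torch.relu(resource / budget - 1.0)  # penalize over-budget use
    return loss_task + lam_resource * loss_resource
```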
The method adopts a pipeline similar to differentiable model pruning: the model is first pruned to a specified resource consumption, and the searched model structure is then retrained. The only difference is that the method of the present application does not require pre-training.
The method of the present application uses differentiable topk to search the depth and width of a model, and the differentiable topk itself is the key. First, the differentiable topk models the width and depth of the neural network by taking the learnable pruning ratio and the element importance as input and generating a soft mask. Second, the element importance is normalized, which gives it practical meaning. Third, the soft mask is generated directly from the element importance and the learnable pruning ratio, simulating networks of different depths and widths. Compared with the prior art, the differentiable model scaling method based on differentiable topk is easier to optimize, requires less search cost, and achieves higher final performance.
Experiments and simulations verify the feasibility of the method. Experiments were conducted on a variety of model structures on ImageNet, and the results show that the method significantly improves model performance; the experimental results are listed in Table 1 below.
Table 1. Test results of the present application on different model structures

Model | Top1
---|---
EfficientNet-B0 | 77.1
DMS-EN-B0 (ours) | 78.5
EfficientNet-B1 | 79.1
DMS-EN-B1 (ours) | 80.0
EfficientNet-B2 | 80.1
DMS-EN-B2 (ours) | 81.1
ResNet-50 | 76.5
DMS-ResNet (ours) | 77.7
MobileNetV2 | 72.0
DMS-MobileNetV2 (ours) | 73.0
Deit-Tiny | 74.5
DMS-Deit-Tiny (ours) | 75.1
Swin-Tiny | 81.3
DMS-Swin-Tiny (ours) | 81.5
The experimental results in Table 1 show that the method of the present application brings significant performance improvements across a variety of model structures, including convolutional neural networks and transformers. In Table 1, Top1 denotes the top-1 accuracy of image classification.
The differentiable model scaling method based on differentiable topk can adopt different element importance evaluation approaches. Furthermore, because model structures are diverse, the differentiable topk of the present application can be adapted to many different network layers, such as convolutional layers, fully connected layers, and attention layers. Therefore, modeling different network structures with different element importance evaluation approaches belongs to variations of the present application.
Referring to fig. 5, another embodiment of the present application provides an electronic device, including: at least one processor 110; and a memory 111 communicatively coupled to the at least one processor; the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any one of the method embodiments described above.
Where the memory 111 and the processor 110 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 110 and the memory 111 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 110 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps of the above method embodiments may be implemented by a program stored in a storage medium, the program including several instructions for causing a device (such as a single-chip microcomputer or a chip) or a processor to perform all or part of the steps of the methods described in the embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
By means of the above technical solutions, the embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
The method and system use differentiable topk to model and search the depth and width of a model. The proposed differentiable topk operator needs only a single learnable parameter to model a structural hyperparameter directly; it is fully differentiable, and the learnable parameter receives gradients from the task loss. Using differentiable topk, a soft mask is generated from the learnable pruning ratio and the element importance, networks of different depths and widths are simulated, the width and depth of the neural network are modeled and searched, and the resource consumption of the final network is constrained by the resource loss. In addition, the element importance is normalized, which gives it practical meaning.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the present application, and that various changes in form and detail may be made to them without departing from the spirit and scope of the application; the scope of the application is therefore defined by the appended claims.
Claims (10)
1. A differentiable model scaling method based on differentiable topk, comprising the steps of:
constructing a differentiable topk operator;
modeling and searching the width and depth of the neural network by adopting a learnable parameter based on the differentiable topk operator to obtain a search result;
and constraining the resource consumption of the final network based on the resource loss function and the search result.
2. The differentiable model scaling method based on differentiable topk of claim 1, wherein constructing the differentiable topk operator comprises:
evaluating the importance of the elements;
normalizing the importance of the elements;
and generating a soft mask based on the learnable pruning ratio and the normalized element importance.
3. The differentiable model scaling method based on differentiable topk of claim 2, wherein the importance of an element is evaluated using Taylor importance;
the importance of an element is calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element.
4. The differentiable model scaling method based on differentiable topk of claim 2, wherein normalizing the importance of the elements changes the importance into a uniform distribution between 0 and 1; the normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements; the value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements.
5. The differentiable model scaling method of claim 2, wherein the soft mask is generated according to the following formula:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function, a is the learnable pruning ratio, and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
6. The differentiable model scaling method of claim 1, wherein modeling and searching the width and depth of the model using the learnable parameters comprises:
based on a differentiable topk operator, modeling is carried out on networks with different widths and different depths respectively;
for the width, multiplying the soft mask with the corresponding feature, thereby being able to simulate a pruned model;
for depth, a neural network with residual connection is adopted, soft masks are multiplied by residual blocks, and when the soft masks approach 0, the residual blocks corresponding to the soft masks are cut off, so that the depth is reduced.
7. The differentiable model scaling method based on differentiable topk of claim 1, wherein the resource consumption of the final network is calculated as follows:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
8. A differentiable model scaling system based on differentiable topk, comprising: the system comprises a differentiable topk operator module, a modeling and searching module and a calculating module which are connected in sequence; wherein,
the differentiable topk operator module is used for constructing a differentiable topk operator;
the modeling and searching module is used for modeling and searching the width and the depth of the neural network by adopting the learnable parameters according to the differentiable topk operator to obtain a searching result;
and the calculation module is used for constraining the resource consumption of the final network according to the resource loss function and the search result.
9. The differentiable model scaling system based on differentiable topk of claim 8, wherein the differentiable topk operator module comprises an evaluation unit, a normalization unit, and a soft mask generation unit; wherein,
the evaluation unit is used for evaluating the importance of the elements;
the normalization unit is used for normalizing the importance of the elements;
and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
10. The differentiable model scaling system based on differentiable topk of claim 9, wherein the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1;
and the modeling and searching module, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and models and searches the width and depth of the neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311175699.0A CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differentiable model scaling method and system based on differentiable topk
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311175699.0A CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differentiable model scaling method and system based on differentiable topk
Publications (1)
Publication Number | Publication Date |
---|---|
CN117195999A true CN117195999A (en) | 2023-12-08 |
Family
ID=88984610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311175699.0A Pending CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differential topk-based differential model scaling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117195999A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |