CN117195999A - Differentiable model scaling method and system based on differentiable topk
- Publication number: CN117195999A
- Application number: CN202311175699.0A
- Authority: CN (China)
- Legal status: Pending
Abstract
The present application relates to the technical field of neural networks, and in particular to a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result. By searching the depth and width of the model through the differentiable topk operator, the method reduces the parameter-tuning cost of model scaling and improves the performance of the scaled model.
Description
Technical Field
The embodiments of the present application relate to the technical field of neural networks, and in particular to a differentiable model scaling method and system based on differentiable topk.
Background
As large models demonstrate ever greater capabilities, scaling models has become an important way to enhance performance, and network structure search is an important method for scaling models automatically.
Existing network structure search algorithms fall into two major categories: random search algorithms and gradient-based search algorithms. Random search algorithms can handle a wide variety of search spaces, but their search efficiency is low. Gradient-based search algorithms improve search efficiency to some extent, yet challenges remain in modeling structural hyperparameters differentiably: existing gradient-based algorithms either cannot model structural hyperparameters directly, or the direct model is not differentiable and the gradient must be estimated. Both limitations degrade performance.
Disclosure of Invention
The embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk, which address the low optimization efficiency of network structure search algorithms.
To solve the above technical problems, in a first aspect, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
In some exemplary embodiments, constructing the differentiable topk operator includes: evaluating the importance of the elements; normalizing the importance of the elements; and generating a soft mask based on the learnable pruning ratio and the normalized element importance.
In some exemplary embodiments, the importance of an element is evaluated using Taylor importance; the importance of an element is calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element.
In some exemplary embodiments, normalizing the importance of the elements changes the importance into a uniform distribution between 0 and 1; the normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements. The value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements.
In some exemplary embodiments, the soft mask is generated according to the following formula:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function, a is the learnable pruning ratio, and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
In some exemplary embodiments, modeling and searching the width and depth of a model using learnable parameters includes: based on the differentiable topk operator, modeling networks of different widths and different depths respectively; for width, multiplying the soft mask with the corresponding features, so that a pruned model can be simulated; for depth, adopting a neural network with residual connections and multiplying the soft mask with the residual blocks, so that when a soft mask approaches 0, the corresponding residual block is cut off and the depth is reduced.
In some exemplary embodiments, the resource consumption of the final network is constrained as follows:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
In a second aspect, an embodiment of the present application further provides a differentiable model scaling system based on differentiable topk, comprising: a differentiable topk operator module, a modeling and searching module, and a calculation module connected in sequence; the differentiable topk operator module is used for constructing a differentiable topk operator; the modeling and searching module is used for modeling and searching the width and depth of a neural network with learnable parameters according to the differentiable topk operator to obtain a search result; and the calculation module is used for constraining the resource consumption of the final network according to the resource loss function and the search result.
In some exemplary embodiments, the differentiable topk operator module includes an evaluation unit, a normalization unit, and a soft mask generation unit; the evaluation unit is used for evaluating the importance of the elements; the normalization unit is used for normalizing the importance of the elements; and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
In some exemplary embodiments, the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1; and the modeling and searching module, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and thereby models and searches the width and depth of the neural network.
The technical solutions provided by the embodiments of the present application have at least the following advantages:
The embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
The method and system use differentiable topk to model and search the depth and width of a model. The proposed differentiable topk operator needs only a single learnable parameter to model a structural hyperparameter directly. Moreover, the operator is fully differentiable, so the learnable parameter receives gradients from the task loss. Using differentiable topk, a soft mask is generated from the learnable pruning ratio and the element importance, networks of different depths and widths are simulated, the width and depth of the neural network are modeled and searched, and the resource consumption of the final network is constrained by the resource loss. In addition, the element importance is normalized, which gives it practical meaning.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, which are not to be construed as limiting the embodiments unless specifically indicated otherwise.
FIG. 1 is a flow chart of a differentiable model scaling method based on differentiable topk according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a differentiable model scaling system based on differentiable topk according to an embodiment of the present application;
FIG. 3 is a structural diagram of the differentiable topk according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the forward-inference function image and gradient image of the differentiable topk according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As noted in the background, existing network structure search algorithms suffer from low search efficiency.
Existing structure search algorithms can be divided into two main categories. (1) Random search. Random search typically has two parts, sampling and evaluation: different network structures are first sampled and then compared, so as to obtain a better structure. Such algorithms search inefficiently; they either require a huge search cost or the searched model structures have low accuracy. (2) Gradient-based search. Gradient-based search algorithms improve search efficiency and can quickly find the optimization direction of the learnable structural parameters.
Existing structure search algorithms mainly have the following problems. (1) Indirect modeling. Because network structural hyperparameters do not directly participate in the network's computation, they cannot be modeled directly, so some methods model them indirectly; for example, the width of a layer is modeled by the selection of channels. This causes the search space to expand rapidly into a combinatorial problem, reducing optimization efficiency. (2) Gradient estimation. Other approaches attempt to model the structural hyperparameters of the model directly, but the result is not differentiable, so the learnable hyperparameters can only be optimized by gradient estimation.
Specifically, existing methods that model the width and depth of a model with a single learnable parameter employ non-differentiable operators and update the learnable parameter by gradient estimation, as follows: (1) sample structural parameters in the neighborhood of the learnable structural parameter; (2) compute the loss under the sampled structural parameters; (3) update the learnable parameter according to the loss. The non-differentiable nature of this approach leads to higher search overhead and lower accuracy.
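For concreteness, the three-step gradient-estimation loop can be sketched as follows; this is a paraphrase with illustrative names (eval_loss, sigma, n_samples), not code from any cited method:

```python
import random

def gradient_estimation_step(alpha: float, eval_loss, lr: float = 0.01,
                             sigma: float = 0.05, n_samples: int = 4) -> float:
    """Prior-art style update of a structural parameter alpha:
    (1) sample around alpha, (2) evaluate the loss of each sample,
    (3) move alpha against the estimated gradient. No true gradient is used."""
    base = eval_loss(alpha)
    grad_est = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, sigma)  # (1) sample in a neighborhood of alpha
        grad_est += (eval_loss(alpha + eps) - base) * eps / sigma ** 2  # (2) loss-based estimate
    return alpha - lr * grad_est / n_samples  # (3) update alpha
```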
To solve the above technical problems, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result. The embodiments realize a fully differentiable topk that can model structural hyperparameters directly and optimize them by gradient descent, greatly improving optimization efficiency.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous specific details are set forth in the various embodiments in order to provide a thorough understanding of the present application; the claimed technical solution can nevertheless be realized without these details, and various changes and modifications can be made based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a differentiable model scaling method based on differentiable topk, comprising the steps of:
and S1, constructing a differentiable topk operator.
And step S2, modeling and searching the width and the depth of the neural network by adopting a learnable parameter based on a differentiable topk operator to obtain a search result.
And step S3, restraining the resource consumption of the final network based on the resource loss function and the search result.
The present application provides a differentiable model scaling method based on a proposed differentiable topk operator, where topk refers to selecting the K largest elements from a large set. The topk operator has two properties: (1) it models a structural hyperparameter directly using a single learnable parameter; (2) it is fully differentiable. Based on this operator, the application provides a differentiable model scaling method for searching the depth and width structural hyperparameters of a model and improving its performance. Searching the depth and width of the model through the differentiable topk operator automates model scaling, which helps reduce the parameter-tuning cost of the scaling process and improves the performance of the scaled model. The method realizes a fully differentiable topk, models structural hyperparameters directly, and optimizes them by gradient descent, greatly improving optimization efficiency.
Referring to FIG. 2, an embodiment of the present application also provides a differentiable model scaling system based on differentiable topk, comprising: the system comprises a differentiable topk operator module 101, a modeling and searching module 102 and a calculating module 103 which are connected in sequence; wherein the differentiable topk operator module 101 is configured to construct a differentiable topk operator; the modeling and searching module 102 is configured to model and search the width and depth of the neural network by using a learnable parameter according to the differentiable topk operator, so as to obtain a search result; the calculation module 103 is configured to constrain the resource consumption of the final network according to the resource loss function and the search result.
In some embodiments, the differentiable topk operator module 101 includes an evaluation unit, a normalization unit, and a soft mask generation unit; the evaluation unit is used for evaluating the importance of the elements; the normalization unit is used for normalizing the importance of the elements; and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
In some embodiments, the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1; and the modeling and search module 102, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and thereby models and searches the width and depth of the neural network.
The embodiment of the present application provides a differentiable model scaling method based on differentiable topk, which first constructs a differentiable topk operator. The differentiable topk comprises element importance evaluation, importance normalization, and soft mask generation. The topk operator requires only one learnable parameter to model the structural parameter directly; it is fully differentiable, and the learnable parameter receives gradients from the task loss. The differentiable model scaling method uses differentiable topk to model and search the depth and width structural hyperparameters of the model, and constrains the resource consumption of the final network through the resource loss function.
The present application first proposes a differentiable topk operator, whose architecture is shown in FIG. 3. A parameter a is used to model a structural hyperparameter, where a represents the pruning ratio. For example, if a convolution layer retains k channels out of a maximum of N, then k = round(N × (1 − a)).
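As a small illustration (the function name and values are ours, not from the patent), this mapping from the pruning ratio a to the retained channel count can be written in Python as:

```python
import torch

def kept_channels(a: torch.Tensor, n_max: int) -> int:
    """k = round(N * (1 - a)): number of channels retained out of at most
    n_max for a pruning ratio a in [0, 1]."""
    return int(torch.round(n_max * (1.0 - a.clamp(0.0, 1.0))).item())

a = torch.tensor(0.25)       # prune a quarter of the channels
print(kept_channels(a, 64))  # -> 48
```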
The topk operator comprises three components: 1. element importance evaluation; 2. importance normalization; 3. soft mask generation.
For element importance evaluation: element importance can be measured in many ways, such as the L1 norm; the present application uses Taylor importance to evaluate the importance of an element. A vector element c_i represents the importance of the i-th element, calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element. With this index, the importance of an element can be evaluated easily.
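A minimal PyTorch-style sketch of this moving-average update is given below; it assumes a backward pass has already populated the soft mask's gradient, and the decay value is an illustrative assumption:

```python
import torch

@torch.no_grad()
def update_importance(c: torch.Tensor, m: torch.Tensor,
                      decay: float = 0.99) -> torch.Tensor:
    """One step of formula (1): c <- decay * c + (1 - decay) * m * g,
    where g = m.grad is the gradient of the task loss w.r.t. the soft mask."""
    assert m.grad is not None, "call loss.backward() before updating importance"
    return decay * c + (1.0 - decay) * m.detach() * m.grad
```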
To make soft mask generation easier, the present application first normalizes the element importance so that it becomes a uniform distribution between 0 and 1. The normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements.
After normalization, the absolute value of the element importance acquires practical meaning: the value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements. For example, c'_i = 0.5 means the importance of the element exceeds that of 50% of the elements.
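One way to realize this rank-based normalization (a sketch consistent with formula (2) as reconstructed above):

```python
import torch

def normalize_importance(c: torch.Tensor) -> torch.Tensor:
    """Formula (2): c'_i = Rank(c_i) / N. A double argsort yields each
    element's ascending rank, so the output is (nearly) uniform on [0, 1)
    and a value of 0.5 means the element beats 50% of the others."""
    ranks = torch.argsort(torch.argsort(c))
    return ranks.float() / c.numel()
```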
The present application uses the pruning ratio parameter a and the normalized element importance to jointly generate the soft mask. The formula is as follows:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
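Formula (3) translates directly into code; the sharpness value lam below is an assumption, since the text only states that a larger λ pushes m_i toward 0 or 1:

```python
import torch

def soft_topk_mask(c_norm: torch.Tensor, a: torch.Tensor,
                   lam: float = 30.0) -> torch.Tensor:
    """Formula (3): m_i = Sigmoid(lam * (c'_i - a)). Elements whose normalized
    importance exceeds the pruning ratio a get masks near 1 (kept); the rest
    fall toward 0 (pruned). Gradients flow back to the learnable ratio a."""
    return torch.sigmoid(lam * (c_norm - a))
```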
The forward-inference function image and gradient image of the differentiable topk of the present application are shown in FIG. 4. As can be seen, the differentiable topk models the structural hyperparameter directly using the pruning ratio parameter a, and a obtains its gradient from the soft mask m_i. The soft masks in the fuzzy region (0.05 < m_i < 0.95) are particularly sensitive to gradients.
Based on the differentiable topk operator, the present application can model both the depth and the width of the model. For width, the soft mask is multiplied with the corresponding features, so that a pruned model can be simulated; for depth, a neural network with residual connections is adopted and the soft mask is multiplied with the residual blocks, so that when a soft mask approaches 0, the corresponding residual block is cut off and the depth is reduced.
It should be noted that a layer with N neurons outputs N features and the soft mask likewise has N entries, so the soft mask and the output features correspond one to one. When modeling the width of the model, multiplying the soft mask with the corresponding features simulates the pruned model.
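A sketch of how the two kinds of masks might be applied in a residual network is shown below, assuming NCHW feature maps; the class and argument names are illustrative, and the masks would come from the differentiable topk operator described above:

```python
import torch
import torch.nn as nn

class MaskedResidualBlock(nn.Module):
    """Simulates width and depth scaling with soft masks (a sketch, not the
    patent's reference implementation)."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor, width_mask: torch.Tensor,
                depth_mask: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # Width: one mask entry per output feature map, multiplied in
        # one-to-one correspondence to emulate channel pruning.
        out = out * width_mask.view(1, -1, 1, 1)
        # Depth: a scalar mask on the residual branch; as it approaches 0,
        # the whole block is effectively cut off, reducing depth.
        return x + depth_mask * out
```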
In addition, in order to precisely control the resource consumption of the searched model, the present application introduces an additional resource loss, as shown in the following formula:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
The resource loss can be calculated directly from the pruning ratio parameter a. For example, the parameter count of a fully connected layer with IN input features and OUT output features can be expressed as IN × (1 − a_in) × OUT × (1 − a_out).
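A hedged sketch of the resource term, combining the pruned parameter count above with formula (4); the hinge on a resource budget is one plausible choice of loss_resource, since the text does not pin down its exact form:

```python
import torch

def fc_params(in_features: int, out_features: int,
              a_in: torch.Tensor, a_out: torch.Tensor) -> torch.Tensor:
    """Differentiable parameter count of a pruned fully connected layer:
    IN * (1 - a_in) * OUT * (1 - a_out), following the example above."""
    return in_features * (1.0 - a_in) * out_features * (1.0 - a_out)

def total_loss(loss_task: torch.Tensor, resource: torch.Tensor,
               budget: float, lam_resource: float = 1.0) -> torch.Tensor:
    """Formula (4): loss = loss_task + lambda_resource * loss_resource."""
    loss_resource = torch.relu(resource / budget - 1.0)  # penalize over-budget use
    return loss_task + lam_resource * loss_resource
```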
The method adopts a pipeline similar to differentiable model pruning: the model is first pruned to a specified resource consumption, and the searched model structure is then retrained. The only difference is that the method of the present application does not require pre-training.
The method of the present application uses differentiable topk to search the depth and width of a model, and the differentiable topk itself is the key. First, the differentiable topk models the width and depth of the neural network by taking the learnable pruning ratio and the element importance as input and generating a soft mask. Second, the element importance is normalized, which gives it practical meaning. Third, the soft mask is generated directly from the element importance and the learnable pruning ratio, simulating networks of different depths and widths. Compared with the prior art, the differentiable model scaling method based on differentiable topk is easier to optimize, requires less search cost, and achieves higher final performance.
Experiments and simulations verify the feasibility of the method. Experiments were conducted on a variety of model structures on ImageNet, and the results show that the method significantly improves model performance; the experimental results are listed in Table 1 below.
Table 1. Test results of the present application on different model structures

Model | Top1
---|---
EfficientNet-B0 | 77.1
DMS-EN-B0 (ours) | 78.5
EfficientNet-B1 | 79.1
DMS-EN-B1 (ours) | 80.0
EfficientNet-B2 | 80.1
DMS-EN-B2 (ours) | 81.1
ResNet-50 | 76.5
DMS-ResNet (ours) | 77.7
MobileNetV2 | 72.0
DMS-MobileNetV2 (ours) | 73.0
Deit-Tiny | 74.5
DMS-Deit-Tiny (ours) | 75.1
Swin-Tiny | 81.3
DMS-Swin-Tiny (ours) | 81.5
The experimental results in Table 1 show that the method of the present application brings significant performance improvements across a variety of model structures, including convolutional neural networks and transformers. In Table 1, Top1 denotes the top-1 accuracy of image classification.
The differentiable model scaling method based on differentiable topk can adopt different element importance evaluation approaches. Furthermore, because model structures are diverse, the differentiable topk of the present application can be adapted to many different network layers, such as convolutional layers, fully connected layers, and attention layers. Therefore, modeling different network structures with different element importance evaluation approaches belongs to variations of the present application.
Referring to fig. 5, another embodiment of the present application provides an electronic device, including: at least one processor 110; and a memory 111 communicatively coupled to the at least one processor; the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any one of the method embodiments described above.
Where the memory 111 and the processor 110 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 110 and the memory 111 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 110 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps of the above method embodiments may be implemented by a program stored in a storage medium, the program including several instructions for causing a device (such as a single-chip microcomputer or a chip) or a processor to perform all or part of the steps of the methods described in the embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
By means of the above technical solutions, the embodiments of the present application provide a differentiable model scaling method and system based on differentiable topk. The method comprises the following steps: first, constructing a differentiable topk operator; next, modeling and searching the width and depth of a neural network with learnable parameters based on the differentiable topk operator to obtain a search result; finally, constraining the resource consumption of the final network based on a resource loss function and the search result.
The method and system use differentiable topk to model and search the depth and width of a model. The proposed differentiable topk operator needs only a single learnable parameter to model a structural hyperparameter directly; it is fully differentiable, and the learnable parameter receives gradients from the task loss. Using differentiable topk, a soft mask is generated from the learnable pruning ratio and the element importance, networks of different depths and widths are simulated, the width and depth of the neural network are modeled and searched, and the resource consumption of the final network is constrained by the resource loss. In addition, the element importance is normalized, which gives it practical meaning.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the present application, and that various changes in form and detail may be made to them without departing from the spirit and scope of the application; the scope of the application is therefore defined by the appended claims.
Claims (10)
1. A differentiable model scaling method based on differentiable topk, comprising the steps of:
constructing a differentiable topk operator;
modeling and searching the width and depth of the neural network by adopting a learnable parameter based on the differentiable topk operator to obtain a search result;
and constraining the resource consumption of the final network based on the resource loss function and the search result.
2. The differentiable model scaling method based on differentiable topk of claim 1, wherein constructing the differentiable topk operator comprises:
evaluating the importance of the elements;
normalizing the importance of the elements;
and generating a soft mask based on the learnable pruning ratio and the normalized element importance.
3. The differentiable model scaling method based on differentiable topk of claim 2, wherein the importance of an element is evaluated using Taylor importance;
the importance of an element is calculated as follows:

c_i^t = decay × c_i^{t-1} + (1 − decay) × m_i × g_i  (1)

where c_i represents the importance of the i-th element; t is the training iteration step; decay is the attenuation coefficient; m_i is the soft mask corresponding to c_i; and g_i is the gradient of m_i, so that the product m_i × g_i serves as the index for evaluating the importance of the element.
4. The differentiable model scaling method based on differentiable topk of claim 2, wherein normalizing the importance of the elements changes the importance into a uniform distribution between 0 and 1; the normalization formula is as follows:

c'_i = Rank(c_i) / N  (2)

where c'_i represents the normalized importance of the i-th element, Rank(c_i) is the number of elements whose importance is lower than c_i, and N is the total number of elements; the value of c'_i indicates that the importance of the i-th element exceeds that of c'_i × 100% of the elements.
5. The differentiable model scaling method of claim 2, wherein the soft mask is generated according to the following formula:

m_i = Sigmoid(λ × (c'_i − a))  (3)

where Sigmoid denotes the Sigmoid function, a is the learnable pruning ratio, and λ controls the degree to which the soft mask m_i approaches 0 or 1: the larger λ is, the closer m_i is to 0 or 1.
6. The differentiable model scaling method of claim 1, wherein modeling and searching the width and depth of the model using the learnable parameters comprises:
based on a differentiable topk operator, modeling is carried out on networks with different widths and different depths respectively;
for the width, multiplying the soft mask with the corresponding feature, thereby being able to simulate a pruned model;
for depth, a neural network with residual connection is adopted, soft masks are multiplied by residual blocks, and when the soft masks approach 0, the residual blocks corresponding to the soft masks are cut off, so that the depth is reduced.
7. The differentiable model scaling method based on differentiable topk of claim 1, wherein the resource consumption of the final network is calculated as follows:

loss = loss_task + λ_resource × loss_resource  (4)

where loss_task represents the original task loss; loss_resource represents the resource loss; and λ_resource is the weight that controls the strength of the resource constraint.
8. A differentiable model scaling system based on differentiable topk, comprising: the system comprises a differentiable topk operator module, a modeling and searching module and a calculating module which are connected in sequence; wherein,
the differentiable topk operator module is used for constructing a differentiable topk operator;
the modeling and searching module is used for modeling and searching the width and the depth of the neural network by adopting the learnable parameters according to the differentiable topk operator to obtain a searching result;
and the calculation module is used for constraining the resource consumption of the final network according to the resource loss function and the search result.
9. The differentiable model scaling system based on differentiable topk of claim 8, wherein the differentiable topk operator module comprises an evaluation unit, a normalization unit, and a soft mask generation unit; wherein,
the evaluation unit is used for evaluating the importance of the elements;
the normalization unit is used for normalizing the importance of the elements;
and the soft mask generation unit is used for generating a soft mask according to the learnable pruning ratio and the normalized element importance.
10. The differentiable model scaling system based on differentiable topk of claim 9, wherein the evaluation unit evaluates the importance of the elements using Taylor importance; the normalization unit normalizes the importance of the elements into a uniform distribution between 0 and 1;
and the modeling and searching module, based on the differentiable topk operator, takes the learnable pruning ratio and the normalized element importance as input, generates a soft mask, and models and searches the width and depth of the neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311175699.0A CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differentiable model scaling method and system based on differentiable topk
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311175699.0A CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differentiable model scaling method and system based on differentiable topk
Publications (1)
Publication Number | Publication Date |
---|---|
CN117195999A true CN117195999A (en) | 2023-12-08 |
Family
ID=88984610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311175699.0A Pending CN117195999A (en) | 2023-09-12 | 2023-09-12 | Differential topk-based differential model scaling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117195999A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |