CN113988288A - Network model training method, electronic device and computer-readable storage medium - Google Patents

Network model training method, electronic device and computer-readable storage medium

Info

Publication number
CN113988288A
CN113988288A
Authority
CN
China
Prior art keywords
unit
training
network model
characteristic information
loss value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111183804.6A
Other languages
Chinese (zh)
Inventor
才贺
冯天鹏
郭彦东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd filed Critical Shanghai Jinsheng Communication Technology Co ltd
Priority to CN202111183804.6A priority Critical patent/CN113988288A/en
Publication of CN113988288A publication Critical patent/CN113988288A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of network models and discloses a network model training method, an electronic device, and a computer-readable storage medium. The architecture of the network model is a differential network architecture comprising basic units and compression units, and the method includes the following steps: acquiring the feature information output by the two preceding cascaded units and the training parameters of the previous unit of the same type, the feature information being obtained by training the basic units and/or compression units in the network model on a training sample; and inputting the training parameters and the feature information into the current unit, so that the current unit outputs feature information to the next unit to train the network model. In this way, the search efficiency of each unit can be improved, the search time shortened, and the training time of the network model reduced.

Description

Network model training method, electronic device and computer-readable storage medium
Technical Field
The present application relates to the field of network model technology, and in particular, to a network model training method, an electronic device, and a computer-readable storage medium.
Background
In recent years, the rise of deep learning neural networks has gradually revealed their great potential in a large number of complex tasks such as image recognition and speech processing. Neural networks process data extremely effectively, surpassing many traditional computational methods and becoming the mainstream approach in many fields. How well a neural network performs is closely related to its parameters and weights. Most related neural network model architectures are designed by humans, and the design of a network requires a great deal of research and experimentation to explore the structural effects of different network models.
This gives rise to the demand for automatic network structure search, in which a computer designs different search spaces according to a pre-formulated search strategy and automatically searches for the optimal network structure within a specific search space.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a network model training method, an electronic device, and a computer-readable storage medium that can improve the search efficiency of each unit, shorten the search time, and thereby reduce the network model training time.
In order to solve the above problem, one technical solution adopted by the present application is to provide a method for training a network model, where the architecture of the network model is a differential network architecture comprising a basic unit and a compression unit, and the method includes: acquiring the feature information output by the two preceding cascaded units and the training parameters of the previous unit of the same type, the feature information being obtained by training the basic unit and/or the compression unit in the network model on a training sample; and inputting the training parameters and the feature information into the current unit, so that the current unit outputs feature information to the next unit to train the network model.
In order to solve the above problem, another technical solution adopted by the present application is to provide an electronic device, which includes a processor and a memory coupled to the processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the method provided in the above technical solution.
In order to solve the above problem, another technical solution adopted by the present application is to provide a computer-readable storage medium, in which a computer program is stored, and the computer program, when being executed by a processor, implements the method provided by the above technical solution.
The beneficial effect of this application is: different from the prior art, the training method provided by the application trains the current unit with the feature information output by the two preceding cascaded units and the training parameters of the previous unit of the same type. The current unit can therefore rapidly learn the information of the previous unit of the same type, which improves the precision of the network model; and because training is carried out on the basis of the training parameters of the previous unit of the same type, the search efficiency of each unit is improved, the search time is shortened, and the training time of the network model is further reduced.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a method for training a network model provided in the present application;
FIG. 2 is a schematic structural diagram of a differential network architecture provided herein;
FIG. 3 is a schematic diagram of the structure of a single unit in the differential network architecture provided in the present application;
FIG. 4 is a schematic flow chart diagram illustrating a method for training a network model according to another embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating an embodiment of step 41 provided herein;
FIG. 6 is a flowchart illustrating an embodiment of step 412 provided herein;
FIG. 7 is a schematic flow chart diagram illustrating one embodiment of step 42 provided herein;
FIG. 8 is a schematic flow chart diagram illustrating one embodiment of step 422 provided herein;
FIG. 9 is a schematic diagram of an application scenario of a network model training method provided in the present application;
FIG. 10 is a schematic flow chart diagram illustrating a method for training a network model according to another embodiment of the present disclosure;
FIG. 11 is a schematic flow chart diagram illustrating a method for training a network model according to another embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of an embodiment of an electronic device provided in the present application;
FIG. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided herein.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Related mainstream automatic network architecture search methods include automatic search methods based on reinforcement learning, search methods based on evolutionary algorithms, differentiable search methods, and the like. The earliest network architecture search methods were designed on the basis of reinforcement learning, searching and optimizing the network structure according to the output feedback of the network. This approach can take a significant amount of search time: reinforcement-learning-based NASNet requires about 2000 GPU-days of search time on the CIFAR-10 dataset. Many subsequent network architecture search methods, such as those based on evolutionary algorithms and those based on the differential approach, changed the search strategy and method accordingly to meet different task requirements and improve the model effect, but still require a large amount of search time; for example, AmoebaNet, based on an evolutionary algorithm, requires 3150 GPU-days of search time on CIFAR-10, and DARTS (Differentiable Architecture Search), based on a differential method, requires 1.5-4 GPU-days of search time on CIFAR-10.
On this basis, the present application provides the following technical solution to address the problem that training a network model through network architecture search takes too long.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of a network model training method provided in the present application. The architecture of the network model is a differential network architecture, and the differential network architecture comprises a basic unit and a compression unit. The method comprises the following steps:
step 11: and acquiring the characteristic information output by the first two cascade units and the training parameters of the previous unit with the same type.
Wherein, the characteristic information is obtained by training a basic unit and/or a compression unit in the network model based on the training sample.
Wherein the training samples may be an image-based training set. The images of the training set are labeled with real labels of image content.
Step 12: and inputting the training parameters and the characteristic information into the current unit so that the current unit outputs the characteristic information to the next unit to train the network model.
Because the previous unit of the same type has already been trained, more training parameters, such as prediction labels, node connection relations, and weights, have been obtained. The current unit therefore starts its training with these parameters, which is equivalent to having a larger library of labels and of information. As a result, the training time can be reduced and the training precision improved.
In some embodiments, the training parameters may be extracted from the previous unit of the same type by means of self-distillation. For example, when the previous unit of the same type outputs its feature information, the training parameters of that unit, such as the prediction label, the node connection relations, and the weights, are extracted from it by self-distillation.
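As an illustration only, a minimal sketch of such a cell and of the parameter hand-off is given below; the class name, the 14 × 8 shape of the architecture weights (14 edges for 4 intermediate nodes with 2, 3, 4 and 5 predecessors, 8 candidate operations), and the single convolution standing in for the cell's internal operations are assumptions of the sketch, not details taken from this disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Cell(nn.Module):
    """Hypothetical basic/compression cell of the differential network architecture."""
    def __init__(self, cell_type: str, channels: int):
        super().__init__()
        self.cell_type = cell_type                       # "basic" or "compression"
        stride = 2 if cell_type == "compression" else 1  # compression halves the feature map
        self.op = nn.Conv2d(channels, channels, 3, stride=stride, padding=1)
        # architecture weights: 14 edges x 8 candidate operations; these are the
        # "training parameters" that a later cell of the same type receives
        self.alpha = nn.Parameter(1e-3 * torch.randn(14, 8))

    def load_distilled(self, prev_same_type: "Cell") -> None:
        # self-distillation hand-off: start from the previous same-type cell's
        # architecture weights instead of a random initialization
        with torch.no_grad():
            self.alpha.copy_(prev_same_type.alpha)

    def forward(self, feat_prev2: torch.Tensor, feat_prev1: torch.Tensor) -> torch.Tensor:
        # the two inputs are the feature maps output by the two preceding cascaded cells;
        # align spatial sizes in case a compression cell sits between them
        if feat_prev2.shape[-2:] != feat_prev1.shape[-2:]:
            feat_prev2 = F.interpolate(feat_prev2, size=feat_prev1.shape[-2:])
        return self.op(feat_prev2) + self.op(feat_prev1)
```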
Referring to fig. 2, a network architecture search for a differential network architecture is illustrated:
in this embodiment, the differential network architecture is divided into 6 basic units and 2 compression units, cascaded in a repeating pattern of 2 basic units followed by 1 compression unit. As shown in fig. 2, if the 6 basic units and 2 compression units are sorted by the serial numbers 1 to 8, then serial numbers 1 and 2 are basic units, serial number 3 is a compression unit, serial numbers 4 and 5 are basic units, serial number 6 is a compression unit, and serial numbers 7 and 8 are basic units.
Each basic unit and each compression unit have two inputs, which can be seen in fig. 3, where fig. 3 is a schematic diagram of the basic structure of a single unit. Wherein, two inputs of the basic unit with the sequence number of 1 are the same training sample. I.e. the training samples can be copied to form two training samples when they are input.
And two inputs of the basic unit with the sequence number of 2 are respectively training samples, and the other input is the characteristic information output by the basic unit with the sequence number of 1. In some embodiments, the training samples may be images, and the feature information output by each unit is a feature map.
The compression unit with sequence number 3 has two inputs, one of which is the feature information output by the basic unit with sequence number 1, and the other of which is the feature information output by the basic unit with sequence number 2.
And two inputs of the basic unit with the sequence number of 4 are respectively the characteristic information output by the basic unit with the sequence number of 2, and the characteristic information output by the compression unit with the sequence number of 3.
And the two inputs of the basic unit with the sequence number 5 are the characteristic information output by the compression unit with the sequence number 3 and the characteristic information output by the basic unit with the sequence number 4.
The compression unit with the sequence number 6 has two inputs, one of which is the characteristic information output by the basic unit with the sequence number 4, and the other of which is the characteristic information output by the basic unit with the sequence number 5.
And two inputs of the basic unit with the sequence number 7, wherein one input is the characteristic information output by the basic unit with the sequence number 5, and the other input is the characteristic information output by the compression unit with the sequence number 6.
And two inputs of the basic unit with the sequence number of 8 are characteristic information output by the compression unit with the sequence number of 6, and the other input is characteristic information output by the basic unit with the sequence number of 7.
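The cascade described above can be sketched as follows, reusing the hypothetical Cell class from the previous sketch; each cell consumes the feature maps of the two preceding cells, and the first cell receives the duplicated training sample:

```python
def build_cells(channels: int = 16) -> list:
    # ordering from fig. 2: 2 basic, 1 compression, 2 basic, 1 compression, 2 basic
    types = ["basic", "basic", "compression", "basic", "basic",
             "compression", "basic", "basic"]
    return [Cell(t, channels) for t in types]

def forward_cascade(cells: list, sample: torch.Tensor) -> torch.Tensor:
    prev2, prev1 = sample, sample          # cell 1: the training sample is duplicated
    for cell in cells:
        out = cell(prev2, prev1)
        prev2, prev1 = prev1, out          # each later cell sees the two preceding outputs
    return out

# usage (illustrative): a batch of 8 three-channel images
# feature = forward_cascade(build_cells(channels=3), torch.randn(8, 3, 32, 32))
```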
Each cell includes 7 nodes, as shown in fig. 3: two input nodes (input 1 and input 2), 4 intermediate nodes, and one output node. The input of each intermediate node is the output of all of its predecessor nodes; for example, the input of node 1 is the output of node 0, and the inputs of node 3 are the outputs of node 0, node 1, and node 2. Both input nodes are connected to all intermediate nodes.
In the training process (i.e., the search process), suitable edges are selected to connect the nodes, where an edge also carries the corresponding operation between the two nodes. Each cell can be seen as a directed acyclic graph with 6 nodes inside (the two input nodes and the four intermediate nodes); the edges between the nodes represent possible operations, such as a 3 × 3 separable convolution, and no specific operation is fixed at initialization. There are 8 optional operations: no operation, 3 × 3 max pooling, 3 × 3 mean pooling, residual (skip) connection, 3 × 3 separable convolution, 5 × 5 separable convolution, 3 × 3 dilated convolution, and 5 × 5 dilated convolution.
The search space is relaxed to be continuous, and each edge is treated as a mixture of all candidate sub-operations (a softmax-weighted superposition).
Joint optimization is then carried out, updating both the edge hyper-parameters over the mixture probabilities of the sub-operations (namely, the architecture search task) and the architecture-independent network parameters.
After the optimization is finished, the sub-operation with the maximum probability is taken directly.
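A minimal sketch of this continuous relaxation is given below (one mixed edge as a softmax-weighted superposition of the 8 candidate operations); the concrete layers standing in for the separable and dilated convolutions are simplified assumptions rather than the exact operations used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zero(nn.Module):
    def forward(self, x):                      # the "no operation" candidate
        return torch.zeros_like(x)

def candidate_ops(c: int) -> nn.ModuleList:
    # the 8 optional operations listed above, in simplified form
    return nn.ModuleList([
        Zero(),                                               # no operation
        nn.MaxPool2d(3, stride=1, padding=1),                 # 3x3 max pooling
        nn.AvgPool2d(3, stride=1, padding=1),                 # 3x3 mean pooling
        nn.Identity(),                                        # residual (skip) connection
        nn.Conv2d(c, c, 3, padding=1, groups=c),              # 3x3 separable conv (simplified)
        nn.Conv2d(c, c, 5, padding=2, groups=c),              # 5x5 separable conv (simplified)
        nn.Conv2d(c, c, 3, padding=2, dilation=2, groups=c),  # 3x3 dilated conv
        nn.Conv2d(c, c, 5, padding=4, dilation=2, groups=c),  # 5x5 dilated conv
    ])

class MixedEdge(nn.Module):
    """One edge between two nodes: a softmax mixture over all candidate operations."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = candidate_ops(channels)
        # edge hyper-parameters, jointly optimized with the network parameters
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)                # mixture probabilities
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def strongest(self) -> int:
        # after optimization, the sub-operation with maximum probability is taken
        return int(self.alpha.argmax())
```

During the joint optimization, the edge hyper-parameters alpha and the operation weights can be updated alternately, for example with two separate optimizers.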
In this method, the network model of the differential network architecture is used to train basic units and compression units with the optimal internal node connection relations and weights. These cells are then connected to form a large network, and a hyper-parameter (layers) controls how many cells are connected; for example, layers = 20 means that 20 cells are connected in series.
Wherein the input and output feature map sizes of the basic unit are kept consistent.
The size of the feature map output by the compression unit is reduced by half compared with the size of the feature map input.
In some embodiments, the basic unit and the compression unit are identical in construction but differ in their operations. For a convolutional network, the inputs of the two input nodes are the outputs of the two preceding layers; for a recurrent network, the inputs are the input of the current layer and the state of the previous layer.
The input of each intermediate node is obtained by summing, over its incoming edges, the corresponding operations applied to the outputs of its predecessor nodes, where a predecessor node refers to a node to which the input of that intermediate node is connected.
The output of the output node is obtained by concatenating (concat) the outputs of all intermediate nodes.
An edge represents an operation (such as a 3 × 3 convolution). In the process of converging to a structure, all edges between every two nodes exist and participate in training, and a weighted average is finally carried out. The weights are the training objects of the network model of the differential network architecture, and the expected result is that the most effective edge obtains the largest weight.
In some embodiments, a set of edges may be normalized by softmax. Each operation corresponds to a weight, i.e., the training parameter mentioned above; these weights are referred to as a weight matrix, and the larger the weight, the more important the corresponding operation is within the set of edges.
Training converges to a weight matrix, and the edges with larger weights in the matrix are the ones whose retention yields a better effect.
During training, the weight matrix is optimized by gradient descent over the search space defined above. Since the weight matrix is trained so that the edges with large weights are to be retained, a process of generating the final unit is required after the differential network structure converges.
For each intermediate node, at most the connections to its two strongest predecessor nodes are retained, and for the edges between every two nodes only the single edge with the largest weight is kept. Assuming a node has three predecessors, the highest-weight edge between each predecessor and the current node represents that predecessor's strength, and the two strongest predecessors are selected; between the selected nodes, only the edge with the largest weight is retained.
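A minimal sketch of this discretization step is given below, assuming a converged 14 × 8 weight matrix as in the earlier sketches; DARTS itself excludes the "no operation" candidate when ranking edges, which is not specified here, so the sketch ranks all candidates:

```python
import torch
import torch.nn.functional as F

def discretize_cell(alpha: torch.Tensor, num_inputs: int = 2, num_intermediate: int = 4):
    """Derive the final cell from a converged weight matrix alpha (one row per edge,
    one column per candidate operation; 14 x 8 for the cell described above).

    For every intermediate node, keep the two strongest predecessor edges, and on
    each kept edge keep only the operation with the largest weight."""
    probs = F.softmax(alpha, dim=-1)                   # normalize each edge's operation weights
    kept, start = [], 0
    for node in range(num_intermediate):
        n_pred = num_inputs + node                     # intermediate node i has 2 + i predecessors
        edge_probs = probs[start:start + n_pred]
        best_op = edge_probs.argmax(dim=-1)            # strongest operation on each edge
        edge_strength = edge_probs.max(dim=-1).values  # that operation's weight = edge strength
        for pred in torch.topk(edge_strength, k=2).indices.tolist():
            kept.append((pred, num_inputs + node, int(best_op[pred])))
        start += n_pred
    return kept  # (predecessor index, intermediate node index, operation index) triples

# usage (illustrative):
# edges = discretize_cell(torch.randn(14, 8))
```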
In this embodiment, when the differential network architecture is used to search for a network model architecture, the current unit is trained with the feature information output by the two preceding cascaded units and the training parameters of the previous unit of the same type, so that the current unit can quickly learn the information of the previous unit of the same type, which in turn improves the accuracy of the network model.
When the model training is performed in the above manner, the following technical solutions may be provided. Specifically, referring to fig. 4, fig. 4 is a schematic flowchart of another embodiment of the network model training method provided in the present application. The method comprises the following steps:
step 41: when the current cell is a basic cell, a first loss value between the basic cell and a previous basic cell is determined.
After the current basic unit has been trained, whether it is close to the previous basic unit can be determined by calculating a loss value, and thus whether the current basic unit has learned the knowledge of the previous basic unit. In the subsequent iterative training process, the basic unit is then quickly drawn towards the previous basic unit.
After the current basic unit is trained by using the training parameters and the feature information, the similarity between the current basic unit and the previous basic unit needs to be determined, so that whether the current basic unit is close to the previous basic unit can be determined.
In some embodiments, referring to fig. 5, step 41 may be the following flow:
step 411: and acquiring first characteristic information output by a previous basic unit and second characteristic information output by the basic unit.
Step 412: a first loss value is determined based on the first characteristic information and the second characteristic information.
In some embodiments, the feature information is a feature map; that is, the first feature information is a first feature map and the second feature information is a second feature map. For some basic units, the first feature map and the second feature map differ in their number of channels, i.e., the two feature maps are not of equal size. Described with reference to fig. 2: the feature map output by the basic unit numbered 4 is reduced by half compared with that output by the basic unit numbered 2, because the basic unit numbered 4 follows the compression unit numbered 3.
For such cases, the present application proposes the following scheme; referring to fig. 6, step 412 may proceed as follows:
step 4121: the second characteristic information is up-sampled.
The upsampling may employ an interpolation algorithm, such as nearest-neighbor interpolation, bilinear interpolation, higher-order interpolation, edge-based image interpolation, or region-based image interpolation. An appropriate algorithm can be selected for upsampling according to actual requirements.
Step 4122: and determining a first loss value by using the first characteristic information and the second characteristic information after the up-sampling.
The loss value may be determined by normalizing the first feature information and the upsampled second feature information by using a softmax function, respectively, to obtain a probability of each classification result in the first feature information and the second feature information.
And comparing the probability of each classification result in the first characteristic information and the second characteristic information to obtain a first loss value.
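A minimal sketch of steps 4121 and 4122 is given below, assuming the two feature maps share the same channel count and differ only in spatial size; bilinear upsampling and a KL-divergence comparison are illustrative choices, since only upsampling, softmax normalization, and a comparison of the resulting probabilities are required:

```python
import torch
import torch.nn.functional as F

def first_loss(prev_feat: torch.Tensor, cur_feat: torch.Tensor) -> torch.Tensor:
    """Loss between the feature map of the previous basic unit and that of the current one."""
    # step 4121: upsample the current unit's feature map to the previous unit's size
    if cur_feat.shape[-2:] != prev_feat.shape[-2:]:
        cur_feat = F.interpolate(cur_feat, size=prev_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
    # step 4122: softmax-normalize both maps to obtain classification probabilities,
    # then compare the two distributions to obtain the first loss value
    p_prev = F.softmax(prev_feat.flatten(1), dim=-1)
    log_p_cur = F.log_softmax(cur_feat.flatten(1), dim=-1)
    return F.kl_div(log_p_cur, p_prev, reduction="batchmean")
```

The second loss value between compression units (steps 4221 and 4222 below) can be computed in the same way.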
Step 42: when the current unit is a compression unit, a second loss value between the compression unit and a previous compression unit is determined.
After the current compression unit has been trained, whether it is close to the previous compression unit can be determined by calculating a loss value, and thus whether the current compression unit has learned the knowledge of the previous compression unit. In the subsequent iterative training process, the compression unit is then quickly drawn towards the previous compression unit.
In some embodiments, referring to fig. 7, step 42 may be the following flow:
step 421: and acquiring third characteristic information output by a previous compression unit and fourth characteristic information output by the compression unit.
Step 422: a second loss value is determined based on the third characteristic information and the fourth characteristic information.
In some embodiments, since the size of the feature map output by a compression unit is reduced by half, the third feature information and the fourth feature information differ in their number of channels, i.e., they are not of equal size. The present application therefore proposes the following scheme; referring to fig. 8, step 422 may proceed as follows:
step 4221: the fourth feature information is upsampled.
The upsampling may employ an interpolation algorithm, such as nearest-neighbor interpolation, bilinear interpolation, higher-order interpolation, edge-based image interpolation, or region-based image interpolation. An appropriate algorithm can be selected for upsampling according to actual requirements.
Step 4222: and determining a second loss value by using the third characteristic information and the up-sampled fourth characteristic information.
The loss value may be determined by normalizing the third feature information and the up-sampled fourth feature information by using a softmax function, respectively, to obtain a probability of each classification result in the third feature information and the fourth feature information.
And comparing the probability of each classification result in the third characteristic information and the fourth characteristic information to obtain a second loss value.
Step 43: training the network model based on the first loss value and the second loss value.
In the iterative training of the network model, by quickly drawing the current unit towards the previous unit of the same type, the current unit can quickly learn the information of that unit, which improves the precision of the network model. Because training is carried out on the basis of the training parameters of the previous unit of the same type, the search efficiency of the current unit is improved, the search time is shortened, and the training time of the network model is further reduced.
It is to be understood that the terms "first," "second," and the like in this application are used for distinguishing between different objects and not for describing a particular order.
In an application scenario, the following is explained with reference to fig. 9:
when training the 6 basic units and 2 compression units of the differential network architecture, a self-distillation mode is also adopted to output the training parameters (i.e. the distillation information in fig. 9) of the previous unit of the same type to the current unit, so that the current unit is trained based on the training parameters.
To this end, a distillation loss is added to the training loss of the entire network model; by means of this loss, the deep units are continuously drawn towards the shallow units, which have a higher degree of training. The mismatch in channel number between shallow and deep cells is aligned by the upsampling method described above. Thus, the distillation loss L_distill is obtained by comparing φ(G(A_m)) with φ(G(B(A_{m+1}))), wherein A_m and A_{m+1} respectively represent the output feature maps of the preceding and the following unit of the same type, φ(·) represents a softmax function, B(·) represents the upsampling method, and G(·) represents the distillation learning function, which pools a feature map over its channels:
G(A_m) = (1/C_m) · Σ_i |A_m,i|^ρ
wherein A_m,i denotes the i-th channel of the feature map A_m, ρ denotes an exponential parameter, and C_m denotes the total number of channels. ρ = 2 can be used; with ρ = 2, training of the network model is more efficient.
The total loss in each round of iterative training of the network model can be expressed as:
L_total = L(y, ŷ) + γ · L_distill
wherein y represents the prediction label, ŷ represents the real label, L(y, ŷ) is the prediction loss between them, γ is the distillation coefficient, and the distillation coefficient can be set according to actual needs.
In the application of FIG. 9, a self-distillation training technique is introduced, which effectively improves the final precision of the searched super-network. Through self-distillation, more information can be provided for the training of the current unit under limited resource constraints, the training degree of each unit is improved, the performance of the trained sub-networks is brought closer to their real performance, and the consistency between the evaluation of a sub-network within the super-network and its real performance is improved. It can be understood that the network model of the untrained differential network architecture is a super-network, and after training, each unit with the best node connection mode selected can serve as part of a sub-network.
Furthermore, the deep units can learn the information of the shallow units more quickly, so that most of the training time is saved: the super-network reaches performance the same as or better than that of the original super-network in a relatively short time, the search efficiency is improved, and the search time is shortened.
Referring to fig. 10, fig. 10 is a schematic flowchart of another embodiment of a network model training method provided in the present application. The method comprises the following steps:
step 101: training samples are obtained.
Step 102: and inputting the training samples into the network model to train the network model.
Step 103: and acquiring a third loss value of the network model.
Step 104: and judging whether the third loss value meets a set threshold value.
When the third loss value satisfies the set threshold, step 105 is executed.
In other embodiments, the accuracy of the network model after each round of training may instead be obtained during the training process, and step 105 is executed when the accuracy of the network model reaches a preset accuracy.
Step 105: and acquiring the characteristic information output by the first two cascade units and the training parameters of the previous unit with the same type.
Step 106: and inputting the training parameters and the characteristic information into the current unit so that the current unit outputs the characteristic information to the next unit to train the network model.
Step 105 and step 106 have the same or similar technical solutions as any of the above embodiments, and are not described herein again.
In the subsequent iterative training process, model training is performed according to the manner of steps 105 and 106 until the network model converges.
In this embodiment, conventional training is performed on the network model in the initial stage, and step 105 is executed only when the loss value satisfies the set threshold. This effectively prevents the accuracy degradation that would occur if the current unit were trained directly with the training parameters and feature information from the start, which would offset the network model too much.
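An illustrative two-phase schedule corresponding to steps 101 to 106 is sketched below; the network interface (a distill flag returning per-unit distillation losses), the threshold value, and the reuse of total_loss from the earlier sketch are assumptions:

```python
import torch.nn.functional as F

def train_two_phase(network, loader, optimizer, loss_threshold: float = 1.0,
                    max_epochs: int = 50):
    """Conventional training first; switch to distillation-assisted training once the
    network model's loss satisfies the set threshold (steps 103-106)."""
    use_distillation = False
    for epoch in range(max_epochs):
        for samples, labels in loader:
            optimizer.zero_grad()
            if use_distillation:
                # steps 105-106: train with feature information and distilled parameters
                pred, distill_losses = network(samples, distill=True)
                loss = total_loss(pred, labels, distill_losses)
            else:
                # steps 101-102: conventional training in the initial stage
                pred = network(samples)
                loss = F.cross_entropy(pred, labels)
            loss.backward()
            optimizer.step()
        # steps 103-104: third loss value checked against the set threshold
        if not use_distillation and loss.item() < loss_threshold:
            use_distillation = True
```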
In network model training, a corresponding way of ending the training needs to be set; for example, a total number of iterations can be set, and the training ends once that number is reached.
In the present application, whether or not to end training is determined in the following manner. Referring to fig. 11, fig. 11 is a schematic flowchart of another embodiment of a training method of a network model provided in the present application. The method comprises the following steps:
step 111: and judging whether the training of the network model meets the convergence condition.
If yes, go to step 112, and if not, go to step 114, continue training according to the scheme of any of the above embodiments.
In network model training, the total number of iterations is usually set to several hundred or more. The training method provided by the application effectively improves the training speed, so the training requirements can be met quickly and iterative training does not have to run for the full count. Therefore, step 111 needs to be performed after each iteration of training.
Step 112: and finishing the training of the network model, and acquiring the corresponding node connection relation and weight in the basic unit and the compression unit.
Step 113: and constructing a network model by using the node connection relation and the weight meeting the preset requirement.
For example, for each intermediate node of each unit, at most the connections to its two strongest predecessor nodes are retained, and for the edges between every two nodes only the single edge with the largest weight is kept. Assuming a node has three predecessors, the highest-weight edge between each predecessor and the current node represents that predecessor's strength, and the two strongest predecessors are selected; between the selected nodes, only the edge with the largest weight is retained.
Step 114: and continuing training.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. The electronic device 120 comprises a processor 121 and a memory 122 coupled to the processor 121, wherein a computer program is stored in the memory 122, and the processor 121 is configured to execute the computer program to implement the following method:
acquiring feature information output by the first two cascade units and training parameters of the previous unit of the same type; the characteristic information is obtained by training a basic unit and/or a compression unit in the network model based on a training sample; and inputting the training parameters and the characteristic information into the current unit so that the current unit outputs the characteristic information to the next unit to train the network model.
Optionally, the processor 121 is further configured to execute a computer program to implement the method of any of the above embodiments, which specifically refers to the corresponding method, and is not described herein again.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application. The computer-readable storage medium 130 stores a computer program 131, the computer program 131, when executed by a processor, implementing the method of:
acquiring feature information output by the first two cascade units and training parameters of the previous unit of the same type; the characteristic information is obtained by training a basic unit and/or a compression unit in the network model based on a training sample; and inputting the training parameters and the characteristic information into the current unit so that the current unit outputs the characteristic information to the next unit to train the network model.
Optionally, the computer program 131, when being executed by the processor, is configured to implement the method according to any of the embodiments, which specifically refers to the corresponding method described above, and is not described herein again.
In summary, when the current unit is trained, it is trained with the feature information output by the two preceding cascaded units and the training parameters of the previous unit of the same type, so that the current unit can rapidly learn the information of the previous unit of the same type, and the accuracy of the network model is thereby improved.
Secondly, because the differential-based network architecture search method alternately updates the network architecture parameters and the neural network weights, the training process is unstable and network performance fluctuates greatly; and because the differential network architecture search technique evaluates the performance of sub-networks on the basis of the super-network, a super-network with a lower degree of training has poor consistency, and the performance of the searched sub-networks is also poor.
Moreover, the search process of the differential-based network architecture search method usually comprises only dozens of rounds, while complete training of a sub-network usually requires hundreds of rounds. Due to the search-time and resource limitations, the super-network and the different sub-networks within it are therefore not sufficiently trained during the search, and in super-network-based evaluation there is a large gap between the performance obtained by a sub-network and its real performance. With any of the technical solutions provided by the application, the training degree of the super-network can be improved under limited resources and the performance-consistency problem between the super-network and the sub-networks can be effectively alleviated, so that a better sub-network structure can be searched effectively.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated units in the other embodiments described above may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (11)

1. A training method of a network model, wherein the architecture of the network model is a differential network architecture, the differential network architecture comprises a base unit and a compression unit, and the method comprises the following steps:
acquiring feature information output by the first two cascade units and training parameters of the previous unit of the same type; wherein the feature information is obtained by training the basic unit and/or the compression unit in the network model based on training samples;
and inputting the training parameters and the characteristic information into a current unit so that the current unit outputs the characteristic information to a next unit to train the network model.
2. The method of claim 1,
the training parameters are extracted from the previous unit of the same type by means of self-distillation.
3. The method of claim 1, further comprising:
determining a first loss value between the base unit and a previous base unit when the current unit is the base unit;
determining a second loss value between the compression unit and a previous compression unit when the current unit is the compression unit;
training the network model based on the first loss value and the second loss value.
4. The method of claim 3,
the determining a first loss value between the base unit and a previous base unit comprises:
acquiring first characteristic information output by the previous basic unit and second characteristic information output by the basic unit;
determining the first loss value based on the first characteristic information and the second characteristic information.
5. The method of claim 4,
the determining the first loss value based on the first characteristic information and the second characteristic information includes:
upsampling the second feature information;
determining the first loss value using the first characteristic information and the second characteristic information after upsampling.
6. The method of claim 3,
the determining a second loss value between the compression unit and a previous compression unit comprises:
acquiring third characteristic information output by the previous compression unit and fourth characteristic information output by the compression unit;
determining the second loss value based on the third characteristic information and the fourth characteristic information.
7. The method of claim 6,
the determining the second loss value based on the third characteristic information and the fourth characteristic information includes:
upsampling the fourth feature information;
determining the second loss value using the third feature information and the up-sampled fourth feature information.
8. The method of claim 1, wherein before obtaining the feature information output by the first two cascaded units and the training parameters of the previous unit of the same type, the method comprises:
obtaining the training sample;
inputting the training samples to the network model to train the network model;
obtaining a third loss value of the network model;
judging whether the third loss value meets a set threshold value or not;
and if so, executing the step of acquiring the feature information output by the first two cascade units and the training parameters of the previous unit with the same type.
9. The method of claim 1, further comprising:
judging whether the training of the network model meets a convergence condition;
if so, finishing the training of the network model, and acquiring the corresponding node connection relation and weight in the basic unit and the compression unit;
and constructing the network model by using the node connection relation meeting the preset requirement and the weight.
10. An electronic device, characterized in that the electronic device comprises a processor and a memory coupled to the processor, in which a computer program is stored, the processor being configured to execute the computer program to implement the method according to any of claims 1-9.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
CN202111183804.6A 2021-10-11 2021-10-11 Network model training method, electronic device and computer-readable storage medium Pending CN113988288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111183804.6A CN113988288A (en) 2021-10-11 2021-10-11 Network model training method, electronic device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111183804.6A CN113988288A (en) 2021-10-11 2021-10-11 Network model training method, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN113988288A true CN113988288A (en) 2022-01-28

Family

ID=79738139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111183804.6A Pending CN113988288A (en) 2021-10-11 2021-10-11 Network model training method, electronic device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113988288A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination