WO2023212997A1 - Knowledge distillation based neural network training method, device, and storage medium - Google Patents

Knowledge distillation based neural network training method, device, and storage medium Download PDF

Info

Publication number
WO2023212997A1
Authority
WO
WIPO (PCT)
Prior art keywords
student
teacher
network model
features
loss function
Application number
PCT/CN2022/098769
Other languages
French (fr)
Chinese (zh)
Inventor
崔岩
常青玲
任飞
徐世廷
Original Assignee
五邑大学
广东四维看看智能设备有限公司
中德(珠海)人工智能研究院有限公司
珠海市四维时代网络科技有限公司
Priority claimed from CN202210646401.9A (publication CN114936605A)
Application filed by 五邑大学, 广东四维看看智能设备有限公司, 中德(珠海)人工智能研究院有限公司, 珠海市四维时代网络科技有限公司
Publication of WO2023212997A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Definitions

  • the invention relates to the field of artificial intelligence, and in particular to a neural network training method, equipment and storage medium based on knowledge distillation.
  • three-dimensional plane restoration and reconstruction technology is one of the current research tasks in the field of computer vision.
  • three-dimensional plane recovery from a single image requires segmenting the plane instance regions of the scene in the image domain while simultaneously estimating the plane parameters of each instance region; non-planar regions are represented by the depth estimated by the network model.
  • three-dimensional plane recovery and reconstruction technology has broad application prospects in virtual reality, augmented reality, robotics and other fields.
  • Plane restoration and reconstruction is an important research direction in three-dimensional plane restoration and reconstruction.
  • three-dimensional plane restoration and reconstruction methods focus on reconstruction accuracy and enhance the accuracy of neural network models by analyzing the edges of plane structures and their embedding with the scene.
  • the neural network model used for plane restoration and reconstruction has the problem of losing scene structure information and lacking data generalization.
  • the present invention aims to solve at least one of the technical problems existing in the prior art.
  • the present invention provides a neural network training method, equipment and storage medium based on knowledge distillation, which can improve the scene information acquisition capability and data generalization performance of the neural network model.
  • the first embodiment of the present invention provides a neural network training method based on knowledge distillation, which includes the following steps:
  • distillation loss function group includes the encoding loss function, the decoding loss function and the prediction result loss function
  • the student network model is trained according to the distillation loss function group to obtain the trained student network model.
  • a distillation loss function group composed of multiple loss functions is obtained according to the teacher network model and the student network model, and the student network model is then trained with this group, which ensures the reliability of each processing stage in the student network model, thereby effectively improving the accuracy of the student network model's prediction results.
  • a distillation loss function group is obtained, including:
  • the training samples are input into the student network model to obtain the student feature group, including:
  • the training samples are input into the teacher network model to obtain the teacher feature group, including:
  • the teacher network model includes a teacher backbone network model, a teacher encoder and a teacher decoder.
  • the teacher backbone network model outputs teacher encoding features, the teacher encoder outputs teacher decoding features, and the teacher decoder outputs teacher prediction result features.
  • a distillation loss function group is obtained, including:
  • the coding loss function is obtained
  • the decoding loss function is obtained
  • the prediction result loss function is obtained.
  • the student network model includes a student encoder and a student decoder
  • the fused feature layer is obtained
  • the student decoding features and the fusion feature layer are input to the student decoder, where they are first fused and then up-sampled and decoded to obtain the student prediction result features.
  • a fused feature layer is obtained, including:
  • the fused feature layer at each scale is obtained.
  • the student decoding features and the fusion feature layer are input to the student decoder, and are first fused and then upsampled and decoded to obtain student prediction result features, including:
  • the student decoding features are input to the student decoder for upsampling decoding, and each fusion feature layer is fused with the upsampled intermediate feature map at the corresponding scale during the upsampling decoding process of the student decoder to obtain the student decoding features.
  • a second embodiment of the present invention provides an electronic device, including:
  • a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the knowledge distillation based neural network training method of any one of the first aspect is implemented.
  • since the electronic device of the embodiment of the second aspect applies the knowledge distillation based neural network training method of any one of the first aspect, it has all the beneficial effects of the first aspect of the present invention.
  • a computer storage medium provided according to an embodiment of the third aspect of the present invention stores computer-executable instructions, and the computer-executable instructions are used to execute any one of the knowledge distillation-based neural network training methods of the first aspect.
  • since the computer storage medium of the embodiment of the third aspect can execute the knowledge distillation based neural network training method of any one of the first aspect, it has all the beneficial effects of the first aspect of the present invention.
  • Figure 1 is a main step diagram of the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 2 is a specific step diagram of step S2000 in the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 3 is a specific step diagram of step S2100 in the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 4 is a specific step diagram of step S2300 in the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 5 is a working principle diagram of the student network model in the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 6 is a working principle diagram of the teacher network model in the neural network training method based on knowledge distillation according to the embodiment of the present invention
  • Figure 7 is a working principle diagram of the neural network training method based on knowledge distillation according to the embodiment of the present invention.
  • Three-dimensional plane restoration and reconstruction technology is one of the current research tasks in the field of computer vision.
  • three-dimensional plane recovery from a single image needs to segment the plane instance regions of the scene in the image domain and estimate the plane parameters of each instance region.
  • non-planar regions are represented by the depth estimated by the network model.
  • three-dimensional plane recovery and reconstruction technology has broad application prospects in virtual reality, augmented reality, robotics and other fields.
  • the plane detection and restoration method of a single image requires simultaneous research on image depth, plane normal, plane segmentation, etc.
  • 3D reconstruction methods mainly generate point cloud data through 3D vision methods, then generate nonlinear scene surfaces by fitting relevant points, and then optimize the overall reconstruction model through global reasoning.
  • planar restoration and reconstruction combines visual instance segmentation methods: it identifies the plane regions of the scene and represents each plane with three parameters in Cartesian coordinates plus a segmentation mask, which gives better reconstruction accuracy and effect.
  • Three-dimensional reconstruction is achieved through segmented plane restoration and reconstruction. Segmented plane restoration and reconstruction is a multi-stage reconstruction method, and the accuracy of plane identification and parameter estimation will affect the results of the final model.
  • plane prediction and reconstruction can be achieved through a variety of methods: the convolutional neural network architecture PlaneNet can infer a fixed number of plane instance masks and plane parameters from a single RGB image; a fixed number of planes can also be predicted by learning directly from a depth modality induced by the plane structure; the two-stage Mask R-CNN framework replaces object category classification with plane geometry prediction and then refines the plane segmentation masks with a convolutional neural network; pixel-wise plane parameters can also be predicted with an associative embedding method that trains the network to map each pixel into an embedding space and then clusters the embedded pixels into plane instances; and a plane refinement method constrained by the Manhattan-world assumption strengthens the refinement of plane parameters by limiting the set relationships between plane instances.
  • a divide-and-conquer method segments panorama planes from the vertical and horizontal directions and can recover distorted plane instances; the transformer-based method PlaneTR adds plane instance center and edge features, which can effectively improve the efficiency of plane detection.
  • PlaneNet is a deep neural network used for piecewise planar depth map reconstruction from a single RGB image.
  • Mask R-CNN is a neural network framework.
  • the transformer module is a model based on a multi-head attention mechanism.
  • PlaneTR is a model used to extract 3D plane features from a scene.
  • Plane restoration and reconstruction is an important research direction in 3D plane restoration and reconstruction.
  • most current 3D reconstruction methods first generate point cloud data through 3D vision methods, then generate nonlinear scene surfaces by fitting the relevant points, and finally optimize the overall reconstruction model through global reasoning.
  • in related technologies, three-dimensional plane restoration and reconstruction methods focus on reconstruction accuracy and enhance model accuracy by analyzing the edges of the plane structure and its embedding in the scene.
  • however, the neural networks used for plane restoration and reconstruction suffer from loss of scene structure information and a lack of data generalization.
  • the student network model obtained by the knowledge distillation training of the present invention is used for plane restoration and reconstruction, which can avoid the problems of lost scene structure information and low data generalization.
  • Knowledge distillation is a training framework.
  • the student network model learns by using the softmax output vector of the powerful teacher network model as soft labels.
  • the student network model is generally a lightweight small network model.
  • the distillation process can effectively improve the prediction accuracy of the lightweight network model.
  • however, because there is a capacity gap between the teacher network model and the student network model, hard association of the model prediction results subjects the student network model to negative regularization, that is, overfitting, during distillation, which limits the effectiveness of the distillation process; and because the features extracted by a neural network become more abstract as depth increases, the student network model obtained through knowledge distillation has the problem of low prediction accuracy.
  • an attention map can be derived from the original feature map to express knowledge, knowledge can be transferred by matching probability distributions in the feature space, factors can be introduced as a more understandable intermediate representation, and the activation boundaries of hidden neurons can be used for knowledge transfer; however, the plasticity of knowledge transfer decreases rapidly after the first few training stages, and the effectiveness of knowledge distillation is then reduced.
  • the following describes the knowledge distillation based neural network training method, device and storage medium of the present invention with reference to Figures 1 to 7; they can not only improve the scene structure information acquisition capability and data generalization capability of the obtained student network model, but also improve the prediction accuracy of the student network model.
  • the neural network training method based on knowledge distillation includes the following steps:
  • Step S1000 Construct an untrained student network model and a trained teacher network model
  • Step S2000 Obtain a distillation loss function group based on the training samples, the student network model and the teacher network model, where the distillation loss function group includes the encoding loss function, the decoding loss function and the prediction result loss function;
  • Step S3000 Train the student network model according to the distillation loss function group to obtain a trained student network model, where the trained student network model can be used to implement plane restoration and reconstruction.
  • by using the trained teacher network model to perform knowledge distillation on the student network model, the scene information acquisition capability and data generalization capability of the student network model can be effectively improved, and a distillation loss function group composed of multiple loss functions is obtained according to the teacher network model and the student network model.
  • training the student network model with this distillation loss function group ensures the reliability of each processing stage in the student network model, thereby effectively improving the accuracy of the student network model's prediction results.
  • step S2000 is to obtain a distillation loss function group based on the training samples, the student network model and the teacher network model, including but not limited to the following steps:
  • Step S2100 Input the training samples into the student network model to obtain the student feature group
  • Step S2200 Input the training samples into the teacher network model to obtain the teacher feature group
  • Step S2300 Obtain a distillation loss function group based on the student feature group and the teacher feature group.
  • the student network model includes a student encoder and a student decoder
  • the student feature group includes student encoding features, student decoding features, and student prediction result features
  • step S2100, input the training samples into the student network model to obtain the student feature group, including but not limited to the following steps:
  • Step S2110 Input the training samples into the student network model to obtain a student feature group including student coding features, student decoding features and student prediction result features.
  • by omitting a pretrained student backbone model and composing the student network model of a student encoder and a student decoder, the structure of the student network model can be significantly simplified.
  • the student network model is highly lightweight, so when the trained student network model is used for prediction its prediction speed is effectively improved, and it is fast when used for plane detection and recovery; the intermediate features of the network also have a certain learning potential, and iteratively training the student network model with a distillation loss function group containing three groups of distillation loss functions, that is, through a step-by-step distillation process, helps mitigate the negative impact of hard associations, so the finally obtained trained student network model can satisfy both real-time and high-precision prediction performance.
  • step S2110 the training samples are input into the student network model to obtain a student feature group including student coding features, student decoding features and student prediction result features, including but not limited to the following steps:
  • Step S2111 Input the training samples to the student encoder for downsampling encoding to obtain student encoding features
  • Step S2112 Obtain the fusion feature layer according to the down-sampling encoding process of the student encoder
  • Step S2113 Convolve the student coding features to obtain the student decoding features
  • Step S2114 Input the student decoding features and the fused feature layer to the student decoder.
  • the student decoding features and the fused feature layer are first fused and then upsampled and decoded to obtain student prediction result features.
  • the student encoder performs down-sampling encoding processing on the input data of the training samples.
  • a fast down-sampling strategy can be used to extract and identify features with a large enough receptive field, which can effectively improve the recognition speed.
  • however, the down-sampling operation results in the loss of spatial information, and this lost information cannot be recovered during subsequent processing.
  • the corresponding features are extracted as a feature fusion layer during the downsampling process of the student encoder, which is used to correspond to the upsampling and decoding process of the student decoder.
  • the fusion of features can make corresponding compensation for the spatial information lost during the down-sampling process, and can effectively ensure the reliability of the student prediction result features obtained after up-sampling and decoding by the student decoder.
  • step S2112 the fusion feature layer is obtained according to the down-sampling encoding process of the student encoder, including but not limited to the following steps:
  • the fused feature layer at each scale is obtained.
  • Step S2114 input the student decoding features and the fused feature layer to the student decoder.
  • the student decoding features and the fused feature layer are first fused and then upsampled and decoded to obtain the student prediction result features, including but not limited to the following steps:
  • the final output of the student decoder is the student prediction result features.
  • each fusion feature layer is fused with the up-sampled deep features at the same scale, which gradually restores spatial detail and thereby effectively ensures the reliability of the features output by the student decoder, as sketched below.
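One plausible way to realize this scale-matched fusion is a decoder stage that upsamples by 2x and then adds a channel-aligned skip feature; this is only a sketch of the general technique, not the patent's implementation, and the module and parameter names are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionUpBlock(nn.Module):
    """Upsample 2x, then fuse with the same-scale encoder (fusion feature layer) skip."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.match = nn.Conv2d(skip_ch, out_ch, kernel_size=1)            # align skip channels
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # refine upsampled map

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.reduce(x)
        return x + self.match(skip)  # element-wise fusion at the matching scale
```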
  • step S2200 the training samples are input into the teacher network model to obtain the teacher feature group, including but not limited to the following steps:
  • Step S2210 Input the training samples into the teacher network model to obtain a teacher feature group including teacher coding features, teacher decoding features and teacher prediction result features.
  • the teacher network model includes a teacher backbone network model, a teacher encoder and a teacher decoder.
  • the teacher backbone network model outputs teacher coding features.
  • the teacher coding features are input to the teacher encoder and the teacher decoding features are output.
  • the teacher decoding features are input to the teacher decoder and the teacher prediction result features are output.
  • the teacher feature group includes teacher coding features, teacher decoding features and teacher prediction result features.
  • in the teacher network model, the training samples are input into the teacher backbone network model, which outputs the teacher coding features; the teacher coding features are input into the teacher encoder, which outputs the teacher decoding features; and the teacher decoding features are input into the teacher decoder, which outputs the teacher prediction result features.
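That data flow can be summarized schematically as follows, with `backbone`, `encoder` and `decoder` standing in for the teacher modules described elsewhere in this document (the dictionary keys and module names are assumptions made for illustration):

```python
import torch.nn as nn

class TeacherModel(nn.Module):
    """Schematic teacher pipeline: backbone -> encoder -> decoder."""
    def __init__(self, backbone, encoder, decoder):
        super().__init__()
        self.backbone, self.encoder, self.decoder = backbone, encoder, decoder

    def forward(self, x):
        enc = self.backbone(x)    # teacher coding features
        dec = self.encoder(enc)   # teacher decoding features
        pred = self.decoder(dec)  # teacher prediction result features
        return {"enc": enc, "dec": dec, "pred": pred}
```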
  • step S2300 is to obtain a distillation loss function group based on the student feature group and the teacher feature group, including but not limited to the following steps:
  • Step S2310 Obtain a coding loss function based on the student coding features and the teacher coding features, where the coding loss function is used to correct the downsampling coding of the student coder so that the student coder outputs more accurate student coding features;
  • Step S2320 Obtain a decoding loss function based on the student decoding features and the teacher decoding features, where the decoding loss function is used to correct the convolution before the student decoder to ensure the accuracy of the student decoding features input to the student decoder;
  • Step S2330 Obtain the prediction result loss function based on the student prediction result characteristics and the teacher prediction result characteristics.
  • the prediction result loss function is used to correct the up-sampling decoding of the student decoder so that the student network model outputs more accurate student prediction result features.
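The patent does not spell out the exact form of these three losses; a minimal PyTorch-style sketch, assuming each one is simply a mean-squared error between the corresponding student and teacher features, could look like this (the dictionary keys are illustrative):

```python
import torch.nn.functional as F

def distillation_loss_group(student_feats, teacher_feats):
    """Encoding, decoding and prediction-result losses from matching feature pairs.

    Both arguments are dicts with keys 'enc', 'dec' and 'pred' holding student /
    teacher tensors of matching shapes; MSE is an assumed choice of distance.
    """
    losses = {}
    for key in ("enc", "dec", "pred"):
        # Teacher features act as fixed targets, so no gradient flows into the teacher.
        losses[key] = F.mse_loss(student_feats[key], teacher_feats[key].detach())
    return losses
```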
  • the working principle diagram of the neural network training method based on knowledge distillation is shown in Figure 7.
  • the student network model is iteratively trained through three distillation loss functions corresponding to different network layers; that is, a direct and effective one-to-one matching is achieved between the corresponding levels of the student network model and the teacher network model, which effectively ensures the accuracy of data processing at the corresponding network layers in the student network model.
  • this effectively improves the performance of the student network model from an architectural perspective and effectively ensures the generalization of the student network model and the accuracy of its prediction results.
  • a knowledge distillation architecture with multiple student networks can use a critical-learning-aware KD (knowledge distillation) scheme to ensure the formation of key connections, allowing the students to effectively imitate the teacher's information flow rather than just training a single student; direct and effective one-to-one matching between the corresponding layers of the student network model and the teacher network model can be achieved by adaptively dividing the teacher network model and the student network model into three parts, assigning each part the adaptive parameters of the corresponding network layers, and performing knowledge distillation learning; semantic correction of shallow-network feature associations significantly improves the effectiveness of feature knowledge transfer, and an attention mechanism is used to achieve cross-layer rectification, which can alleviate the problem of semantic mismatch.
  • KD: knowledge distillation.
  • step S3000 is to train the student network model according to the distillation loss function group to obtain the trained student network model, including but not limited to the following steps:
  • the downsampling encoding of the student encoder is corrected according to the encoding loss function, the convolution before the student decoder is corrected according to the decoding loss function, and the upsampling decoding of the student decoder is corrected according to the prediction result loss function.
  • the student network model is trained to obtain a trained student network model.
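As a rough illustration of how such a loss group might drive one training step (reusing the `distillation_loss_group` sketch above; the loss weights and optimizer handling are assumptions, not taken from the patent):

```python
import torch

def train_step(student, teacher, batch, optimizer, weights=(1.0, 1.0, 1.0)):
    teacher.eval()
    with torch.no_grad():
        t_feats = teacher(batch)   # dict with 'enc', 'dec', 'pred' from the teacher
    s_feats = student(batch)       # same keys from the student model
    losses = distillation_loss_group(s_feats, t_feats)
    total = sum(w * losses[k] for w, k in zip(weights, ("enc", "dec", "pred")))
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```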
  • the student network receives auxiliary training according to the distillation loss function group containing multiple intermediate-layer loss functions.
  • the estimation performance of the student network can be enhanced through transfer learning of the intermediate feature layer.
  • the encoding dimension, decoding dimension and prediction result dimension ensure that the teacher network model provides more reliable parameter learning for the student network model despite the capacity gap.
  • the teacher network based on the transformer module can achieve global area detection.
  • the teacher network model is set up based on the transformer module.
  • the HR-Net model is used as the teacher backbone network model for feature extraction, generating high-dimensional, low-scale features that are embedded as blocks.
  • the HR-Net model is a high-resolution network.
  • the size of the block embedding is p.
  • the H×W pixel image is divided into a set of feature block embeddings S_0 ∈ R^D, S_1 ∈ R^D, and so on, where R^D is the feature space output by the teacher backbone network model and the number of feature blocks is HW/p^2; finally, the feature block embeddings are input into the transformer module with a total of 12 layers.
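This corresponds to a standard vision-transformer block (patch) embedding; a hedged sketch, assuming square p x p blocks so that an H x W map yields HW/p^2 embeddings of dimension D:

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a feature map into non-overlapping p x p blocks and embed each into R^D."""
    def __init__(self, in_ch, embed_dim, patch_size):
        super().__init__()
        # A strided convolution both cuts the map into blocks and projects them to D dims.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, H*W/p^2, D) token sequence
```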
  • the teacher network model includes a depth estimation branch.
  • the depth estimation branch uses the multi-scale features of the teacher backbone network model and the teacher coding features as input sources to estimate the image depth through a top-down decoding structure.
  • this structure adopts the upsampling module of bilinear interpolation.
  • the feature map after each upsampling step corresponds to a feature scale of the teacher backbone network model, that is, a 2x upsampling mechanism is implemented to estimate the image depth, and the teacher backbone network model outputs the corresponding feature dimensions.
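A sketch of such a top-down branch, assuming a cascade of 2x bilinear upsampling steps, each followed by a convolution, and ending in a one-channel depth head (the stage count and channel sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthBranch(nn.Module):
    """Top-down depth head: repeated 2x bilinear upsampling with convolutions."""
    def __init__(self, channels, num_stages=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_stages)]
        )
        self.head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        for block in self.blocks:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = block(x)
        return self.head(x)  # per-pixel depth estimate
```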
  • the final output of the teacher network model and the student network model is corrected by the L2 loss function. Since there is no maximum value function in the network, the L2 loss function is used to correct the features before the last activation layer in the corresponding network model. When training the student network model, the distillation loss function group and the L2 loss function can achieve more reliable correction effects. When applied to plane restoration and reconstruction, the prediction accuracy of the trained student network model is higher.
  • the neural network training method based on knowledge distillation of the first aspect of the present invention is described in detail below with a specific embodiment. It should be understood that the following description is only illustrative and does not specifically limit the invention.
  • the teacher network model is designed based on the transformer module.
  • the teacher network model includes a teacher backbone network model, a teacher encoder and a teacher decoder, where the teacher backbone network model uses the HR-Net model.
  • the student network model includes a student encoder and a student decoder. The student network model omits the settings of its backbone network model;
  • mobilenet-v3 is a lightweight network and is used as the feature extractor.
  • the intermediate feature map is sampled to obtain the fusion feature layer.
  • the student encoding features are input to the student decoder for upsampling decoding to obtain the student decoding features;
  • the encoding loss function is obtained.
  • the decoding loss function is obtained.
  • the prediction result loss function is obtained;
  • the downsampling encoding of the student encoder is corrected according to the encoding loss function, the convolution before the student decoder is corrected according to the decoding loss function, and the upsampling decoding of the student decoder is corrected according to the prediction result loss function.
  • the student network model is trained to obtain a trained student network model, which can be used to achieve plane restoration and reconstruction.
  • the training samples are input to the student encoder for down-sampling encoding to obtain the student encoding features; according to each down-sampling intermediate feature map generated during the down-sampling encoding process of the student encoder, the fusion feature layer at each scale is obtained.
  • the fusion feature layers are fused with the up-sampled intermediate feature maps, and the student decoder outputs the student prediction result features.
  • the feature fusion module fuses the same-scale shallow features, that is, the fusion feature layers, with the up-sampled intermediate feature maps during the up-sampling decoding process, at resolutions of 1/32, 1/16, 1/8, 1/4 and 1/2 respectively, which ensures that features of the same scale have the same number of feature channels after each feature fusion.
  • transfer learning is performed separately on the student encoding features, the student decoding features and the student prediction result features.
  • the second embodiment of the present invention also provides an electronic device.
  • the electronic device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • the processor and memory may be connected via a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer executable programs.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor via a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the non-transient software programs and instructions required to implement the neural network training method based on knowledge distillation in the first embodiment are stored in the memory.
  • when these programs and instructions are executed by the processor, the knowledge distillation based neural network training method in the above embodiment is performed.
  • for example, the above-described method steps S1000 to S3000, method steps S2100 to S2300, method step S2110, method steps S2111 to S2114, method step S2210, and method steps S2310 to S2330 are executed.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present invention also provides a computer-readable storage medium that stores computer-executable instructions; when the computer-executable instructions are executed by a processor or controller, for example by the processor in the above device embodiment, they can cause the processor to execute the knowledge distillation based neural network training method in the above embodiment, for example, to execute the above-described method steps S1000 to S3000, method steps S2100 to S2300, method step S2110, method steps S2111 to S2114, method step S2210, and method steps S2310 to S2330.
  • computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

A knowledge distillation based neural network training method, a device, and a storage medium. The method comprises the following steps: constructing an untrained student network model and a trained teacher network model (S1000); according to training samples, said student network model and said teacher network model, obtaining a distillation loss function group (S2000), the loss function group comprising an encoding loss function, a decoding loss function and a prediction result loss function; and according to the distillation loss function group, training said student network model, so as to obtain a trained student network model (S3000). Knowledge distillation is performed on a student network model by means of a trained teacher network model, so that scenario information acquisition capability and data generalization capability of the student network model can be effectively improved; and a distillation loss function group consisting of multiple loss functions is acquired, and then the student network model is trained by means of the distillation loss function group, so that the accuracy of a student network model prediction result can be effectively improved.

Description

Knowledge distillation based neural network training method, device, and storage medium
Technical field
The invention relates to the field of artificial intelligence, and in particular to a knowledge distillation based neural network training method, device, and storage medium.
Background art
With the development of deep learning, three-dimensional plane recovery and reconstruction has become one of the current research tasks in the field of computer vision. Plane recovery from a single image requires segmenting the plane instance regions of the scene in the image domain while simultaneously estimating the plane parameters of each instance region; non-planar regions are represented by the depth estimated by the network model. Three-dimensional plane recovery and reconstruction technology has broad application prospects in virtual reality, augmented reality, robotics and other fields.
Plane restoration and reconstruction is an important research direction within three-dimensional plane recovery and reconstruction. In related technologies, three-dimensional plane restoration and reconstruction methods focus on reconstruction accuracy and enhance the accuracy of neural network models by analyzing the edges of plane structures and their embedding in the scene, but the neural network models used for plane restoration and reconstruction suffer from loss of scene structure information and a lack of data generalization.
Summary of the invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the present invention provides a knowledge distillation based neural network training method, device and storage medium, which can improve the scene information acquisition capability and the data generalization performance of a neural network model.
An embodiment of the first aspect of the present invention provides a knowledge distillation based neural network training method, which includes the following steps:
constructing an untrained student network model and a trained teacher network model;
obtaining a distillation loss function group according to training samples, the student network model and the teacher network model, where the distillation loss function group includes an encoding loss function, a decoding loss function and a prediction result loss function;
training the student network model according to the distillation loss function group to obtain a trained student network model.
The above embodiments of the present invention have at least the following beneficial effects: performing knowledge distillation on the student network model with the trained teacher network model can effectively improve the scene information acquisition capability and data generalization capability of the student network model; moreover, a distillation loss function group composed of multiple loss functions is obtained according to the teacher network model and the student network model, and the student network model is then trained with this group, which ensures the reliability of each processing stage in the student network model and thereby effectively improves the accuracy of the student network model's prediction results.
According to some embodiments of the first aspect of the present invention, obtaining the distillation loss function group according to the training samples, the student network model and the teacher network model includes:
inputting the training samples into the student network model to obtain a student feature group;
inputting the training samples into the teacher network model to obtain a teacher feature group;
obtaining the distillation loss function group according to the student feature group and the teacher feature group.
According to some embodiments of the first aspect of the present invention, inputting the training samples into the student network model to obtain the student feature group includes:
inputting the training samples into the student network model to obtain a student feature group including student encoding features, student decoding features and student prediction result features.
According to some embodiments of the first aspect of the present invention, inputting the training samples into the teacher network model to obtain the teacher feature group includes:
inputting the training samples into the teacher network model to obtain a teacher feature group including teacher encoding features, teacher decoding features and teacher prediction result features, where the teacher network model includes a teacher backbone network model, a teacher encoder and a teacher decoder; the teacher backbone network model outputs the teacher encoding features, the teacher encoder outputs the teacher decoding features, and the teacher decoder outputs the teacher prediction result features.
According to some embodiments of the first aspect of the present invention, obtaining the distillation loss function group according to the student feature group and the teacher feature group includes:
obtaining the encoding loss function according to the student encoding features and the teacher encoding features;
obtaining the decoding loss function according to the student decoding features and the teacher decoding features;
obtaining the prediction result loss function according to the student prediction result features and the teacher prediction result features.
According to some embodiments of the first aspect of the present invention, the student network model includes a student encoder and a student decoder;
inputting the training samples into the student network model to obtain the student feature group including student encoding features, student decoding features and student prediction result features includes:
inputting the training samples into the student encoder for down-sampling encoding to obtain the student encoding features;
obtaining a fusion feature layer according to the down-sampling encoding process of the student encoder;
convolving the student encoding features to obtain the student decoding features;
inputting the student decoding features and the fusion feature layer into the student decoder, where they are first fused and then up-sampled and decoded to obtain the student prediction result features.
According to some embodiments of the first aspect of the present invention, obtaining the fusion feature layer according to the down-sampling encoding process of the student encoder includes:
obtaining the fusion feature layer at each scale according to each down-sampling intermediate feature map formed during the down-sampling encoding process of the student encoder.
According to some embodiments of the first aspect of the present invention, inputting the student decoding features and the fusion feature layer into the student decoder, fusing them first and then up-sampling and decoding to obtain the student prediction result features includes:
inputting the student decoding features into the student decoder for up-sampling decoding, and fusing each fusion feature layer with the up-sampled intermediate feature map at the corresponding scale during the up-sampling decoding process of the student decoder to obtain the student decoding features.
An embodiment of the second aspect of the present invention provides an electronic device, including:
a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the knowledge distillation based neural network training method of any one of the first aspect is implemented.
Since the electronic device of the embodiment of the second aspect applies the knowledge distillation based neural network training method of any one of the first aspect, it has all the beneficial effects of the first aspect of the present invention.
A computer storage medium provided according to an embodiment of the third aspect of the present invention stores computer-executable instructions, and the computer-executable instructions are used to execute the knowledge distillation based neural network training method of any one of the first aspect.
Since the computer storage medium of the embodiment of the third aspect can execute the knowledge distillation based neural network training method of any one of the first aspect, it has all the beneficial effects of the first aspect of the present invention.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description or may be learned by practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:
Figure 1 is a main step diagram of the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 2 is a specific step diagram of step S2000 in the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 3 is a specific step diagram of step S2100 in the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 4 is a specific step diagram of step S2300 in the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 5 is a working principle diagram of the student network model in the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 6 is a working principle diagram of the teacher network model in the knowledge distillation based neural network training method according to an embodiment of the present invention;
Figure 7 is a working principle diagram of the knowledge distillation based neural network training method according to an embodiment of the present invention.
Detailed description
In the description of the present invention, unless otherwise explicitly limited, words such as "setting", "installation" and "connection" should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meaning of these words in the present invention in combination with the specific content of the technical solution. In the description of the present invention, "several" means one or more, "plurality" means two or more, terms such as "greater than", "less than" and "more than" are understood to exclude the stated number, and terms such as "above", "below" and "within" are understood to include the stated number. In addition, features defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the present invention, unless otherwise specified, "plurality" means two or more.
With the development of deep learning, the field of computer vision has attracted the attention of more and more researchers, and three-dimensional plane recovery and reconstruction is one of the current research tasks in the field of computer vision. Plane recovery from a single image requires segmenting the plane instance regions of the scene in the image domain and simultaneously estimating the plane parameters of each instance region; non-planar regions are represented by the depth estimated by the network model. Three-dimensional plane recovery and reconstruction technology has broad application prospects in virtual reality, augmented reality, robotics and other fields. Plane detection and recovery from a single image requires simultaneous research on image depth, plane normals, plane segmentation and so on. Traditional three-dimensional plane recovery and reconstruction methods based on hand-crafted features only extract the shallow texture information of the image and rely on prior conditions of plane geometry, so they suffer from weak generalization ability. In reality, indoor scenes are very complex: multiple shadows produced by complex lighting and various folded occluders affect the quality of plane recovery and reconstruction, making it difficult for traditional methods to cope with plane recovery and reconstruction in complex indoor scenes.
At present, 3D reconstruction methods mainly generate point cloud data through 3D vision methods, then generate nonlinear scene surfaces by fitting the relevant points, and then optimize the overall reconstruction model through global reasoning. Plane restoration and reconstruction combines visual instance segmentation methods to identify the plane regions of the scene and represents each plane with three parameters in Cartesian coordinates plus a segmentation mask, which gives better reconstruction accuracy and effect; three-dimensional reconstruction is then achieved through segmented plane restoration and reconstruction. Segmented plane restoration and reconstruction is a multi-stage reconstruction method, and the accuracy of plane identification and parameter estimation both affect the result of the final model.
In related technologies, plane prediction and reconstruction can be achieved through a variety of methods: the convolutional neural network architecture PlaneNet can infer a fixed number of plane instance masks and plane parameters from a single RGB image; a fixed number of planes can also be predicted by learning directly from a depth modality induced by the plane structure; the two-stage Mask R-CNN framework replaces object category classification with plane geometry prediction and then refines the plane segmentation masks with a convolutional neural network; pixel-wise plane parameters can also be predicted with an associative embedding method that trains the network to map each pixel into an embedding space and then clusters the embedded pixels into plane instances; a plane refinement method constrained by the Manhattan-world assumption strengthens the refinement of plane parameters by limiting the set relationships between plane instances; a divide-and-conquer method segments panorama planes from the vertical and horizontal directions and, given the difference in pixel distribution between panoramas and ordinary images, can recover distorted plane instances; and PlaneTR, a method based on the transformer module, adds plane instance center and edge features, which effectively improves the efficiency of plane detection. PlaneNet is a deep neural network used for piecewise planar depth map reconstruction from a single RGB image, Mask R-CNN is a neural network framework, the transformer module is a model based on a multi-head attention mechanism, and PlaneTR is a model used to extract 3D plane features from a scene.
Plane restoration and reconstruction is an important research direction in 3D plane recovery and reconstruction. Most current 3D reconstruction methods first generate point cloud data through 3D vision methods, then generate nonlinear scene surfaces by fitting the relevant points, and then optimize the overall reconstruction model through global reasoning. In related technologies, three-dimensional plane restoration and reconstruction methods focus on reconstruction accuracy and enhance model accuracy by analyzing the edges of the plane structure and its embedding in the scene, but the neural networks used for plane restoration and reconstruction suffer from loss of scene structure information and a lack of data generalization.
To solve the problems of lost scene information and lack of data generalization in neural networks used for plane restoration and reconstruction, the student network model obtained through the knowledge distillation training of the present invention is used for plane restoration and reconstruction, which can avoid the problems of lost scene structure information and low data generalization.
Knowledge distillation is a training framework in which the student network model learns using the softmax output vector of a powerful teacher network model as soft labels. The student network model is generally a lightweight small network model, and the distillation process can effectively improve the prediction accuracy of the lightweight network model. However, because there is a capacity gap between the teacher network model and the student network model, hard association of the model prediction results subjects the student network model to negative regularization during distillation, that is, overfitting of the network, which limits the effectiveness of the distillation process; moreover, because the features extracted by a neural network become more abstract as depth increases, the student network model obtained through knowledge distillation suffers from low prediction accuracy.
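For reference, the classical soft-label distillation described here is usually implemented as a temperature-scaled KL divergence between teacher and student logits; a common sketch follows (the temperature T and the weighting against other losses are not specified by the patent):

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, T=4.0):
    # Teacher softmax outputs at temperature T serve as the soft labels.
    soft_targets = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
```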
An attention map can be derived from the original feature map to express knowledge, knowledge can be transferred by matching probability distributions in the feature space, factors can be introduced as a more understandable intermediate representation, and the activation boundaries of hidden neurons can be used for knowledge transfer; however, the plasticity of knowledge transfer decreases rapidly after the first few training stages, and the effectiveness of knowledge distillation is then reduced.
The knowledge distillation based neural network training method, device and storage medium of the present invention are described below with reference to Figures 1 to 7; they can not only improve the scene structure information acquisition capability and data generalization capability of the obtained student network model, but also improve the prediction accuracy of the student network model.
Referring to Figure 1, the knowledge distillation based neural network training method according to an embodiment of the first aspect of the present invention includes the following steps:
Step S1000: construct an untrained student network model and a trained teacher network model;
Step S2000: obtain a distillation loss function group according to training samples, the student network model, and the teacher network model, where the distillation loss function group includes an encoding loss function, a decoding loss function, and a prediction result loss function;
Step S3000: train the student network model according to the distillation loss function group to obtain a trained student network model, where the trained student network model can be used to implement plane restoration and reconstruction.
Performing knowledge distillation on the student network model with a trained teacher network model effectively improves the student network model's scene information acquisition capability and data generalization capability. Moreover, obtaining a distillation loss function group composed of multiple loss functions from the teacher network model and the student network model, and then training the student network model with this distillation loss function group, ensures the reliability of each processing stage in the student network model and thereby effectively improves the accuracy of the student network model's predictions.
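As an illustrative sketch only, one possible way to organize such training is a loop in which the frozen teacher and the student process the same batch and the three distillation losses are summed with the ordinary task loss; the tuple-returning model interface, the use of mean-squared error, and the loss weights below are assumptions of the sketch rather than limitations of the method:

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, images, targets, optimizer,
                          task_loss_fn, w_enc=1.0, w_dec=1.0, w_pred=1.0):
        teacher.eval()
        with torch.no_grad():                       # the trained teacher is frozen
            t_enc, t_dec, t_pred = teacher(images)  # teacher feature group
        s_enc, s_dec, s_pred = student(images)      # student feature group

        # Distillation loss function group: encoding, decoding, and prediction result losses.
        distill_loss = (w_enc * F.mse_loss(s_enc, t_enc)
                        + w_dec * F.mse_loss(s_dec, t_dec)
                        + w_pred * F.mse_loss(s_pred, t_pred))
        loss = task_loss_fn(s_pred, targets) + distill_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()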
It can be understood that, referring to Figure 2, step S2000 of obtaining the distillation loss function group according to the training samples, the student network model, and the teacher network model includes, but is not limited to, the following steps:
Step S2100: input the training samples into the student network model to obtain a student feature group;
Step S2200: input the training samples into the teacher network model to obtain a teacher feature group;
Step S2300: obtain the distillation loss function group according to the student feature group and the teacher feature group.
The same set of training samples is input into the student network model and the teacher network model respectively, the features of the two models are extracted by knowledge distillation to construct the distillation loss function group, and the student network model is trained with this distillation loss function group, which perceives multiple layers of the networks. This effectively improves the performance of the trained student network model, giving it high prediction accuracy.
It can be understood that the student network model includes a student encoder and a student decoder, and the student feature group includes student encoding features, student decoding features, and student prediction result features. Step S2100 of inputting the training samples into the student network model to obtain the student feature group includes, but is not limited to, the following step:
Step S2110: input the training samples into the student network model to obtain the student feature group including the student encoding features, the student decoding features, and the student prediction result features.
By configuring the student network model to omit a pre-trained student backbone model and composing it of only the student encoder and the student decoder, the structure of the student network model is significantly simplified and highly lightweight. When the trained student network model is used for prediction, its prediction speed is effectively improved, so that it is fast when used for plane detection and restoration. The intermediate network features also have learning potential; iteratively training the student network model with a distillation loss function group containing three distillation loss functions, that is, through a step-by-step distillation process, helps alleviate the negative impact of hard association, and the trained student network model finally obtained can deliver both real-time and high-accuracy prediction performance.
It can be understood that, referring to Figure 3, step S2110 of inputting the training samples into the student network model to obtain the student feature group including the student encoding features, the student decoding features, and the student prediction result features includes, but is not limited to, the following steps:
Step S2111: input the training samples into the student encoder for downsampling encoding to obtain the student encoding features;
Step S2112: obtain fusion feature layers according to the downsampling encoding process of the student encoder;
Step S2113: convolve the student encoding features to obtain the student decoding features;
Step S2114: input the student decoding features and the fusion feature layers into the student decoder, where the student decoding features and the fusion feature layers are first fused and then upsampled and decoded to obtain the student prediction result features.
The student encoder performs downsampling encoding on the input data of the training samples. During downsampling, a fast downsampling strategy can be adopted, and feature extraction and recognition with a sufficiently large receptive field effectively improves recognition speed. The downsampling operation, however, causes a loss of spatial information, and this lost information cannot be recovered in subsequent processing. By extracting corresponding features during the student encoder's downsampling process as fusion feature layers, to be fused with the corresponding features in the student decoder's upsampling decoding process, the spatial information lost during downsampling is compensated, which effectively ensures the reliability of the student prediction result features obtained after upsampling and decoding by the student decoder.
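For illustration only, a minimal encoder-decoder of this kind can be sketched as follows; the channel widths, number of downsampling stages, and output head are placeholders chosen for the sketch, not values taken from the embodiments:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StudentNet(nn.Module):
        # Minimal encoder-decoder sketch with fusion feature layers; channel widths are illustrative.
        def __init__(self, chs=(16, 32, 64, 128), out_ch=1):
            super().__init__()
            enc, in_c = [], 3
            for c in chs:  # each encoder stage halves the resolution (downsampling encoding)
                enc.append(nn.Sequential(nn.Conv2d(in_c, c, 3, 2, 1), nn.ReLU(inplace=True)))
                in_c = c
            self.enc = nn.ModuleList(enc)
            self.mid = nn.Conv2d(chs[-1], chs[-1], 3, padding=1)  # convolution before the decoder
            dec = []
            for c_deep, c_skip in zip(chs[::-1][:-1], chs[::-1][1:]):
                dec.append(nn.Sequential(nn.Conv2d(c_deep + c_skip, c_skip, 3, padding=1),
                                         nn.ReLU(inplace=True)))
            self.dec = nn.ModuleList(dec)
            self.head = nn.Conv2d(chs[0], out_ch, 1)

        def forward(self, x):
            skips = []
            for stage in self.enc:
                x = stage(x)
                skips.append(x)                    # fusion feature layer at each scale
            s_enc = skips[-1]                      # student encoding features
            s_dec = self.mid(s_enc)                # student decoding features
            x = s_dec
            for stage, skip in zip(self.dec, skips[-2::-1]):
                x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
                x = stage(torch.cat([x, skip], dim=1))   # fuse, then continue upsampling decoding
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            s_pred = self.head(x)                  # student prediction result features
            return s_enc, s_dec, s_pred

In this sketch the stored skip outputs, the output of the convolution before the decoder, and the final head output play the roles of the fusion feature layers, the student decoding features, and the student prediction result features, respectively.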
It can be understood that the working principle of the student network model is shown in Figure 5. Step S2112 of obtaining the fusion feature layers according to the downsampling encoding process of the student encoder includes, but is not limited to, the following step:
obtaining the fusion feature layer at each scale according to each downsampled intermediate feature map formed during the downsampling encoding process of the student encoder.
Step S2114, in which the student decoding features and the fusion feature layers are input into the student decoder and are first fused and then upsampled and decoded to obtain the student prediction result features, includes, but is not limited to, the following step:
inputting the student decoding features into the student decoder for upsampling decoding, and fusing each fusion feature layer with the upsampled intermediate feature map at the corresponding scale in the student decoder's upsampling decoding process, the student decoder finally outputting the student prediction result features.
When the student encoder downsamples and encodes the input data of the training samples, multiple layers of downsampling are usually required. Once too many downsampling layers are used, most of the spatial information is lost in the downsampling operation, and since this lost information cannot be recovered in subsequent processing, the data provided to the upsampling process is severely distorted and seriously affects the final prediction results. By taking the downsampled shallow features as fusion feature layers and fusing each fusion feature layer with the upsampled deep features at the same scale, spatial details are gradually restored, which in turn effectively guarantees the reliability of the features output by the student decoder.
It can be understood that the working principle of the teacher network model is shown in Figure 6. Step S2200 of inputting the training samples into the teacher network model to obtain the teacher feature group includes, but is not limited to, the following step:
Step S2210: input the training samples into the teacher network model to obtain the teacher feature group including teacher encoding features, teacher decoding features, and teacher prediction result features, where the teacher network model includes a teacher backbone network model, a teacher encoder, and a teacher decoder; the teacher backbone network model outputs the teacher encoding features, the teacher encoding features are input into the teacher encoder, which outputs the teacher decoding features, and the teacher decoding features are input into the teacher decoder, which outputs the teacher prediction result features.
Specifically, in the teacher network model, the training samples are input into the teacher backbone network model, which outputs the teacher encoding features; the teacher encoding features are input into the teacher encoder, which outputs the teacher decoding features; and the teacher decoding features are input into the teacher decoder, which outputs the teacher prediction result features.
It can be understood that, referring to Figure 4, step S2300 of obtaining the distillation loss function group according to the student feature group and the teacher feature group includes, but is not limited to, the following steps:
Step S2310: obtain an encoding loss function according to the student encoding features and the teacher encoding features, where the encoding loss function is used to correct the downsampling encoding of the student encoder so that the student encoder outputs more accurate student encoding features;
Step S2320: obtain a decoding loss function according to the student decoding features and the teacher decoding features, where the decoding loss function is used to correct the convolution before the student decoder to ensure the accuracy of the student decoding features input to the student decoder;
Step S2330: obtain a prediction result loss function according to the student prediction result features and the teacher prediction result features, where the prediction result loss function is used to correct the upsampling decoding of the student decoder so that the student network model outputs more accurate student prediction result features.
The working principle of the knowledge distillation based neural network training method is shown in Figure 7. Features are extracted from the outputs of three network layers in the teacher network model and the corresponding network layers in the student network model, and corresponding distillation loss functions are generated. The student network model is iteratively trained with the three distillation loss functions corresponding to the different network layers, that is, a direct and effective one-to-one matching is achieved between the corresponding levels of the student network model and the teacher network model. This effectively ensures the accuracy of data processing at the corresponding network layers in the student network model, improves the performance of the student network model at the architectural level, and effectively ensures the generalization of the student network model and the accuracy of its predictions.
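Purely as a sketch of how such layer-wise matching can be computed, the following helper compares each student feature with its teacher counterpart; the 1x1 projection layers, their channel widths, and the bilinear resizing are assumptions introduced here so that features of different sizes can be compared, not details recited in the embodiments:

    import torch.nn as nn
    import torch.nn.functional as F

    def feature_match_loss(s_feat, t_feat, adapter=None):
        # Optionally project the student channels to the teacher's width, then align
        # the spatial size before the L2 comparison.
        if adapter is not None:
            s_feat = adapter(s_feat)
        if s_feat.shape[-2:] != t_feat.shape[-2:]:
            s_feat = F.interpolate(s_feat, size=t_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        return F.mse_loss(s_feat, t_feat)

    # Hypothetical channel widths; the real values depend on the chosen backbones.
    adapt_enc = nn.Conv2d(128, 768, kernel_size=1)
    adapt_dec = nn.Conv2d(64, 256, kernel_size=1)

    def distillation_loss_group(s_enc, s_dec, s_pred, t_enc, t_dec, t_pred):
        return (feature_match_loss(s_enc, t_enc, adapt_enc),   # encoding loss function
                feature_match_loss(s_dec, t_dec, adapt_dec),   # decoding loss function
                feature_match_loss(s_pred, t_pred))            # prediction result loss function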
A knowledge distillation architecture with multiple student networks may adopt a critical-learning-aware KD (Knowledge Distillation) scheme that ensures the formation of key connections and allows the teacher's information flow to be imitated effectively, rather than merely training a single student. Allowing direct and effective one-to-one matching between the corresponding levels of the student network model and the teacher network model, the teacher network model and the student network model are adaptively divided into three parts, each part is assigned the adaptive parameters of the network layers it contains, and knowledge distillation learning is performed. Semantic correction of shallow-layer feature associations significantly improves the effectiveness of feature knowledge transfer, and cross-layer rectification with an attention mechanism alleviates the problem of semantic mismatch.
It can be understood that step S3000 of training the student network model according to the distillation loss function group to obtain the trained student network model includes, but is not limited to, the following step:
correcting the downsampling encoding of the student encoder according to the encoding loss function, correcting the convolution before the student decoder according to the decoding loss function, and correcting the upsampling decoding of the student decoder according to the prediction result loss function; the student network model is trained in this way to obtain the trained student network model.
While being trained on the task itself, the student network is additionally trained with the distillation loss function group containing multiple intermediate-layer loss functions, and transfer learning of the intermediate feature layers strengthens the student network's estimation performance. Specifically, in the encoding dimension, the decoding dimension, and the prediction result dimension, this ensures that the teacher network model provides more reliable parameter learning for the student network model when capacity underflows.
It can be understood that a teacher network based on transformer modules can perform detection over the global region, so the teacher network model is built on transformer modules, with the HR-Net model serving as the teacher backbone network model for feature extraction and generating high-dimensional, low-scale features as patch embeddings. The HR-Net model is a high-resolution network. With a patch embedding size of p, an H×W pixel image is divided into a set of feature patch embeddings {S_0, S_1, S_2, …}, with S_0 ∈ R^D and so on, where R^D is the feature space output by the teacher backbone network model and the number of feature patches is N = H·W/p². These embeddings are finally fed into a transformer module with 12 layers in total. The teacher network model includes a depth estimation branch, which takes the multi-scale features of the teacher backbone network model and the teacher encoding features as input sources and estimates the image depth through a top-down decoding structure. This structure uses bilinear-interpolation upsampling modules, and the feature map after each upsampling corresponds to a feature scale of the teacher backbone network model, that is, a 2× upsampling mechanism is applied to estimate the image depth, with the teacher backbone network model outputting the corresponding feature dimensions.
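As a purely illustrative sketch of this patch-embedding step (the input channel count, embedding dimension D, and patch size p below are placeholders rather than values fixed by the disclosure), the division of an H×W image into N = H·W/p² patch embeddings can be implemented with a strided convolution:

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Splits an H x W input into p x p patches projected to a D-dimensional embedding.
        def __init__(self, in_ch=3, embed_dim=768, patch=16):
            super().__init__()
            self.patch = patch
            self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

        def forward(self, x):
            b, _, h, w = x.shape
            tokens = self.proj(x).flatten(2).transpose(1, 2)   # shape (B, N, D)
            # Number of patch embeddings: N = H*W / p^2 (H and W assumed divisible by p).
            assert tokens.shape[1] == (h * w) // (self.patch ** 2)
            return tokens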
It can be understood that the final outputs of the teacher network model and the student network model are corrected with an L2 loss function. Since there is no maximum-value function in the network, the L2 loss function is applied to the features before the last activation layer of the corresponding network model. When training the student network model, the distillation loss function group together with the L2 loss function achieves a more reliable correction effect, and when applied to plane restoration and reconstruction, the trained student network model has higher prediction accuracy.
The knowledge distillation based neural network training method of the first aspect of the present invention is described in detail below with a specific embodiment. It should be understood that the following description is only illustrative and does not specifically limit the invention.
An untrained student network model and a trained teacher network model are constructed, where the teacher network model is designed on the basis of transformer modules and includes a teacher backbone network model, a teacher encoder, and a teacher decoder, the teacher backbone network model using the HR-Net model; the student network model includes a student encoder and a student decoder, and the setting of a backbone network model is omitted from the student network model;
The training samples are input into the student encoder for downsampling encoding to obtain student encoding features; mobilenet-v3, which is a lightweight network, is used as the feature extractor; fusion feature layers are obtained from the downsampled intermediate feature maps of the student encoder's downsampling encoding process; and, according to the fusion feature layers, the student encoding features are input into the student decoder for upsampling decoding to obtain student decoding features;
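A possible realization of this lightweight feature extractor is sketched below; it assumes a recent torchvision is available, and which intermediate blocks are tapped for fusion feature layers is a free design choice of the sketch rather than something fixed by the embodiment:

    import torch
    import torchvision

    # mobilenet_v3_small from torchvision used as the student feature extractor (weights=None for brevity).
    backbone = torchvision.models.mobilenet_v3_small(weights=None).features

    def extract_multiscale(img):
        per_scale, x = {}, img
        for block in backbone:
            x = block(x)
            per_scale[x.shape[-1]] = x        # keep the deepest feature map seen at each resolution
        return list(per_scale.values())       # shallow-to-deep fusion feature layers

    feats = extract_multiscale(torch.randn(1, 3, 224, 224))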
The training samples are input into the teacher network model, and the teacher encoding features of the teacher backbone network model, the teacher decoding features output by the teacher encoder, and the teacher prediction result features output by the teacher decoder are obtained; the student encoding features output by the student encoder, the student decoding features obtained by convolution before the student decoder, and the student prediction result features output by the student decoder are obtained;
The encoding loss function is obtained according to the student encoding features and the teacher encoding features, the decoding loss function is obtained according to the student decoding features and the teacher decoding features, and the prediction result loss function is obtained according to the student prediction result features and the teacher prediction result features;
The downsampling encoding of the student encoder is corrected according to the encoding loss function, the convolution before the student decoder is corrected according to the decoding loss function, and the upsampling decoding of the student decoder is corrected according to the prediction result loss function; the student network model is trained in this way to obtain a trained student network model, which can be used to implement plane restoration and reconstruction.
In operation, the student network model works as follows: the training samples are input into the student encoder for downsampling encoding to obtain the student encoding features; the fusion feature layer at each scale is obtained from each downsampled intermediate feature map generated during the student encoder's downsampling encoding process; the student encoding features are convolved to obtain the student decoding features; the student decoding features are input into the student decoder for upsampling decoding, each fusion feature layer is fused with the upsampled intermediate feature map at the corresponding scale in the student decoder, and the student decoder outputs the student prediction result features. At each decoding stage in the student decoder, a feature fusion module concatenates the same-scale shallow features, that is, the fusion feature layers, with the upsampled intermediate feature maps of the upsampling decoding process, at resolutions of 1/32, 1/16, 1/8, 1/4, and 1/2 respectively, which ensures that features of the same scale have the same feature channels after each feature fusion; finally, transfer learning is performed separately on the student encoding features, the student decoding features, and the student prediction result features.
In addition, an embodiment of the second aspect of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor and the memory may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the knowledge distillation based neural network training method of the embodiment of the first aspect are stored in the memory, and when executed by the processor, they perform the knowledge distillation based neural network training method of the above embodiments, for example, performing the method steps S1000 to S3000, method steps S2100 to S2300, method step S2110, method steps S2111 to S2114, method step S2210, and method steps S2310 to S2330 described above.
The device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or controller, for example by a processor in the above device embodiment, cause the processor to perform the knowledge distillation based neural network training method of the above embodiments, for example, performing the method steps S1000 to S3000, method steps S2100 to S2300, method step S2110, method steps S2111 to S2114, method step S2210, and method steps S2310 to S2330 described above.
Those of ordinary skill in the art will understand that all or some of the steps and systems of the methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
In the description of this specification, a description with reference to the terms "one embodiment", "some embodiments", "an illustrative embodiment", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and purposes of the present invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

  1. A knowledge distillation based neural network training method, characterized by comprising the following steps:
    constructing an untrained student network model and a trained teacher network model;
    obtaining a distillation loss function group according to training samples, the student network model, and the teacher network model, wherein the distillation loss function group comprises an encoding loss function, a decoding loss function, and a prediction result loss function; and
    training the student network model according to the distillation loss function group to obtain a trained student network model.
  2. The knowledge distillation based neural network training method according to claim 1, characterized in that the obtaining a distillation loss function group according to training samples, the student network model, and the teacher network model comprises:
    inputting the training samples into the student network model to obtain a student feature group;
    inputting the training samples into the teacher network model to obtain a teacher feature group; and
    obtaining the distillation loss function group according to the student feature group and the teacher feature group.
  3. The knowledge distillation based neural network training method according to claim 2, characterized in that the inputting the training samples into the student network model to obtain a student feature group comprises:
    inputting the training samples into the student network model to obtain the student feature group comprising student encoding features, student decoding features, and student prediction result features.
  4. The knowledge distillation based neural network training method according to claim 3, characterized in that the inputting the training samples into the teacher network model to obtain a teacher feature group comprises:
    inputting the training samples into the teacher network model to obtain the teacher feature group comprising teacher encoding features, teacher decoding features, and teacher prediction result features, wherein the teacher network model comprises a teacher backbone network model, a teacher encoder, and a teacher decoder, the teacher backbone network model outputs the teacher encoding features, the teacher encoder outputs the teacher decoding features, and the teacher decoder outputs the teacher prediction result features.
  5. The knowledge distillation based neural network training method according to claim 4, characterized in that the obtaining the distillation loss function group according to the student feature group and the teacher feature group comprises:
    obtaining an encoding loss function according to the student encoding features and the teacher encoding features;
    obtaining a decoding loss function according to the student decoding features and the teacher decoding features; and
    obtaining a prediction result loss function according to the student prediction result features and the teacher prediction result features.
  6. The knowledge distillation based neural network training method according to claim 3, characterized in that the student network model comprises a student encoder and a student decoder; and
    the inputting the training samples into the student network model to obtain the student feature group comprising student encoding features, student decoding features, and student prediction result features comprises:
    inputting the training samples into the student encoder for downsampling encoding to obtain the student encoding features;
    obtaining fusion feature layers according to the downsampling encoding process of the student encoder;
    convolving the student encoding features to obtain the student decoding features; and
    inputting the student decoding features and the fusion feature layers into the student decoder, where they are first fused and then upsampled and decoded to obtain the student prediction result features.
  7. The knowledge distillation based neural network training method according to claim 6, characterized in that the obtaining fusion feature layers according to the downsampling encoding process of the student encoder comprises:
    obtaining the fusion feature layer at each scale according to each downsampled intermediate feature map formed by the downsampling encoding process of the student encoder.
  8. The knowledge distillation based neural network training method according to claim 7, characterized in that the inputting the student decoding features and the fusion feature layers into the student decoder, where they are first fused and then upsampled and decoded to obtain the student prediction result features, comprises:
    inputting the student decoding features into the student decoder for upsampling decoding, and fusing each fusion feature layer with the upsampled intermediate feature map at the corresponding scale in the upsampling decoding process of the student decoder, the student decoder outputting the student prediction result features.
  9. An electronic device, characterized by comprising:
    a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the knowledge distillation based neural network training method according to any one of claims 1 to 8.
  10. A computer storage medium, characterized in that it stores computer-executable instructions for performing the knowledge distillation based neural network training method according to any one of claims 1 to 8.
PCT/CN2022/098769 2022-05-05 2022-06-14 Knowledge distillation based neural network training method, device, and storage medium WO2023212997A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210479268.2 2022-05-05
CN202210479268 2022-05-05
CN202210646401.9A CN114936605A (en) 2022-06-09 2022-06-09 Knowledge distillation-based neural network training method, device and storage medium
CN202210646401.9 2022-06-09

Publications (1)

Publication Number Publication Date
WO2023212997A1 true WO2023212997A1 (en) 2023-11-09

Family

ID=88646181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/098769 WO2023212997A1 (en) 2022-05-05 2022-06-14 Knowledge distillation based neural network training method, device, and storage medium

Country Status (1)

Country Link
WO (1) WO2023212997A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190046068A1 (en) * 2017-08-10 2019-02-14 Siemens Healthcare Gmbh Protocol independent image processing with adversarial networks
US20200125927A1 (en) * 2018-10-22 2020-04-23 Samsung Electronics Co., Ltd. Model training method and apparatus, and data recognition method
CN110852426A (en) * 2019-11-19 2020-02-28 成都晓多科技有限公司 Pre-training model integration acceleration method and device based on knowledge distillation
CN111950302A (en) * 2020-08-20 2020-11-17 上海携旅信息技术有限公司 Knowledge distillation-based machine translation model training method, device, equipment and medium
CN111932561A (en) * 2020-09-21 2020-11-13 深圳大学 Real-time enteroscopy image segmentation method and device based on integrated knowledge distillation
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN114529796A (en) * 2022-01-30 2022-05-24 北京百度网讯科技有限公司 Model training method, image recognition method, device and electronic equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557857A (en) * 2023-11-23 2024-02-13 哈尔滨工业大学 Detection network light weight method combining progressive guided distillation and structural reconstruction
CN117421678A (en) * 2023-12-19 2024-01-19 西南石油大学 Single-lead atrial fibrillation recognition system based on knowledge distillation
CN117425013A (en) * 2023-12-19 2024-01-19 杭州靖安防务科技有限公司 Video transmission method and system based on reversible architecture
CN117421678B (en) * 2023-12-19 2024-03-22 西南石油大学 Single-lead atrial fibrillation recognition system based on knowledge distillation
CN117425013B (en) * 2023-12-19 2024-04-02 杭州靖安防务科技有限公司 Video transmission method and system based on reversible architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22940691

Country of ref document: EP

Kind code of ref document: A1