CN113963176B - Model distillation method and device, electronic equipment and storage medium - Google Patents

Model distillation method and device, electronic equipment and storage medium

Info

Publication number
CN113963176B
CN113963176B CN202111265972.XA CN202111265972A
Authority
CN
China
Prior art keywords
feature map
model
student model
student
similarity matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111265972.XA
Other languages
Chinese (zh)
Other versions
CN113963176A (en)
Inventor
杨馥魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111265972.XA priority Critical patent/CN113963176B/en
Publication of CN113963176A publication Critical patent/CN113963176A/en
Application granted granted Critical
Publication of CN113963176B publication Critical patent/CN113963176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model distillation method, a device, electronic equipment and a storage medium, relates to the technical field of artificial intelligence, in particular to the technical field of computer vision and deep learning processing, and can be applied to scenes such as image processing and image recognition. The specific implementation scheme is as follows: acquiring a sample image, a teacher model and a student model; analyzing the sample image by using a teacher model to obtain a first feature map; analyzing the sample image by using the student model to obtain a second feature map; calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map; weighting the second feature map and the spatial similarity matrix to obtain a third feature map; calculating the loss of the student model according to the first feature map and the third feature map; and adjusting training parameters of the student model according to the loss of the student model to obtain the trained student model. The present disclosure enables distillation of models.

Description

Model distillation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of computer vision and deep learning, and may be applied to scenes such as image processing and image recognition.
Background
Model distillation is a common technique for model compression and migration. It aims to compress and migrate a complex model into a simpler model, simplifying computation while preserving the functionality of the complex model. In general, model distillation uses a trained complex model to supervise the training of a simpler model, so that the simpler model acquires the characteristics of the complex model, thereby realizing compression and migration of the complex model.
Disclosure of Invention
The present disclosure provides a model distillation method, apparatus, electronic device, and storage medium.
According to an aspect of the present disclosure, there is provided a model distillation method comprising:
acquiring a sample image, a teacher model and a student model;
analyzing the sample image by using a teacher model to obtain a first feature map; analyzing the sample image by using the student model to obtain a second feature map;
calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map;
weighting the second feature map and the spatial similarity matrix to obtain a third feature map;
calculating the loss of the student model according to the first feature map and the third feature map;
and adjusting training parameters of the student model according to the loss of the student model to obtain the trained student model.
According to another aspect of the present disclosure, there is provided a model distillation apparatus comprising:
the acquisition module is used for acquiring a sample image, a teacher model and a student model;
the image analysis module is used for analyzing the sample image by using the teacher model to obtain a first feature map; analyzing the sample image by using the student model to obtain a second feature map;
the feature map calculation module is used for calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map;
the feature map weighting module is used for weighting the second feature map and the space similarity matrix to obtain a third feature map;
the loss calculation module is used for calculating the loss of the student model according to the first characteristic diagram and the third characteristic diagram;
and the model training module is used for adjusting training parameters of the student model according to the loss of the student model to obtain the trained student model.
According to the model distillation method provided by the present disclosure, the feature map of the student model is weighted using the spatial similarity matrix computed from that same feature map, and the loss of the student model is calculated from the feature map of the teacher model and the weighted feature map of the student model. Finally, the training parameters of the student model are adjusted according to this loss, and the student model is trained to obtain the trained student model, thereby achieving distillation of the model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of one embodiment of a model distillation method provided in accordance with the present disclosure;
FIG. 2 is a schematic flow diagram of one possible implementation of step S13 in a model distillation method provided in accordance with the present disclosure;
FIG. 3 is a schematic flow diagram of one possible implementation of step S14 in a model distillation method provided in accordance with the present disclosure;
FIG. 4 is a schematic flow diagram of one possible implementation of step S15 in a model distillation method provided in accordance with the present disclosure;
FIG. 5 is a schematic structural view of a model distillation apparatus provided in accordance with the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a model distillation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, a trained complex model serves as a teacher model and a simpler model serves as a student model, and the features output by the teacher model are used to supervise the training of the student model, so that the student model acquires the function of the teacher model, realizing compression and migration of the teacher model.
In the prior art, model structures generally fall into two types: the CNN (convolutional neural network) structure and the Transformer structure. A model with a CNN structure extracts local features of an image, while a model with a Transformer structure can extract global features of the image. Therefore, in practical applications, the teacher-student distillation method can only be applied to models with the same structure; that is, the teacher model and the student model must both have a CNN structure or both have a Transformer structure. When the structures of the teacher model and the student model differ, model distillation cannot be realized because the feature maps produced by the different structures differ.
To solve this problem, the present disclosure provides a model distillation method comprising:
acquiring a sample image, a teacher model and a student model;
analyzing the sample image by using a teacher model to obtain a first feature map; analyzing the sample image by using the student model to obtain a second feature map;
calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map;
weighting the second feature map and the spatial similarity matrix to obtain a third feature map;
calculating the loss of the student model according to the first feature map and the third feature map;
and adjusting training parameters of the student model according to the loss of the student model to obtain the trained student model.
From the above, by applying the model distillation method provided by the present disclosure, the feature map of the student model is weighted using the spatial similarity matrix, and the loss of the student model is calculated from the weighted student feature map and the feature map of the teacher model. This avoids the model distillation failure that the differing feature maps of heterogeneous models would otherwise cause when the structures of the teacher model and the student model differ, realizing model migration across structures. The training parameters of the student model are adjusted according to the obtained loss, and the student model is trained on the adjusted parameters, so that the output of the trained student model approaches the output of the teacher model as closely as possible. The student model thus acquires the features and functions of the teacher model, completing, independently of model structure, the distillation from a structurally complex teacher model to a structurally simpler student model.
The model distillation method provided in the present disclosure will be described in detail by way of specific examples.
The method of the embodiments of the present disclosure is applied to and executed by an intelligent terminal, which in practice may be a computer, a smartphone, or the like.
Referring to fig. 1, fig. 1 is a schematic flow chart of a model distillation method according to an embodiment of the disclosure, including the following steps S11-S16.
Step S11: a sample image is acquired, and a teacher model and a student model.
The sample image refers to a sample for training a teacher model and a student model, and is a plurality of images having respective image features collected in advance. The teacher model is a trained deep learning model and has more complex features or functions. The student model is a deep learning model with simpler characteristics or functions.
In one example, the teacher model and the student model may have different structures. For example, the structure of the teacher model may be a Transformer structure, e.g., the BERT (Bidirectional Encoder Representations from Transformers) model, while the structure of the student model may be a CNN structure, e.g., the LeNet-5 model or the AlexNet model.
Step S12: analyzing the sample image by using a teacher model to obtain a first feature map; and analyzing the sample image by using the student model to obtain a second characteristic diagram.
Both the teacher model and the student model have an image feature extraction capability. The sample images are fed as input to the teacher model and the student model respectively, and by analyzing the sample images with the two models, a first feature map corresponding to the teacher model and a second feature map corresponding to the student model are obtained.
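As a minimal illustration of this step, the following PyTorch sketch extracts the two feature maps from a batch of sample images; the names teacher, student, and extract_feature_maps are illustrative assumptions, not taken from the patent:

    import torch

    # Hypothetical sketch: `teacher` and `student` are assumed to be loaded
    # torch.nn.Module instances whose forward pass returns a feature map of
    # shape (n, c, w, h) for a batch of sample images.
    def extract_feature_maps(teacher, student, images):
        with torch.no_grad():       # the teacher is already trained and stays fixed
            t = teacher(images)     # first feature map
        s = student(images)         # second feature map; gradients flow through it
        return t, s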
Step S13: and calculating the spatial characteristics of the second characteristic diagram to obtain a spatial similarity matrix of the second characteristic diagram.
The second feature map corresponding to the student model can represent spatial features of the student model, and a spatial transformation is performed on the second feature map to obtain its spatial similarity matrix. When the features extracted by the student model are local features, the second feature map may represent only the local features corresponding to the student model, while the spatial similarity matrix carries spatial features similar to those of the second feature map.
In one example, the student model has a CNN structure. The features extracted by a CNN-structured student model are local features, and the corresponding second feature map likewise represents only the local features of the student model, so the spatial similarity matrix can complement the other spatial features of the CNN-structured student model that the second feature map does not represent.
Step S14: and weighting the second feature map and the spatial similarity matrix to obtain a third feature map.
As described above, the second feature map may represent only the local features corresponding to the student model, while the spatial similarity matrix carries spatial features similar to those of the second feature map. Weighting the second feature map with the spatial similarity matrix yields a third feature map that complements the other spatial features of the student model not represented by the second feature map; that is, the third feature map can represent global features corresponding to the student model.
Step S15: and calculating the loss of the student model according to the first characteristic diagram and the third characteristic diagram.
In the process of training a student model, a supervised learning mode can be adopted, and the prediction accuracy of the model can be estimated by using the loss generated in the supervised learning process. The loss of the student model is calculated from the first feature map and the third feature map, and may be calculated using a loss function, such as a mean square error, a square loss, or the like.
Step S16: and adjusting training parameters of the student model according to the loss of the student model to obtain the trained student model.
The training parameters of the student model are adjusted according to its loss, and training continues on the basis of the adjusted parameters until the student model converges, at which point training ends and the trained student model is obtained. Because the loss of the student model is calculated from the first feature map corresponding to the teacher model and the third feature map corresponding to the student model, under the supervision of this loss the output of the trained student model can approach the output of the teacher model; that is, the trained student model can have the features and functions of the teacher model.
From the above, by applying the model distillation method provided by the present disclosure, the feature map of the student model is weighted using the spatial similarity matrix, and the loss of the student model is calculated from the weighted student feature map and the feature map of the teacher model. This avoids the model distillation failure that the differing feature maps of heterogeneous models would otherwise cause when the structures of the teacher model and the student model differ, realizing model migration across structures. The training parameters of the student model are adjusted according to the obtained loss, and the student model is trained on the adjusted parameters, so that the output of the trained student model approaches the output of the teacher model as closely as possible. The student model thus acquires the features and functions of the teacher model, completing, independently of model structure, the distillation from a structurally complex teacher model to a structurally simpler student model.
In one possible implementation, the dimensions of the second feature map include: the number of channels, the width of the second profile, the length of the second profile. Referring to fig. 2, the step S13 performs calculation of spatial features on the second feature map to obtain a spatial similarity matrix of the second feature map, including:
step S21: and normalizing the second feature map based on the channel number of the second feature map to obtain a normalized second feature map.
The number of channels, the width, and the length of the second feature map correspond respectively to a channel matrix, a width matrix, and a length matrix, which may be different or identical. Normalizing the second feature map based on its number of channels means normalizing the channel matrix, that is, dividing the channel matrix by its modulus to obtain the normalized channel matrix. The second feature map, with its channel matrix normalized and its width and length matrices unchanged, is the normalized second feature map.
In one example, the dimensions of the second feature map may further include the number of images of the sample image, the dimensions of the second feature map may be expressed as (n, c, w, h), n representing the number of images, c representing the number of channels, w representing the width, and h representing the length.
Step S22: and carrying out dimension combination on the width and the length of the normalized second feature map to obtain a fourth feature map.
Merging the width and length dimensions of the normalized second feature map means multiplying the width matrix and the length matrix to obtain a width-length matrix as a new dimension, with the other dimensions of the second feature map unchanged, yielding the fourth feature map. For example, if the dimensions of the second feature map include the number of images, the normalized channel matrix, the width matrix, and the length matrix, then the dimensions of the fourth feature map include the number of images, the normalized channel matrix, and the width-length matrix.
Step S23: and transposing the fourth characteristic diagram to obtain a space similarity matrix.
Transposing the fourth feature map means interchanging the rows and columns of the matrices within its dimensions, thereby obtaining the spatial similarity matrix. For example, the elements at corresponding positions in each row and each column of the width-length matrix of the fourth feature map are interchanged to obtain the spatial similarity matrix.
From the above, by applying the model distillation method provided by the disclosure, spatial features are calculated over the three dimensions of the second feature map (the number of channels, the width, and the length) to obtain a spatial similarity matrix. This enriches the features of the student model that the second feature map cannot fully represent, characterizing the student model more comprehensively.
In an embodiment of the present disclosure, the step S23 transposes the fourth feature map to obtain a spatial similarity matrix, including:
the fourth feature map is transposed according to the following formula:
M=transpose(s_n)*s_n
where M is the spatial similarity matrix, s_n is the fourth feature map, and transpose() is the transpose function.
The dimensions of the fourth feature map include the number of images, the normalized channel dimension, and the width-length dimension. When the transposed fourth feature map is multiplied by the fourth feature map before transposition, the normalized channel dimension is contracted away and replaced by a second width-length dimension, so the dimensions of the resulting spatial similarity matrix can be expressed as (n, w*h, w*h), where n represents the number of images and w*h represents the merged width-length dimension.
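As a sketch of steps S21 to S23, assuming that "normalizing based on the number of channels" means L2-normalizing each spatial position across its channels (the patent does not spell out the norm), the spatial similarity matrix could be computed as follows; the function name is illustrative:

    import torch

    def spatial_similarity_matrix(s):
        # s: second feature map of shape (n, c, w, h)
        n, c, w, h = s.shape
        s_norm = s / s.norm(dim=1, keepdim=True).clamp_min(1e-12)  # S21: channel normalization
        s_n = s_norm.reshape(n, c, w * h)                          # S22: merge width and length
        M = s_n.transpose(1, 2) @ s_n                              # S23: M = transpose(s_n) * s_n
        return M                                                   # shape (n, w*h, w*h)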
In one possible implementation manner, referring to fig. 3, step S14 weights the second feature map and the spatial similarity matrix to obtain a third feature map, where the step includes:
step S31: and carrying out dimension combination on the width and the length of the second feature map to obtain a dimension-combined second feature map.
Merging the width and length dimensions of the second feature map means multiplying its width matrix and length matrix to obtain a width-length matrix as a new dimension, with the other dimensions of the second feature map unchanged, yielding the dimension-merged second feature map.
Step S32: and weighting the second feature images after dimension combination by using the space similarity matrix to obtain weighted second feature images.
Weighting the dimension-merged second feature map with the spatial similarity matrix means multiplying the dimension-merged second feature map by the spatial similarity matrix; this completes the weighting and yields the weighted second feature map.
Step S33: and carrying out dimension reconstruction on the weighted second feature map to obtain a third feature map.
Wherein the dimensions of the third feature map include: the number of channels, the width of the third profile, the length of the third profile.
As described above, the dimensions of the spatial similarity matrix include the number of images and two width-length dimensions, i.e., (n, w*h, w*h). The dimensions of the dimension-merged second feature map include the number of images, the number of channels, and the width-length dimension, i.e., (n, c, w*h), and the weighted second feature map has these same dimensions (n, c, w*h).
Performing dimension reconstruction on the weighted second feature map means restoring the width-length dimension into separate width and length dimensions; that is, the dimensions of the weighted second feature map are reconstructed into the number of images, the number of channels, the width of the third feature map, and the length of the third feature map.
In one embodiment of the present disclosure, the step S33 performs dimension reconstruction on the weighted second feature map to obtain a third feature map, including:
and carrying out dimension reconstruction on the weighted second feature map according to the following formula:
N=reshape(s*M)
wherein N is the third feature map, s is the second feature map, M is the spatial similarity matrix, s*M is the weighted second feature map, and reshape() is the dimension reconstruction function.
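A sketch of steps S31 to S33 under the same assumptions as above (batched matrix product, illustrative function name):

    def weight_feature_map(s, M):
        # s: second feature map of shape (n, c, w, h)
        # M: spatial similarity matrix of shape (n, w*h, w*h)
        n, c, w, h = s.shape
        s_flat = s.reshape(n, c, w * h)   # S31: merge width and length
        weighted = s_flat @ M             # S32: weight with the spatial similarity matrix
        N = weighted.reshape(n, c, w, h)  # S33: dimension reconstruction
        return N                          # third feature map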
From the above, by applying the model distillation method provided by the present disclosure, the dimensions of the second feature map are transformed, the second feature map is weighted using the spatial similarity matrix, and the dimensions are reconstructed after weighting, so that the dimensions of the third feature map again comprise the number of channels, the width of the third feature map, and the length of the third feature map. While avoiding dimension errors, the third feature map thus complements the features of the student model compared with the second feature map.
In one possible implementation manner, referring to fig. 4, the step S15 calculates the loss of the student model according to the first feature map and the third feature map, including:
step S41: regularization is carried out on the first feature map and the third feature map, and a regularization item of the student model is obtained.
Regularization is performed on the first feature map and the third feature map, which may be to calculate a gap between the first feature map and the third feature map to obtain a regularization term of the student model. The resulting regularization term may be used to represent an error between the output of the student model and the output of the teacher model.
Step S42: and adding the regular term of the student model into a loss function of the student model to obtain the loss of the student model.
In one embodiment of the present disclosure, regularizing the first feature map and the third feature map in the step S41 to obtain a regularized term of the student model includes:
regularizing the first feature map and the third feature map according to the following formula:
L = L₁(t-N) + L₂(t-N)
wherein L is the regularization term of the student model, t is the first feature map, N is the third feature map, L₁ denotes L₁ regularization, which means taking the absolute value of the difference between the first feature map and the third feature map, and L₂ denotes L₂ regularization, which means squaring the difference between the first feature map and the third feature map.
The L₁ regularization and the L₂ regularization of the difference between the first feature map and the third feature map are added to obtain the regularization term of the student model. The regularization term is then added to the loss function of the student model to obtain the loss of the student model, and the student model continues to be trained under the supervision of this loss.
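A sketch of this loss term, assuming L₁ and L₂ reduce to the mean absolute difference and the mean squared difference respectively (the patent does not fix the reduction, so taking the mean is an assumption):

    def distillation_loss(t, N):
        # t: first feature map (teacher); N: third feature map (weighted student)
        diff = t - N
        return diff.abs().mean() + diff.pow(2).mean()  # L = L1(t-N) + L2(t-N)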
From the above, by applying the model distillation method provided by the disclosure, the loss of the student model is calculated by regularizing the first feature map corresponding to the teacher model and the third feature map corresponding to the student model, so that the error between the output of the student model and the output of the teacher model can be clearly represented. The student model is continuously trained under the supervision of this loss, so that its output approaches the output of the teacher model as closely as possible; the student model thus has the features and functions of the teacher model, and the distillation from the teacher model to the student model is completed.
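Putting the sketches above together, one training iteration might look like the following; the optimizer choice and helper names are illustrative assumptions:

    def train_step(teacher, student, images, optimizer):
        t, s = extract_feature_maps(teacher, student, images)
        M = spatial_similarity_matrix(s)
        N = weight_feature_map(s, M)
        loss = distillation_loss(t, N)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In practice the optimizer would be built over the student's parameters only, e.g. torch.optim.SGD(student.parameters(), lr=0.01), since the teacher stays fixed.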
In another aspect, referring to fig. 5, fig. 5 is a model distillation apparatus provided by the present disclosure, comprising:
an obtaining module 501, configured to obtain a sample image, a teacher model, and a student model;
the image analysis module 502 is configured to analyze the sample image by using the teacher model to obtain a first feature map; analyzing the sample image by using the student model to obtain a second feature map;
a feature map calculation module 503, configured to perform spatial feature calculation on the second feature map, to obtain a spatial similarity matrix of the second feature map;
a feature map weighting module 504, configured to weight the second feature map and the spatial similarity matrix to obtain a third feature map;
a loss calculation module 505, configured to calculate a loss of the student model according to the first feature map and the third feature map;
and the model training module 506 is configured to adjust training parameters of the student model according to the loss of the student model, so as to obtain a trained student model.
From the above, by applying the model distillation device provided by the disclosure, the feature map of the student model is weighted using the spatial similarity matrix, and the loss of the student model is calculated from the weighted student feature map and the feature map of the teacher model. This avoids the model distillation failure that the differing feature maps of heterogeneous models would otherwise cause when the structures of the teacher model and the student model differ, realizing model migration across structures. The training parameters of the student model are adjusted according to the obtained loss, and the student model is trained on the adjusted parameters, so that the output of the trained student model approaches the output of the teacher model as closely as possible. The student model thus acquires the features and functions of the teacher model, completing, independently of model structure, the distillation from a structurally complex teacher model to a structurally simpler student model.
In one embodiment of the present disclosure, the dimensions of the second feature map include: the number of channels, the width of the second feature map, the length of the second feature map; the feature map calculation module 503 includes:
the feature map normalization sub-module is used for normalizing the second feature map based on the channel number of the second feature map to obtain a normalized second feature map;
the feature map dimension merging sub-module is used for carrying out dimension merging on the width and the length of the normalized second feature map to obtain a fourth feature map;
and the feature map transposition sub-module is used for transposing the fourth feature map to obtain a space similarity matrix.
From the above, by applying the model distillation device provided by the disclosure, spatial features are calculated over the three dimensions of the second feature map (the number of channels, the width, and the length) to obtain a spatial similarity matrix. This enriches the features of the student model that the second feature map cannot fully represent, characterizing the student model more comprehensively.
In one embodiment of the disclosure, the feature map transpose submodule is specifically configured to:
transpose the fourth feature map according to the following formula:
M=transpose(s_n)*s_n
wherein M is the spatial similarity matrix, and s_n is the fourth feature map.
In one embodiment of the present disclosure, the feature map weighting module 504 includes:
the dimension merging sub-module is used for carrying out dimension merging on the width and the length of the second feature map to obtain a second feature map after the dimension merging;
the feature map weighting sub-module is used for weighting the second feature map after the dimension combination by utilizing the space similarity matrix to obtain a weighted second feature map;
the dimension reconstruction sub-module is configured to perform dimension reconstruction on the weighted second feature map to obtain the third feature map, where the dimension of the third feature map includes: the number of channels, the width of the third profile, the length of the third profile.
In one embodiment of the disclosure, the dimension reconstruction sub-module is specifically configured to:
and carrying out dimension reconstruction on the weighted second feature map according to the following formula:
N=reshape(s*M)
wherein N is the third feature map, s is the second feature map, M is the spatial similarity matrix, and s*M is the weighted second feature map.
From the above, by applying the model distillation device provided by the present disclosure, the dimensions of the second feature map are transformed, the second feature map is weighted using the spatial similarity matrix, and the dimensions are reconstructed after weighting, so that the dimensions of the third feature map again comprise the number of channels, the width of the third feature map, and the length of the third feature map. While avoiding dimension errors, the third feature map thus complements the features of the student model compared with the second feature map.
In one embodiment of the present disclosure, the loss calculation module 505 includes:
the regular term obtaining sub-module is used for regularizing the first feature map and the third feature map to obtain a regular term of the student model;
and the loss obtaining submodule is used for adding the regular term of the student model into the loss function of the student model to obtain the loss of the student model.
In one embodiment of the disclosure, the regularization term obtaining submodule is specifically configured to:
regularizing the first feature map and the third feature map according to the following formula:
L = L₁(t-N) + L₂(t-N)
wherein L is a regularization term of the student model, t is the first feature map, and N is the third feature map.
From the above, by applying the model distillation device provided by the disclosure, the loss of the student model is calculated by regularizing the first feature map corresponding to the teacher model and the third feature map corresponding to the student model, so that the error between the output of the student model and the output of the teacher model can be clearly represented. The student model is continuously trained under the supervision of this loss, so that its output approaches the output of the teacher model as closely as possible; the student model thus has the features and functions of the teacher model, and the distillation from the teacher model to the student model is completed.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the model distillation method. For example, in some embodiments, the model distillation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the model distillation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the model distillation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A model distillation method for image recognition, comprising:
acquiring a sample image, and a teacher model and a student model for image recognition;
analyzing the sample image by using the teacher model to obtain a first feature map, wherein the first feature map represents global features of the sample image; analyzing the sample image by using the student model to obtain a second feature map, wherein the second feature map characterizes local features of the sample image;
calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map, wherein the spatial similarity matrix characterizes other spatial features which are not represented by the second feature map;
weighting the second feature map and the spatial similarity matrix to obtain a third feature map;
calculating the loss of the student model according to the first feature map and the third feature map;
according to the loss of the student model, training parameters of the student model are adjusted to obtain a trained student model;
and carrying out image recognition by using the trained student model.
2. The model distillation method for image recognition as recited in claim 1, wherein the dimensions of the second feature map include: the number of channels, the width of the second feature map, the length of the second feature map;
the calculating the spatial feature of the second feature map to obtain a spatial similarity matrix of the second feature map includes:
normalizing the second feature map based on the channel number of the second feature map to obtain a normalized second feature map;
dimension merging is carried out on the width and the length of the normalized second feature map to obtain a fourth feature map;
and transposing the fourth characteristic diagram to obtain a space similarity matrix.
3. The model distillation method for image recognition according to claim 2, wherein the transposing the fourth feature map to obtain a spatial similarity matrix includes:
transpose the fourth feature map according to the following formula:
M=transpose(s_n)*s_n
wherein M is the spatial similarity matrix, and s_n is the fourth feature map.
4. The model distillation method for image recognition according to claim 1, wherein said weighting the second feature map with the spatial similarity matrix to obtain a third feature map includes:
dimension merging is carried out on the width and the length of the second feature map, and a second feature map obtained after dimension merging is obtained;
weighting the second feature images after the dimension combination by using the space similarity matrix to obtain weighted second feature images;
and carrying out dimension reconstruction on the weighted second feature map to obtain the third feature map, wherein the dimension of the third feature map comprises: the number of channels, the width of the third profile, the length of the third profile.
5. The model distillation method for image recognition according to claim 4, wherein performing dimension reconstruction on the weighted second feature map to obtain the third feature map includes:
and carrying out dimension reconstruction on the weighted second feature map according to the following formula:
N=reshape(s*M)
wherein N is the third feature map, s is the second feature map, M is the spatial similarity matrix, and s*M is the weighted second feature map.
6. The model distillation method for image recognition according to claim 1, wherein the calculating the loss of the student model from the first feature map and the third feature map comprises:
regularizing the first feature map and the third feature map to obtain a regularized item of the student model;
and adding the regular term of the student model to a loss function of the student model to obtain the loss of the student model.
7. The model distillation method for image recognition according to claim 6, wherein regularizing the first feature map and the third feature map to obtain a regularized term of the student model includes:
regularizing the first feature map and the third feature map according to the following formula:
L = L₁(t-N) + L₂(t-N)
wherein L is a regularization term of the student model, t is the first feature map, and N is the third feature map.
8. A model distillation apparatus for image recognition, comprising:
the acquisition module is used for acquiring a sample image and a teacher model and a student model for image recognition;
the image analysis module is used for analyzing the sample image by utilizing the teacher model to obtain a first feature map, and the first feature map represents global features of the sample image; analyzing the sample image by using the student model to obtain a second feature map, wherein the second feature map characterizes local features of the sample image;
the feature map calculation module is used for calculating the spatial features of the second feature map to obtain a spatial similarity matrix of the second feature map, and the spatial similarity matrix characterizes other spatial features which are not represented by the second feature map;
the feature map weighting module is used for weighting the second feature map and the spatial similarity matrix to obtain a third feature map;
the loss calculation module is used for calculating the loss of the student model according to the first feature map and the third feature map;
the model training module is used for adjusting training parameters of the student model according to the loss of the student model to obtain a trained student model;
and the image recognition module is used for carrying out image recognition by using the trained student model.
9. The model distillation apparatus for image recognition as set forth in claim 8, wherein the dimensions of the second feature map include: the number of channels, the width of the second feature map, the length of the second feature map;
the feature map calculation module includes:
the feature map normalization sub-module is used for normalizing the second feature map based on the channel number of the second feature map to obtain a normalized second feature map;
the feature map dimension merging sub-module is used for carrying out dimension merging on the width and the length of the normalized second feature map to obtain a fourth feature map;
and the feature map transposition sub-module is used for transposing the fourth feature map to obtain a space similarity matrix.
10. The model distillation apparatus for image recognition according to claim 9, wherein the feature map transposition sub-module is specifically configured to:
transpose the fourth feature map according to the following formula:
M=transpose(s_n)*s_n
wherein M is the spatial similarity matrix, and s_n is the fourth feature map.
11. The model distillation apparatus for image recognition as set forth in claim 8, wherein the feature map weighting module includes:
the dimension merging sub-module is used for carrying out dimension merging on the width and the length of the second feature map to obtain a second feature map after the dimension merging;
the feature map weighting sub-module is used for weighting the second feature map after the dimension combination by utilizing the space similarity matrix to obtain a weighted second feature map;
the dimension reconstruction sub-module is configured to perform dimension reconstruction on the weighted second feature map to obtain the third feature map, where the dimension of the third feature map includes: the number of channels, the width of the third profile, the length of the third profile.
12. The model distillation apparatus for image recognition of claim 11, wherein the dimension reconstruction sub-module is specifically configured to:
and carrying out dimension reconstruction on the weighted second feature map according to the following formula:
N=reshape(s*M)
wherein N is the third feature map, s is the second feature map, M is the spatial similarity matrix, and s*M is the weighted second feature map.
13. The model distillation apparatus for image recognition as set forth in claim 8, wherein the loss calculation module includes:
the regular term obtaining sub-module is used for regularizing the first feature map and the third feature map to obtain a regular term of the student model;
and the loss obtaining submodule is used for adding the regular term of the student model into the loss function of the student model to obtain the loss of the student model.
14. The model distillation apparatus for image recognition of claim 13, wherein the regularization term acquisition sub-module is specifically configured to:
regularizing the first feature map and the third feature map according to the following formula:
L = L₁(t-N) + L₂(t-N)
wherein L is a regularization term of the student model, t is the first feature map, and N is the third feature map.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model distillation method for image recognition of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the model distillation method for image recognition according to any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the model distillation method for image recognition according to any of claims 1-7.
CN202111265972.XA 2021-10-28 2021-10-28 Model distillation method and device, electronic equipment and storage medium Active CN113963176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265972.XA CN113963176B (en) 2021-10-28 2021-10-28 Model distillation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113963176A CN113963176A (en) 2022-01-21
CN113963176B (en) 2023-07-07

Family

ID=79468149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265972.XA Active CN113963176B (en) 2021-10-28 2021-10-28 Model distillation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113963176B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549947A (en) * 2022-01-24 2022-05-27 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
CN114282690A (en) * 2022-01-27 2022-04-05 北京百度网讯科技有限公司 Model distillation method, device, equipment and storage medium
CN114445647A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Model training method and device for image processing
CN114677565B (en) * 2022-04-08 2023-05-05 北京百度网讯科技有限公司 Training method and image processing method and device for feature extraction network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255392A (en) * 2018-09-30 2019-01-22 百度在线网络技术(北京)有限公司 Video classification methods, device and equipment based on non local neural network
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN112991173A (en) * 2021-03-12 2021-06-18 西安电子科技大学 Single-frame image super-resolution reconstruction method based on dual-channel feature migration network
CN113011562A (en) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and device
CN113343803A (en) * 2021-05-26 2021-09-03 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
EP3879457A2 (en) * 2020-12-15 2021-09-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for model distillation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053558B2 (en) * 2013-07-26 2015-06-09 Rui Shen Method and system for fusing multiple images
CA3076424A1 (en) * 2019-03-22 2020-09-22 Royal Bank Of Canada System and method for knowledge distillation between neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoqi Jiao; "TinyBERT: Distilling BERT for Natural Language Understanding"; arXiv; pp. 1-12 *

Similar Documents

Publication Publication Date Title
CN113963176B (en) Model distillation method and device, electronic equipment and storage medium
CN113343803A (en) Model training method, device, equipment and storage medium
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN113538235B (en) Training method and device for image processing model, electronic equipment and storage medium
CN114693934B (en) Training method of semantic segmentation model, video semantic segmentation method and device
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
KR20220116395A (en) Method and apparatus for determining pre-training model, electronic device and storage medium
CN115482395A (en) Model training method, image classification method, device, electronic equipment and medium
CN115496970A (en) Training method of image task model, image recognition method and related device
KR20220042315A (en) Method and apparatus for predicting traffic data and electronic device
CN117746125A (en) Training method and device of image processing model and electronic equipment
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium
CN116310356B (en) Training method, target detection method, device and equipment of deep learning model
CN113361621B (en) Method and device for training model
CN114943995A (en) Training method of face recognition model, face recognition method and device
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN115759209A (en) Neural network model quantification method and device, electronic equipment and medium
CN113554550B (en) Training method and device for image processing model, electronic equipment and storage medium
CN112784967B (en) Information processing method and device and electronic equipment
CN115578261A (en) Image processing method, deep learning model training method and device
US11681920B2 (en) Method and apparatus for compressing deep learning model
CN113887435A (en) Face image processing method, device, equipment, storage medium and program product
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN112990046A (en) Difference information acquisition method, related device and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant