WO2021227804A1 - A model training method and related equipment - Google Patents

A model training method and related equipment


Publication number
WO2021227804A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
feature
classification
loss
target
Prior art date
Application number
PCT/CN2021/088787
Other languages
English (en)
French (fr)
Inventor
唐福辉
张晓鹏
钮敏哲
王子辰
韩建华
田奇
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021227804A1
Priority to US17/986,081 (published as US20230075836A1)

Classifications

    • G06F18/214 Pattern recognition — generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Pattern recognition — classification techniques
    • G06N3/04 Neural networks — architecture, e.g. interconnection topology
    • G06N3/045 Neural networks — combinations of networks
    • G06N3/08 Neural networks — learning methods
    • G06T7/00 Image analysis
    • G06V10/30 Image or video recognition — image preprocessing, noise filtering
    • G06V10/40 Image or video recognition — extraction of image or video features
    • G06V10/764 Image or video recognition using machine learning — using classification, e.g. of video objects
    • G06V10/774 Image or video recognition using machine learning — generating sets of training patterns
    • G06V10/82 Image or video recognition using machine learning — using neural networks

Definitions

  • This application relates to the field of image processing, and in particular to a model training method and related equipment.
  • Target detection refers to classifying and localizing target objects in an image. As shown in FIG. 1, target detection can classify and localize the umbrella 101 in the image, and likewise the person 102.
  • Target detection on images is very widely applied, for example in autonomous driving, safe cities, and mobile phone terminals, so there are high requirements on both detection accuracy and speed.
  • Target detection on images is usually implemented with neural networks. However, while a large neural network detects with high accuracy, it is very slow; a small neural network detects quickly but with low accuracy.
  • the embodiments of the present application disclose a model training method and related equipment, which can be applied in artificial intelligence, computer vision and other fields for image detection.
  • the method and related equipment can improve the prediction efficiency and accuracy of the network.
  • an embodiment of the present application provides a model training method, which includes:
  • the second feature information in the target image is extracted through the feature extraction layer of the second network, wherein the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network;
  • the local features of the target object in the feature information extracted by the first network, and the local features of the target object in the feature information extracted by the second network, are highlighted through a Gaussian mask; the feature loss is then determined based on the two networks' local features of the target object, and the second network is trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
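As a rough illustration of the masked feature distillation described above, the sketch below builds a 2D Gaussian mask over each object's box and uses it to weight the squared difference between teacher and student feature maps. This is a minimal NumPy sketch: the function names, the `sigma_scale` parameter, and the exact normalization are illustrative assumptions, not the patent's definitions.

```python
import numpy as np

def gaussian_mask(h, w, box, sigma_scale=0.5):
    """Build a 2D Gaussian mask centred on a bounding box.

    box = (x1, y1, x2, y2) in feature-map coordinates. Pixels near the box
    centre get weights close to 1; pixels far from it decay toward 0, which
    suppresses background noise both outside and inside the box.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))

def feature_loss(teacher_feat, student_feat, boxes):
    """Masked L2 distance between teacher and student feature maps.

    teacher_feat, student_feat: arrays of shape (C, H, W), same size
    (in practice the student map would first be resized/adapted to match).
    """
    _, h, w = teacher_feat.shape
    mask = np.zeros((h, w))
    for box in boxes:                       # combine the masks of all objects
        mask = np.maximum(mask, gaussian_mask(h, w, box))
    diff = (teacher_feat - student_feat) ** 2
    return float((diff * mask).sum() / (mask.sum() * teacher_feat.shape[0] + 1e-6))
```

The maximum over per-object masks keeps every object highlighted when boxes overlap, and the normalization by the mask's total weight keeps the loss comparable across images with differently sized objects.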
  • the method further includes:
  • training the second network according to the feature loss to obtain the target network includes:
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
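One common way to turn the difference between the two networks' prediction values into a trainable classification loss is a KL divergence on temperature-softened class distributions. The sketch below assumes that formulation; the patent's own loss expression may differ.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-softened softmax along the last axis."""
    z = np.asarray(z, dtype=float) / t
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classification_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL divergence between teacher and student class distributions.

    One row of logits per region proposal; because both networks score the
    SAME proposals, the rows are aligned one-to-one.
    """
    p = softmax(teacher_logits, temperature)   # first (teacher) prediction values
    q = softmax(student_logits, temperature)   # second (student) prediction values
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

The loss is zero exactly when the student reproduces the teacher's distribution on every shared proposal, which is the sense in which training on it pulls the student's classification results toward the teacher's.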
  • training the second network according to the feature loss to obtain the target network includes:
  • the second network after training is trained by a third network to obtain a target network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is used to further train the trained second network, which can stably improve the performance of the second network.
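The two-stage idea (first distill from the first network, then refine with a deeper third network) can be illustrated with a deliberately tiny toy, where each "network" is a single scalar weight and distillation is plain gradient descent on the squared output gap. `Linear`, `distill`, and all hyperparameters here are illustrative stand-ins, not the patent's training procedure.

```python
# Toy progressive distillation: a deeper teacher is modelled simply as a
# teacher with a better weight. Stage 1 distills the student from the first
# network; stage 2 refines it with the third, deeper network.
class Linear:
    def __init__(self, w):
        self.w = w

    def predict(self, x):
        return self.w * x

def distill(teacher, student, data, steps=200, lr=0.05):
    """Gradient descent on 0.5 * (student(x) - teacher(x))**2 per sample."""
    for _ in range(steps):
        for x in data:
            err = student.predict(x) - teacher.predict(x)
            student.w -= lr * err * x   # d/dw of the squared output gap
    return student

data = [0.5, 1.0, 1.5]
student = Linear(0.0)
student = distill(Linear(2.0), student, data)  # stage 1: first network
student = distill(Linear(3.0), student, data)  # stage 2: third (deeper) network
```

After stage 1 the student tracks the first teacher; stage 2 then moves it the rest of the way toward the deeper teacher, which is the stabilizing effect the passage above attributes to iterative training.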
  • an embodiment of the present application provides a model training method, which includes:
  • wherein the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network.
  • the third network with more layers is used to further train the trained second network, which can stably improve the performance of the second network.
  • the training of the second network based on the first network includes:
  • a Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined from the two networks' local features of the target object, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the method further includes:
  • training the second network according to the feature loss to obtain the intermediate network includes:
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the same region proposal set.
  • the RPN is shared by the first network with the second network, or shared by the second network with the first network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship, where:
  • K is the total number of region proposals in the region proposal set;
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set;
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set;
  • the remaining factor is a preset weight balance factor.
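Since the exact expression of L_cls is not reproduced in this extract, the sketch below shows one plausible instantiation using the listed quantities: a cross-entropy term averaged over all K proposals plus a `lam`-weighted term over the N_p normal (target-object) proposals. The function and the role of `lam` are assumptions for illustration only.

```python
import numpy as np

def weighted_cls_loss(probs, labels, normal_mask, lam=1.0):
    """Cross-entropy over all K region proposals plus a lam-weighted term
    over the N_p normal (target-object) proposals.

    probs:       (K, num_classes) predicted class probabilities
    labels:      (K,) ground-truth labels y_m
    normal_mask: (K,) boolean mask marking the normal proposals
    lam:         preset weight balance factor (illustrative)
    """
    k = len(labels)
    ce = -np.log(probs[np.arange(k), labels] + 1e-12)  # per-proposal cross-entropy
    n_p = max(int(normal_mask.sum()), 1)               # guard against N_p == 0
    return float(ce.mean() + lam * ce[normal_mask].sum() / n_p)
```

The balance factor controls how much extra weight the target-object proposals receive relative to the average over all proposals.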
  • the method further includes:
  • Training the second network according to the feature loss and the classification loss to obtain a target network includes:
  • after training the second network according to the feature loss to obtain the target network, the method further includes:
  • the target network is sent to the model using device, where the target network is used to predict the content in the image.
  • an embodiment of the present application provides an image detection method, which includes:
  • the target network is a network obtained by training a second network through a first network;
  • the parameters used to train the second network through the first network include a feature loss;
  • the feature loss is determined based on a first local feature and a second local feature;
  • the first local feature is a feature of the target object extracted from the first feature information through a Gaussian mask;
  • the second local feature is a feature of the target object extracted from the second feature information through a Gaussian mask, where the first feature information is the feature information in the target image extracted by the feature extraction layer of the first network, and the second feature information is the feature information in the target image extracted by the feature extraction layer of the second network; the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network.
  • the local features of the target object in the feature information extracted by the first network, and the local features of the target object in the feature information extracted by the second network, are highlighted through a Gaussian mask; the feature loss is then determined based on the two networks' local features of the target object, and the second network is trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used to train the second network further include a classification loss, where the classification loss is determined based on a first classification prediction value and a second classification prediction value; the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the first network, and the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the target network is specifically obtained by training the second network through the first network, and then further training the trained network through a third network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is used to further train the trained second network, which can stably improve the performance of the second network.
  • an embodiment of the present application provides an image detection method, which includes:
  • the target network is a network obtained by training a second network through multiple network iterations
  • the multiple networks are all classification networks
  • the multiple networks include at least a first network and a third network
  • the third network is used to train the intermediate network obtained after the first network trains the second network, wherein the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network;
  • the third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the parameters used when the first network trains the second network include a feature loss, where the feature loss is determined based on a first local feature and a second local feature; the first local feature is a feature of the target object extracted from the first feature information through a Gaussian mask, and the second local feature is a feature of the target object extracted from the second feature information through a Gaussian mask;
  • the first feature information is the feature information in the target image extracted by the feature extraction layer of the first network;
  • the second feature information is the feature information in the target image extracted by the feature extraction layer of the second network.
  • a Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined from the two networks' local features of the target object, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used when the first network trains the second network include classification loss.
  • the classification loss is determined according to the first classification prediction value and the second classification prediction value;
  • the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the first network;
  • the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the same region proposal set.
  • the RPN is shared by the first network with the second network, or shared by the second network with the first network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship, where:
  • K is the total number of region proposals in the region proposal set;
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set;
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set;
  • the remaining factor is a preset weight balance factor.
  • the parameters used to train the second network also include the regression loss and the RPN loss of the second network, where the regression loss and the RPN loss of the second network are determined according to the ground-truth labels of the region proposals in the target image and the prediction values of the second network for the region proposals in the target image.
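The student's own detection losses can be combined with the distillation losses as a weighted sum. The sketch below uses a standard smooth-L1 regression loss as a stand-in for the box regression term; the weights and the exact combination rule are assumptions rather than the patent's recipe.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, a standard choice for box regression:
    quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def total_loss(feat_loss, cls_loss, reg_loss, rpn_loss, w_feat=1.0, w_cls=1.0):
    """Weighted sum of the distillation losses (feature, classification) and
    the student's own detection losses (regression, RPN)."""
    return w_feat * feat_loss + w_cls * cls_loss + reg_loss + rpn_loss
```

The distillation terms pull the student toward the teacher, while the regression and RPN terms keep it anchored to the ground-truth labels; the weights trade off the two influences.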
  • the acquiring of the target network includes:
  • an embodiment of the present application provides a model training device, which includes:
  • the feature extraction unit is configured to extract the first feature information in the target image through the feature extraction layer of the first network
  • the feature extraction unit is further configured to extract the second feature information in the target image through the feature extraction layer of the second network, wherein the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network;
  • the first optimization unit is configured to extract the features of the target object in the first feature information through a Gaussian mask to obtain the first local feature;
  • a second optimization unit configured to extract the feature of the target object in the second feature information through a Gaussian mask to obtain a second local feature
  • a first determining unit configured to determine a feature loss through the first local feature and the second local feature
  • the weight adjustment unit is configured to train the second network according to the feature loss to obtain a target network.
  • the local features of the target object in the feature information extracted by the first network, and the local features of the target object in the feature information extracted by the second network, are highlighted through a Gaussian mask; the feature loss is then determined based on the two networks' local features of the target object, and the second network is trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the apparatus further includes:
  • a first generating unit, configured to generate, through the classification layer of the first network, the first classification prediction value of the target region proposal in the region proposal set;
  • a second generating unit, configured to generate, through the classification layer of the second network, the second classification prediction value of the target region proposal in the region proposal set;
  • a second determining unit, configured to determine a classification loss according to the first classification prediction value and the second classification prediction value;
  • the weight adjustment unit is specifically configured to train the second network according to the feature loss and the classification loss to obtain a target network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • in terms of training the second network according to the feature loss to obtain a target network, the weight adjustment unit is specifically used for:
  • the second network after training is trained by a third network to obtain a target network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is used to further train the trained second network, which can stably improve the performance of the second network.
  • an embodiment of the present application provides a model training device, which includes:
  • the first training unit is used to train the second network based on the first network to obtain the intermediate network
  • the second training unit is configured to train the intermediate network based on a third network to obtain a target network, wherein the first network, the second network, and the third network are all classification networks, the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network.
  • the third network with more layers is used to further train the trained second network, which can stably improve the performance of the second network.
  • the training of the second network based on the first network to obtain the intermediate network includes:
  • a Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined from the two networks' local features of the target object, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (including the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis can better reflect the gap between the second network and the first network. Training the second network based on this feature loss therefore makes the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the apparatus further includes:
  • the first generating unit is configured to generate, through the classification layer of the first network, the first classification prediction value of the target region proposal in the region proposal set;
  • a second generating unit, configured to generate, through the classification layer of the second network, the second classification prediction value of the target region proposal in the region proposal set;
  • a second determining unit, configured to determine a classification loss according to the first classification prediction value and the second classification prediction value;
  • the training of the second network according to the feature loss to obtain the intermediate network is specifically:
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal.
  • for the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. The embodiment of the present application therefore determines, from the difference between the first prediction value and the second prediction value, a classification loss for training the second network; in this way, the loss of the second network relative to the first network can be captured to the greatest extent, so training the second network based on the classification loss can bring the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the same region proposal set.
  • the RPN is shared by the first network with the second network, or shared by the second network with the first network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship, where:
  • K is the total number of region proposals in the region proposal set;
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set;
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set;
  • the remaining factor is a preset weight balance factor.
  • the apparatus further includes:
  • the third determining unit is configured to determine the regression loss and the RPN loss of the second network according to the ground-truth labels of the region proposals in the target image and the prediction values predicted by the second network for the region proposals in the target image.
  • the weight adjustment unit is specifically configured to train the second network according to the feature loss, the classification loss, the regression loss, and the RPN loss to obtain a target network.
  • the sending unit is configured to send the target network to the model using device after the weight adjustment unit trains the second network according to the feature loss to obtain the target network, where the target network is used to predict the content in the image .
  • an embodiment of the present application provides an image detection device, which includes:
  • the acquiring unit is configured to acquire a target network, where the target network is a network obtained by training a second network through the first network, and the parameters used for training the second network through the first network include feature loss ,
  • the feature loss is determined based on a first local feature and a second local feature
  • the first local feature is a feature about the target object extracted from the first feature information through a Gaussian mask
  • the second local feature is a feature about the target object extracted from the second feature information through a Gaussian mask
  • the first feature information is feature information in the target image extracted through the feature extraction layer of the first network
  • the second feature information is the feature information in the target image extracted through the feature extraction layer of the second network
  • the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network;
  • the recognition unit is used for recognizing the content in the image through the target network.
  • the Gaussian mask highlights the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined based on the two networks' local features of the target object, and the second network is trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network. Therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used to train the second network further include a classification loss, where the classification loss is determined based on a first classification prediction value and a second classification prediction value; the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set generated through the classification layer of the first network, and the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set generated through the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal. For the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first prediction value and the second prediction value, the classification loss used for training the second network; in this way, the loss of the second network relative to the first network can be reflected to the greatest extent, so training the second network based on the classification loss can make the classification result of the second network closer to that of the first network, and the model distillation effect is good.
  • the target network is obtained by first training the second network through the first network, and then further training the trained network through a third network, where the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • an image detection device which includes:
  • the acquiring unit is configured to acquire a target network, where the target network is a network obtained by iteratively training a second network through multiple networks; the multiple networks are all classification networks and include at least a first network and a third network; the third network is used to train the intermediate network obtained after the first network trains the second network, where the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network;
  • the recognition unit is used for recognizing the content in the image through the target network.
  • the third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the parameters used when the first network trains the second network include a feature loss, where the feature loss is determined based on a first local feature and a second local feature; the first local feature is a feature about the target object extracted from the first feature information through a Gaussian mask, and the second local feature is a feature about the target object extracted from the second feature information through a Gaussian mask.
  • the first feature information is the feature information in the target image extracted through the feature extraction layer of the first network, and the second feature information is the feature information in the target image extracted through the feature extraction layer of the second network.
  • the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined according to the two networks' local features of the target object, and the second network is subsequently trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network. Therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used when the first network trains the second network include a classification loss, where the classification loss is determined according to a first classification prediction value and a second classification prediction value; the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set generated by the classification layer of the first network, and the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal. For the same region proposal, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first prediction value and the second prediction value, the classification loss used for training the second network; in this way, the loss of the second network relative to the first network can be reflected to the greatest extent, so training the second network based on the classification loss can make the classification result of the second network closer to that of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN) so that both the first network and the second network have the region proposal set.
  • the RPN is shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the positive region proposals belonging to the target object in the region proposal set.
  • the classification loss L cls satisfies the following relationship:
  • K is the total number of area proposals in the area proposal set
  • N p is the total number of positive region proposals belonging to the target object in the region proposal set
  • y m is the truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the parameters used for training the second network also include the regression loss and RPN loss of the second network, where the regression loss and RPN loss of the second network are determined based on the ground-truth labels of the region proposals in the target image and the predicted values predicted by the second network for the region proposals in the target image.
  • the acquiring unit is specifically configured to:
  • an embodiment of the present application provides a model training device.
  • the model training device includes a memory and a processor; the memory is used to store a computer program, and the processor is used to call the computer program to implement the method described in the first aspect, the second aspect, or any possible implementation manner of the first or second aspect.
  • an embodiment of the present application provides a model using device; the model using device includes a memory and a processor, the memory is used to store a computer program, and the processor is used to call the computer program to implement the method described in the third aspect, the fourth aspect, or any possible implementation manner of the third or fourth aspect.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and when the computer program runs on a processor, the method described in the first aspect, the second aspect, the third aspect, the fourth aspect, or any possible implementation manner of any one of them is implemented.
  • FIG. 1 is a schematic diagram of an image detection scene provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a model distillation scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another model distillation scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of yet another model distillation scenario provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of yet another model distillation scenario provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a model training architecture provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another model distillation scenario provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another image detection scene provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another image detection scene provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the principle of a Gaussian mask provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of yet another model distillation scenario provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another model training device provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an image detection device provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of another image detection device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of yet another model training device provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a model using device provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of another model using device provided by an embodiment of the present application.
  • the feature information of the image is extracted through the feature extraction layer of the large neural network and then sent to the classification layer of the large neural network to generate classification soft labels; likewise, the feature information of the image is extracted through the feature extraction layer of the small neural network and then sent to the classification layer of the small neural network to generate classification soft labels. The classification loss is then determined from the soft labels generated by the two networks and is used to guide the training of the classification layer of the small neural network.
  • the guidance of the large neural network to the small neural network is not comprehensive enough (such as ignoring the guidance of the feature extraction layer), and it is not refined enough, so the effect after the guidance is not ideal.
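The classic soft-label guidance described above can be sketched as follows. This is only an illustration: the temperature parameter T is a conventional detail of soft-label distillation and is assumed here, not stated in the text.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax used to produce "soft labels"
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_classification_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the large network's (teacher's) soft labels and
    the small network's (student's) predictions on the same input."""
    t = softmax(teacher_logits, T)
    s = softmax(student_logits, T)
    return -sum(ti * math.log(max(si, 1e-12)) for ti, si in zip(t, s))
```

A student that matches the teacher's logits incurs a smaller loss than a mismatched one; note that only the classification layer is guided here, which is exactly the shortcoming the text points out.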
  • the information of the feature extraction layer of the large neural network is transferred to the small neural network, and then the small neural network carrying the information of the feature extraction layer of the large neural network is compressed to obtain a new, thinner network.
  • W T in Figure 4 is the model weight of the large neural network (teacher network), and W S in Figure 4 is the model weight of the small neural network (student network); in both networks, different layers can have different weights.
  • a new small neural network needs to be constructed, and the number of layers of the constructed new small neural network becomes more, which has a certain degree of complexity.
  • the features extracted through the feature extraction layer are the features of the entire image and contain a lot of background noise, which leads to unsatisfactory detection results.
  • the area near the target area in the image is selected as the distillation area, and the small neural network is allowed to learn the feature extraction layer expression of the large neural network in the distillation area.
  • the expression of the learned feature extraction layer is still not ideal; and the small neural network only learns the expression of the feature extraction layer, so the performance improvement is limited.
  • the following embodiments of the present application further provide related architectures, equipment, and methods to further improve the effect of the large neural network guiding small neural network training.
  • Figure 6 is a schematic diagram of a model training architecture provided by an embodiment of the present application.
  • the architecture includes a model training device 601 and one or more model use devices 602.
  • the model training device 601 and the model using device 602 communicate in a wired or wireless manner, so the model training device 601 can send the trained model (or network) for predicting the target object in an image to the model using device 602; accordingly, the model using device 602 predicts the target object in the image to be predicted through the received model.
  • the model using device 602 may feed back the result predicted based on the model to the model training device 601, so that the model training device 601 can further train the model based on the predicted result of the model using device 602; retrain The good model can be sent to the model using device 602 to update the original model.
  • the model training device 601 may be a device with strong computing capabilities, for example, a server, or a server cluster composed of multiple servers.
  • the model training device 601 can include many neural networks.
  • a neural network with more layers can be called a large neural network compared with a neural network with fewer layers, and a neural network with fewer layers can be called a small neural network compared with a neural network with more layers; that is, the depth of the first network is greater than the depth of the second network.
  • the model training device 601 includes a first network 701 and a second network 702.
  • the first network 701 may be a neural network larger than the second network 702.
  • Both the first network 701 and the second network 702 include a feature extraction layer (also called a feature layer) and a classification layer (also called a classifier, or a classification head).
  • the feature extraction layer is used to extract feature information in the image, and the classification layer is used to classify the target objects in the image based on the extracted feature information.
  • the first network 701 can be used as a teacher network
  • the second network 702 can be used as a student network
  • the first network 701 guides the training of the second network 702. This process can be referred to as distillation.
  • the idea of guiding the second network 702 by the first network 701 includes the following three technical points:
  • the first network 701 and the second network 702 select the same region proposal set, for example, by sharing a region proposal network (Region proposal network, RPN), so that both the first network and the second network have a region proposal set. Therefore, the first network 701 and the second network 702 can generate soft labels based on the same area proposal, and then obtain the binary cross entropy loss (BCEloss) based on the soft labels generated by the first network 701 and the soft labels generated by the second network 702, and then The training of the classification layer of the second network 702 is guided by the binary cross entropy loss (BCEloss).
  • first network 701 is a 101-layer (res101) neural network
  • second network 702 is a 50-layer (res50) neural network
  • based on the first and/or second technical point above, the second network 702 is trained through the first network 701 to obtain a target neural network, which can be denoted as res101-50.
  • the target neural network is further trained through a third neural network; the principle of this training is the same as that of training the second network 702 through the first network 701, and will not be repeated here.
  • the third neural network here is a neural network larger than the first network 701, for example, the third neural network is a 152-layer (res152) neural network.
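The iterative chain described above (res101 teaches res50, then the larger res152 further teaches the result) can be sketched with stand-in callables. This is purely structural: `distill` here just blends outputs so the chaining is visible; a real round would minimise the feature/classification losses described in this document.

```python
def distill(teacher, student):
    """Stand-in for one round of teacher-to-student distillation.
    The "trained" student simply moves halfway toward the teacher's
    output, which suffices to illustrate the chain."""
    return lambda x: 0.5 * (teacher(x) + student(x))

# Iterative chain from the text: res101 teaches res50 (giving "res101-50"),
# then res152 (deeper than res101) further teaches that result.
res50 = lambda x: 0.0
res101 = lambda x: 1.0
res152 = lambda x: 2.0

res101_50 = distill(res101, res50)
final_net = distill(res152, res101_50)
```

Each round pulls the student closer to a progressively deeper teacher, which is the mechanism claimed to stably improve the second network's performance.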
  • the model using device 602 is a device that needs to recognize (or detect) images, such as handheld devices (e.g., mobile phones, tablets, palmtop computers), vehicle-mounted devices (e.g., cars, bicycles, electric cars, airplanes, ships), wearable devices (e.g., smart watches such as iWatch, smart bracelets, pedometers), smart home equipment (e.g., refrigerators, TVs, air conditioners, electric meters), smart robots, workshop equipment, and so on.
  • the following takes the model using device 602 being a car and a mobile phone as examples for illustration.
  • Autonomous driving or computer driving of automobiles is a very popular topic at present.
  • the number of cars in the world continues to increase, and road congestion and driving accidents have caused great losses to personal and social property.
  • Human factors are the main cause of traffic accidents; reducing human errors, intelligent obstacle avoidance, and reasonable planning are important topics for improving driving safety.
  • the emergence of autonomous driving has made all of this possible: the car can perceive the surrounding environment and navigate without human operation.
  • major companies around the world have begun to pay attention to and develop autonomous driving systems, such as Google, Tesla, and Baidu.
  • Autonomous driving technology has become a strategic commanding height for countries to compete for.
  • the visual perception system on the car acts as the human eye (ie, computer vision).
  • the visual perception system uses a detection network (for example, the first network, the second network, the third network, or the fourth network mentioned in the subsequent method embodiments; the detection network can also be called a classification network, a detection model, or a detection module) to automatically detect the images collected by the camera, so as to determine the objects around the car and their locations (that is, to detect the target object in the image).
  • a car recognizes objects in an image through the detection network. If someone is found in the image and is close to the car, the car can be controlled to slow down or stop to avoid casualties; if there are other cars in the image, the speed of the car can be properly controlled to avoid a rear-end collision; if an object in the image is found to be approaching the car quickly, the car can be controlled to avoid it by shifting or changing lanes.
  • the car recognizes the objects in the image through the detection network. If there are traffic lines on the road (such as double yellow lines, single yellow lines, or lane dividing lines), the driving state of the car can be predicted; if it is predicted that the car may cross a line, the car can be controlled accordingly to avoid crossing it. It is also possible, having identified a lane dividing line and its position, to decide how to change lanes based on this information; the handling of the remaining traffic lines can be deduced by analogy.
  • the car recognizes objects in the image through the detection network, and then uses the recognized objects as references to measure the car's driving speed, acceleration, turning angle, and other information.
  • the mobile phone was originally used as a communication tool to facilitate people's communication. With the development of the global economy and the improvement of people's quality of life, everyone's expectations for mobile phone experience and performance keep rising. In addition to entertainment, navigation, shopping, and photography, detection and recognition functions have also received a lot of attention. At present, the technology of detecting the target object in an image has been applied in many mobile apps, including Meitu Xiuxiu, Moman Camera, Shenpai, Camera360, Alipay, and so on. Developers only need to call the authorized mobile SDK package for face detection, face key point detection, and face analysis to automatically identify the face identity in photos and videos (that is, to detect the target object in the image).
  • the mobile phone locates and recognizes the objects in the picture through the detection network (for example, the target network mentioned in the subsequent method embodiments; the detection network may also be called a detection model, a detection module, or a detection and recognition system) to find the face (and its position), and at the same time judges whether the face in the image is the face of a specific person (that is, computer vision). If so, the next step, such as mobile payment or face login, can proceed.
  • Devices such as cars, mobile phones, etc. perceive the surrounding environmental information through the computer's perspective, and then perform corresponding intelligent control based on the environmental information.
  • the intelligent control based on the computer's perspective is an implementation of artificial intelligence.
  • FIG. 10 is a schematic flowchart of a model training method provided by an embodiment of the present application. The method can be implemented based on the model training system shown in FIG. 6. The method includes but is not limited to the following steps:
  • Step S1001 The model training device extracts the feature information in the target image through the feature extraction layer of the first network and the feature extraction layer of the second network respectively.
  • the first network is used as a teacher network
  • the second network is used as a student network.
  • the feature extraction layer of the first network and the feature extraction layer of the second network extract feature information from the same image, which can be called the target image.
  • the number of layers of the first network is greater than that of the second network.
  • the first network may be a 101-layer (res101) neural network
  • the second network may be a 50-layer (res50) neural network.
  • the feature information in the embodiments of the present application can be represented by vectors or other machine-recognizable ways.
  • Step S1002 The model training device highlights the feature of the target object in the first feature information through a Gaussian mask.
  • the inventor of the present application found that, in the process of detecting (i.e., recognizing) target objects based on neural networks, the gain in detection performance largely derives from the feature information extracted by the feature extraction layer (backbone), so imitating the feature extraction layer is an important part of model training.
  • the introduction of the Gaussian mask is actually a process of highlighting the features of the target object in the target image and suppressing the features of the background outside the target object; it is also a process of highlighting the response to the target object and weakening the edge information.
  • Gaussian masks can not only suppress the features of the background outside the rectangular frame where the target object is located (generally the smallest rectangular frame that can enclose the target object), but also suppress the features of the background inside the rectangular frame other than the target object, thus highlighting the features of the target object to the maximum extent.
  • the Gaussian mask can highlight the characters in the target image (that is, the target object), and weaken the features of the background other than the characters.
  • in the model data of the Gaussian mask, the feature information corresponding to the target object forms a large protrusion in the coordinate system, while the height of the background information in the coordinates is close to zero, approximately lying on a plane of height zero.
  • (x, y) is the coordinates of the pixel in the target image
  • B is a positive region proposal of the target object in the target image
  • the geometric specification of the positive region proposal B is w ⁇ h
  • the coordinates of the center point of the positive example area proposal B are (x 0 , y 0 )
  • this Gaussian mask is only valid for the target truth box, and all the background outside the box is filtered out.
  • the Gaussian mask value M(x, y) of the pixel (x, y) can be expressed by formula 1-2: M(x, y) = max{M 1 (x, y), M 2 (x, y), …, M Np (x, y)}
  • N p is the number of positive region proposals for the target object in the target image
  • M 1 (x, y) is the value of the Gaussian mask of the first positive region proposal at the point (x, y), M 2 (x, y) is the value of the Gaussian mask of the second positive region proposal at the point (x, y), and so on up to the N p -th positive region proposal. It can be seen from formula 1-2 that the Gaussian mask value of a pixel (x, y) covered by multiple positive region proposals takes the maximum among the multiple values.
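The mask construction above can be sketched as follows. Formula 1-1 is not reproduced in this text, so the per-box form below (a 2-D Gaussian centred on the box centre with spreads tied to the box size, zero outside the box) is an assumption for illustration; the max-combination follows formula 1-2 as described.

```python
import math

def gaussian_mask_single(x, y, box):
    """Assumed per-box mask (stand-in for formula 1-1): a 2-D Gaussian
    centred at the box centre (x0, y0), spreads tied to the box size
    w x h, and zero everywhere outside the box, since the mask is only
    valid inside the truth box and background outside is filtered out."""
    x0, y0, w, h = box
    if abs(x - x0) > w / 2.0 or abs(y - y0) > h / 2.0:
        return 0.0
    return math.exp(-((x - x0) ** 2 / (w / 2.0) ** 2
                      + (y - y0) ** 2 / (h / 2.0) ** 2))

def gaussian_mask(x, y, boxes):
    """Formula 1-2: over the N_p positive region proposals, a pixel's mask
    value is the maximum of the per-proposal mask values."""
    return max(gaussian_mask_single(x, y, b) for b in boxes)
```

The mask peaks at 1 at each box centre, decays toward the box edges, and is exactly 0 on the background, matching the "large protrusion over a zero plane" picture described earlier.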
  • the feature information in the target image extracted through the feature extraction layer of the first network is called the first feature information, and the feature about the target object in the first feature information highlighted by the Gaussian mask is called the first local feature.
  • Step S1003 The model training device highlights the feature of the target object in the second feature information through a Gaussian mask.
  • the feature information in the target image extracted through the feature extraction layer of the second network is called the second feature information, and the feature about the target object in the second feature information highlighted by the Gaussian mask is called the second local feature.
  • Step S1004 The model training device determines a feature loss according to the first local feature and the second local feature.
  • the above-mentioned first local feature is the feature of the target object in the target image obtained by the first network
  • the second local feature is the feature of the target object in the target image obtained by the second network.
  • the difference between the first local feature and the second local feature can reflect the difference between the feature extraction layer of the first network and the feature extraction layer of the second network.
  • the feature loss (also called distillation loss) in the embodiments of the present application can reflect the difference between the second local feature and the first local feature.
  • the first local feature can be expressed as The second local feature can be expressed as
  • the feature loss L b can be calculated by formula 1-3:
  • A is introduced to achieve the normalization operation;
  • the specification of the target image is WxH
  • M ij represents the value of the Gaussian mask at a pixel (i, j) in the target image, where i takes values from 1 to W and j takes values from 1 to H
  • the Gaussian mask value Mij of any pixel (i, j) in the target image can be calculated by the above formula 1-1 and formula 1-2, and will not be repeated here.
  • in formula 1-3, the two feature terms represent the feature of the pixel (i, j) extracted by the second network and the feature of the pixel (i, j) extracted by the first network, respectively
  • C represents the number of channels of the feature map when the first and second networks extract feature information in the target image.
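Formula 1-3 itself is not reproduced in this text, but its ingredients are all described: the mask M ij, the two networks' per-pixel features over C channels, and a normalisation term A. The following sketch assumes a mask-weighted squared difference normalised by the sum of mask values; the exact form of A and of the loss is an assumption.

```python
def feature_loss(student_feat, teacher_feat, mask):
    """Illustrative masked feature (distillation) loss, assumed form:
    squared difference between the two networks' features at each pixel,
    summed over the C channels, weighted by the Gaussian mask M_ij, and
    normalised by A = sum of all mask values."""
    H, W = len(mask), len(mask[0])
    A = sum(sum(row) for row in mask) or 1.0  # normalisation term
    loss = 0.0
    for i in range(H):
        for j in range(W):
            # student_feat[i][j], teacher_feat[i][j]: C-dimensional vectors
            loss += mask[i][j] * sum(
                (s - t) ** 2
                for s, t in zip(student_feat[i][j], teacher_feat[i][j]))
    return loss / A
```

Because the mask is zero on the background, feature differences there contribute nothing: the loss measures only how far the student's target-object features are from the teacher's, which is the filtering effect the text attributes to the Gaussian mask.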
  • Step S1005 The model training device generates the proposed classification prediction value of the target area in the area proposal set through the classification layer of the first network.
  • the first network and the second network share the RPN, so that both the first network and the second network have the area proposal set.
  • for example, the shared RPN may output 2000 area proposals, and the area proposal set includes 512 of the 2000 area proposals.
  • the same detectors can be configured in the first network and the second network, so that the first network and the second network can extract the same 512 area proposals from the 2000 shared area proposals, that is, the area proposal set.
  • FIG. 12 illustrates a schematic flow diagram of the first network and the second network to select the same area proposal by sharing the RPN and configuring the same detector, and all the selected area proposals are collectively referred to as the area proposal set.
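The flow of FIG. 12 — both networks drawing an identical proposal set from the shared RPN output — can be sketched as a deterministic selection rule applied by identically configured detectors. The top-k-by-score rule below is an assumed stand-in for the detectors' actual selection logic; what matters is only that the rule is deterministic, so two identically configured detectors produce the same set:

```python
def select_shared_proposals(shared_proposals, k=512):
    """Deterministically pick the same k proposals for both networks.

    `shared_proposals` is the list output by the shared RPN, e.g. 2000
    (score, box) pairs. Because both detectors apply the same rule to the
    same shared list, the first and second networks end up with an
    identical area proposal set.
    """
    ranked = sorted(shared_proposals, key=lambda p: p[0], reverse=True)
    return ranked[:k]
```

Calling this with the same shared RPN output in both networks yields byte-identical proposal sets, which is the precondition for comparing their classification prediction values later.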
  • the RPN is the RPN shared by the first network and the second network, and may be shared by the second network to the first network, or shared by the first network to the second network, or shared in other ways .
  • the target area proposal is all the area proposals in the area proposal set, that is, the positive example area proposal of the target object and the negative example area proposal of the target object are included.
  • which region proposals in the region proposal set are positive example region proposals and which are negative example region proposals can be pre-marked by humans or automatically marked by a machine; the usual division criterion is: if the coincidence degree between a region proposal and the rectangular frame where the target object is located (usually the smallest rectangular frame enclosing the target object) exceeds a set reference threshold (for example, 50%, or another value), the region proposal is classified as a positive example region proposal of the target object; otherwise, it is classified as a negative example region proposal of the target object.
  • alternatively, the target area proposal is a positive example area proposal belonging to the target object in the area proposal set.
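The positive/negative division criterion described above can be sketched as follows; using intersection-over-union as the coincidence measure and the helper names below are illustrative assumptions consistent with the stated threshold rule:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_box, threshold=0.5):
    """Mark a proposal positive if its overlap with the target object's
    (smallest) enclosing rectangle exceeds the reference threshold."""
    return "positive" if iou(proposal, gt_box) > threshold else "negative"
```

With the example threshold of 50% from the text, a proposal coinciding exactly with the target's box is labelled positive, while a disjoint proposal is labelled negative.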
  • to facilitate subsequent description, the classification prediction value of the target region proposal in the region proposal set generated by the classification layer of the first network is called the first classification prediction value.
  • Step S1006 The model training device generates the proposed classification prediction value of the target region in the region proposal set through the classification layer of the second network.
  • to facilitate subsequent description, the classification prediction value of the target region proposal in the region proposal set generated by the classification layer of the second network is called the second classification prediction value.
  • for example, the classification prediction value of region proposal 1 generated by the classification layer of the first network indicates that the probability that the object in region proposal 1 is classified as a person is 0.8, the probability of being classified as a tree is 0.3, and the probability of being classified as a car is 0.1.
  • the classification prediction values obtained by the classification layers of different networks for the same region proposal may be different, because the model parameters of different networks are generally different, and thus their predictive capabilities are generally different.
  • Step S1007 The model training device determines the classification loss according to the first classification prediction value and the second classification prediction value.
  • the embodiment of the application selects the same set of region proposals, so that the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposals.
  • since the predictions are based on the same region proposals, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, based on the difference between the first classification prediction value and the second classification prediction value, the classification loss for training the second network; this in effect uses the first classification prediction value as a soft label to determine the classification loss of the classification layer of the second network. In this way, the loss of the second network relative to the first network can be maximized, so the model training effect is better.
  • the classification loss satisfies the relationship shown in formula 1-4:
  • K is the total number of area proposals in the area proposal set
  • N_p is the total number of positive example area proposals belonging to the target object in the area proposal set
  • y_m is the truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
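Formula 1-4 itself is not reproduced in this text. As a hedged sketch of the soft-label idea described above (the first network's prediction values serving as soft labels for the second network's classification layer), one common form is a cross-entropy of the student's predictions against the teacher's; the function name and the averaging over K proposals are illustrative assumptions, not the patent's exact formula:

```python
import math

def soft_label_classification_loss(student_probs, teacher_probs):
    """Classification (distillation) loss sketch.

    Each element is a per-class probability vector for one region proposal
    in the shared proposal set; the teacher (first network) vectors act as
    soft labels for the student (second network).
    """
    K = len(student_probs)
    loss = 0.0
    for t, s in zip(teacher_probs, student_probs):
        # Cross-entropy against the teacher's soft label; clamp avoids log(0).
        loss += -sum(ti * math.log(max(si, 1e-12)) for ti, si in zip(t, s))
    return loss / K
```

The loss vanishes only when the student reproduces the teacher's distribution, so minimising it pulls the second network's classification results toward the first network's.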
  • Step S1008 The model training device trains the second network according to the feature loss and the classification loss to obtain a target network.
  • the meaning of "training the second network according to the feature loss and the classification loss to obtain the target network" in the embodiment of this application is: the target network is obtained by training the second network, and the parameters used in the training process include but are not limited to the feature loss and the classification loss; that is, parameters other than these two may also be used.
  • moreover, the training process may use only the feature loss and the classification loss obtained based on the first network (i.e., distilling the second network based on the first network to obtain the target network), or it may use not only the feature loss and the classification loss obtained based on the first network but also information obtained based on one or more other networks (i.e., distilling the second network based on the first network and the other networks to obtain the target network).
  • the total loss L may be determined based on the feature loss L_b and the classification loss L_cls, and then the second network is trained with the total loss L.
  • a part of the model parameters in the second network (for example, the model parameters of the feature extraction layer) can be trained through feature loss, and another part of the model parameters in the second network (for example, the model parameters of the classification layer) can be trained through the classification loss.
  • is a preset or pre-trained weight balance factor.
  • Case 2: determine the regression loss and the RPN loss of the second network according to the truth labels of the region proposals in the target image and the prediction values predicted by the second network for the regions in the target image; that is, the second network obtains the regression loss L_reg and the RPN loss L_rpn through training without relying on the first network, and then the regression loss L_reg, the RPN loss L_rpn, the above-mentioned feature loss L_b, and the classification loss L_cls are combined to obtain the total loss L. Optionally, the calculation method of the total loss L is shown in formula 1-6.
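Formula 1-6 is likewise not reproduced in this text. A minimal sketch of Case 2's combination of the four losses, assuming a weighted-sum form with illustrative balance factors (the weight names are hypothetical), is:

```python
def total_loss(l_b, l_cls, l_reg, l_rpn, w_cls=1.0, w_reg=1.0, w_rpn=1.0):
    """Combine the distillation terms L_b and L_cls with the second
    network's own regression loss L_reg and RPN loss L_rpn.

    The weighted-sum form and the balance factors w_* are assumptions
    for illustration; the patent's formula 1-6 may weight terms differently.
    """
    return l_b + w_cls * l_cls + w_reg * l_reg + w_rpn * l_rpn
```

During training, this scalar would be minimised by gradient descent over the second network's parameters, letting the distillation terms and the second network's own detection losses jointly shape the update.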
  • the order of determining the classification loss and the feature loss is not limited, and can be performed at the same time, or the process of determining the classification loss can be executed first, and the process of determining the feature loss can also be executed first.
  • the model training device trains the second network according to the feature loss and the classification loss, and the result obtained is not the final target network but an intermediate network; the intermediate network will then be trained through another network (for example, a third network).
  • This process can be regarded as the use of progressive distillation for the second network.
  • the principle is as follows: after distilling the second network based on the above-mentioned first network (that is, obtaining the feature loss and classification loss according to the first network and the second network, and then training the second network according to the feature loss and classification loss) to obtain the intermediate network, the intermediate network can be further distilled through a third network with more layers than the first network.
  • the principle of distilling the intermediate network through the third network is the same as that of distilling the second network through the first network; subsequently, networks with more layers can be gradually used to further distill each newly trained network, until the distillation of the second network reaches the expected goal, thereby obtaining the target network.
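The progressive distillation procedure described above can be sketched as a loop over increasingly deep teachers; `distill` below is a placeholder for one round of feature-loss and classification-loss training, and the string stand-ins in the usage example are purely illustrative:

```python
def progressive_distillation(student, teachers, distill):
    """Distill the student through a sequence of increasingly deep
    teachers (e.g. res50, then res101, then res152); each round
    produces a new intermediate network that becomes the next student.
    """
    for teacher in teachers:  # teachers ordered from fewer to more layers
        student = distill(teacher, student)
    return student

# Illustrative usage with string stand-ins for networks: each round
# records which teacher distilled the current student.
name_after = progressive_distillation(
    "res18", ["res50", "res101", "res152"],
    lambda t, s: s + "<-" + t)
```

In a real pipeline, `distill` would run the training of steps S1001–S1008 with the given teacher, and the loop would stop once the distilled network reaches the expected goal.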
  • first network 701 is a 101-layer (res101) neural network
  • second network 702 is a 50-layer (res50) neural network
  • based on the first and/or second technical points above, the second network 702 is distilled through the first network 701 to obtain an intermediate neural network (which can be denoted res101-50), and the intermediate neural network (res101-50) is then further distilled through a third neural network (that is, the second network 702 is distilled through the first network 701 and the third network); the principle of distilling the intermediate neural network through the third neural network is the same as the principle of distilling the second network 702 through the first network 701, and will not be repeated here.
  • the third neural network is a neural network larger than the first network 701, for example, a 152-layer (res152) neural network.
  • one, or two, or three of the three technical points exemplified above may be used.
  • alternatively, only the technique of progressive distillation (the 3rd of the above three technical points) may be used, and there is no special restriction on how to extract features and how to use the region proposals (that is, whether to use the 1st and 2nd of the above three technical points is not limited).
  • Step S1009 The model training device sends the target network to the model using device.
  • Step S1010 The model using device receives the target network sent by the model training device.
  • the target network is used to predict (or detect, or estimate) the content in an image (that is, to identify the target in the image), for example, to identify whether there is a face in the image and, if so, the specific position of the face in the image; or to identify whether there are road obstacles in the image and, if so, the positions of the obstacles in the image, and so on.
  • for the model using device 602, refer to the introduction of the model using device 602 in the architecture shown in FIG. 6.
  • the inventor of the present application has performed verification on two standard detection data sets, the two detection data sets are: COCO2017 data set and BDD100k data set.
  • the COCO data set contains 80 object categories, 110,000 training pictures and 5,000 verification pictures
  • the BDD100k data set contains 10 categories, with a total of 100,000 pictures.
  • COCO's evaluation criterion is used for evaluation, that is, the mean average precision over categories (mAP).
  • Table 1 shows different distillation strategies.
  • among them, the networks (or models) res18, res50, res101, and res152 have been pre-trained on the COCO data set, and it is only necessary to apply the distillation of the above embodiment.
  • res50-18 represents the network obtained by distilling the 18-layer network through the 50-layer network
  • res101-18 represents the network obtained by distilling the 18-layer network through the 101-layer network
  • res101-50-18 represents the network obtained by further distilling, through the 101-layer network, the network res50-18 obtained after the previous distillation
  • res152-101-50-18 represents the network obtained by further distilling, through the 152-layer network, the network res101-50-18 obtained after the previous distillation.
  • the network res50 can be regarded as the aforementioned first network
  • the network res18 can be regarded as the aforementioned second network
  • the network res101 can be regarded as the aforementioned third network
  • the network res152 can be regarded as the fourth network, which has more layers than the third network.
  • Table 2 shows the evaluation results of different networks on the COCO dataset.
  • the detection accuracy of the network res50-18 is significantly improved compared with the original network res18, an increase of 2.8%, and the detection accuracy of the network res101-18 is improved by 3.2 points compared with the network res18.
  • the accuracy of the network res101-50-18 obtained by the progressive distillation method is further improved compared with the network res50-18 obtained by a single distillation.
  • compared with the network res18, the detection accuracy of the network res152-101-50-18 is improved by 4.4%, and the distilled mAP reaches 0.366, which surpasses the detection accuracy of the network res50 (0.364); that is, the method of the embodiment of the present application progressively distills the network res18 so that the distilled network res18 exceeds the performance of the network res50.
  • mAP is the mean average precision
  • AP50 is the mean precision when the intersection over union (IoU) is greater than 0.5
  • AP75 is the mean precision when the IoU is greater than 0.75
  • APs is the mean precision for small objects
  • APm is the mean precision for medium objects
  • APl is the mean precision for large objects.
  • the network res50 and the network res101 are used as the teacher network, and the network res18 is used as the student network.
  • +1 means that only the first technical point above is used (that is, the Gaussian mask is used to highlight the target object), and +2 means the second technical point above is used ( That is, select the same set of regional proposals), and +3 means to use the third technical point above (ie, progressive distillation).
  • among them, the student network res18 in network res18(+1) and network res18(+1+2) is not pre-trained on the COCO data set, while the student network res18 of res18(+1+2+3) is pre-trained on COCO; pre-training helps narrow the gap between it and the teacher network, which is equivalent to a progressive distillation scheme. It can be seen that with each successive improvement, the distillation effect gradually improves, which also proves the effectiveness of the above three technical points.
  • the original network res18 (as a student network) and the network res50 (as a teacher network) have a mAP (accuracy) gap of 2.1 percentage points.
  • after distillation, the detection mAP (accuracy) of the original network res18 is increased by 1.5%, only 0.6% behind the teacher network res50, making up nearly 75% of the mAP (accuracy) gap; the effect is very obvious.
  • in the solution of the embodiment of the present application, the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered out through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network; therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • in addition, the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposals; since the predictions are based on the same region proposals, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines the classification loss for training the second network based on the difference between the first classification prediction value and the second classification prediction value; in this way, the loss of the second network relative to the first network can be maximized, so training the second network based on the classification loss can make the classification result of the second network closer to that of the first network, and the model distillation effect is very good.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • FIG. 13 is a schematic structural diagram of a model training device 130 provided by an embodiment of the present application.
  • the model training device 130 may be the model training device in the foregoing method embodiment or a device in the model training device.
  • the model training device 130 may include a feature extraction unit 1301, a first optimization unit 1302, a second optimization unit 1303, a first determination unit 1304, and a weight adjustment unit 1305.
  • the detailed description of each unit is as follows.
  • the feature extraction unit 1301 is configured to extract the first feature information in the target image through the feature extraction layer of the first network;
  • the feature extraction unit 1301 is further configured to extract second feature information in the target image through the feature extraction layer of the second network, wherein the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network;
  • the first optimization unit 1302 is configured to extract the feature of the target object in the first feature information through a Gaussian mask to obtain the first local feature;
  • the second optimization unit 1303 is configured to extract the feature of the target object in the second feature information through a Gaussian mask to obtain a second local feature;
  • the first determining unit 1304 is configured to determine the feature loss according to the first local feature and the second local feature;
  • the weight adjustment unit 1305 is configured to train the second network according to the feature loss to obtain a target network.
  • in this solution, the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network are highlighted through the Gaussian mask; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered out through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network; therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the device further includes:
  • a first generating unit configured to generate the first classification prediction value proposed by the target area in the area proposal set through the classification layer of the first network
  • a second generation unit configured to generate a second classification prediction value proposed by the target area in the area proposal set through the classification layer of the second network
  • a second determining unit configured to determine a classification loss according to the first classification prediction value and the second classification prediction value
  • the weight adjustment unit is specifically configured to train the second network according to the feature loss and the classification loss to obtain a target network.
  • in this solution, the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposals; since the predictions are based on the same region proposals, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines the classification loss for training the second network based on the difference between the first classification prediction value and the second classification prediction value; in this way, the loss of the second network relative to the first network can be maximized, so training the second network based on the classification loss can make the classification result of the second network closer to that of the first network, and the model distillation effect is very good.
  • the second network is trained according to the feature loss to obtain the target network
  • the weight adjustment unit is specifically configured to:
  • the second network after training is trained by a third network to obtain a target network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the first network and the second network share the area proposal network (RPN) so that both the first network and the second network have the area proposal set.
  • the RPN is shared by the second network to the first network, or shared by the first network to the second network.
  • the target area proposal is all area proposals in the area proposal set, or a positive example area proposal belonging to the target object in the area proposal set.
  • the classification loss L_cls satisfies the following relationship, where:
  • K is the total number of area proposals in the area proposal set
  • N_p is the total number of positive example area proposals belonging to the target object in the area proposal set
  • y_m is the truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the device further includes:
  • the third determining unit is configured to determine the regression loss and the RPN loss of the second network according to the truth labels of the region proposals in the target image and the prediction values predicted by the second network for the regions in the target image;
  • the weight adjustment unit is specifically configured to train the second network according to the feature loss, the classification loss, the regression loss, and the RPN loss to obtain a target network.
  • the device further includes:
  • the sending unit is configured to send the target network to the model using device after the weight adjustment unit trains the second network according to the feature loss to obtain the target network, where the target network is used to predict the content in the image .
  • FIG. 14 is a schematic structural diagram of a model training device 140 provided by an embodiment of the present application.
  • the model training device 140 may be the model training device in the foregoing method embodiment or a device in the model training device.
  • the model training device 140 may include a first training unit 1401 and a second training unit 1402, wherein the detailed description of each unit is as follows.
  • the first training unit 1401 is configured to train the second network based on the first network to obtain the intermediate network;
  • the second training unit 1402 is configured to train the intermediate network based on a third network to obtain a target network, wherein the first network, the second network, and the third network are all classification networks, the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network.
  • the third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the training of the second network based on the first network to obtain the intermediate network includes:
  • the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered out through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network; therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the device further includes:
  • the first generating unit is configured to generate the first classification prediction value of the target area proposal in the area proposal set through the classification layer of the first network;
  • a second generation unit configured to generate a second classification prediction value proposed by the target region in the region proposal set through the classification layer of the second network
  • a second determining unit configured to determine a classification loss according to the first classification prediction value and the second classification prediction value
  • the training of the second network according to the feature loss to obtain the intermediate network is specifically:
  • in this solution, the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposals; since the predictions are based on the same region proposals, the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines the classification loss for training the second network based on the difference between the first classification prediction value and the second classification prediction value; in this way, the loss of the second network relative to the first network can be maximized, so training the second network based on the classification loss can make the classification result of the second network closer to that of the first network, and the model distillation effect is very good.
  • the first network and the second network share the area proposal network (RPN) so that both the first network and the second network have the area proposal set.
  • the RPN is shared by the second network to the first network, or shared by the first network to the second network.
  • the target area proposal is all area proposals in the area proposal set, or a positive example area proposal belonging to the target object in the area proposal set.
  • the classification loss L_cls satisfies the following relationship, where:
  • K is the total number of area proposals in the area proposal set
  • N_p is the total number of positive example area proposals belonging to the target object in the area proposal set
  • y_m is the truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the device further includes:
  • the third determining unit is configured to determine the regression loss and the RPN loss of the second network according to the truth labels of the region proposals in the target image and the prediction values predicted by the second network for the regions in the target image;
  • the weight adjustment unit is specifically configured to train the second network according to the feature loss, the classification loss, the regression loss, and the RPN loss to obtain a target network.
  • the device further includes:
  • the sending unit is configured to send the target network to the model using device after the weight adjustment unit trains the second network according to the feature loss to obtain the target network, where the target network is used to predict the content in the image .
  • Figure 15 is a schematic structural diagram of an image detection device 150 provided by an embodiment of the present application.
  • the image detection device 150 may include an acquisition unit 1501 and an identification unit 1502, wherein the detailed description of each unit is as follows.
  • the obtaining unit 1501 is configured to obtain a target network, where the target network is a network obtained by training a second network through a first network, and the parameters used to train the second network through the first network include a feature loss; the feature loss is determined based on a first local feature and a second local feature, the first local feature is the feature of the target object extracted from the first feature information through a Gaussian mask, the second local feature is the feature of the target object extracted from the second feature information through a Gaussian mask, the first feature information is the feature information in the target image extracted through the feature extraction layer of the first network, and the second feature information is the feature information in the target image extracted through the feature extraction layer of the second network; the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network;
  • the recognition unit 1502 is configured to recognize the content in the image through the target network.
  • in this solution, the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network are highlighted through the Gaussian mask; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the background noise of the image (including the background noise outside the box of the target object and the background noise inside the box) is filtered out through the Gaussian mask, and the feature loss obtained on this basis can better reflect the difference between the second network and the first network; therefore, training the second network based on the feature loss can make the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used to train the second network further include classification loss, where the classification loss is determined according to the first classification prediction value and the second classification prediction value, and the first The classification prediction value is the classification prediction value proposed by the target region in the region proposal set generated by the classification layer of the first network, and the second classification prediction value is the region generated by the classification layer of the second network The proposed classification prediction value of the target area in the proposal set.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
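As a concrete sketch of this idea, the classification loss can be built from the gap between the two networks' predictions on the same proposals. The squared-error distance below is an illustrative choice only; the patent does not commit to it here, and the function name is hypothetical.

```python
def classification_distillation_loss(first_preds, second_preds):
    """Average squared gap between the class prediction vectors the two
    classification layers produce for the *same* region proposals.
    Because both networks score identical proposals, any remaining gap
    is attributable to their model parameters, which is exactly what
    this distillation loss penalises."""
    assert len(first_preds) == len(second_preds)
    loss = 0.0
    for f_vec, s_vec in zip(first_preds, second_preds):
        loss += sum((f - s) ** 2 for f, s in zip(f_vec, s_vec)) / len(f_vec)
    return loss / len(first_preds)
```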
  • the target network is specifically a network obtained after the second network is trained through the first network and the trained network is further trained through a third network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
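The progressive scheme (distil with the first network, then with the deeper third network) can be modelled as a toy loop. Representing a network by its depth and accuracy, and each distillation stage as closing half of the accuracy gap, is purely a didactic assumption:

```python
def distill(teacher, student):
    """One distillation stage: pull the student towards the teacher,
    modelled here as closing half of the accuracy gap."""
    assert teacher["depth"] > student["depth"], "teacher must be deeper"
    gap = teacher["accuracy"] - student["accuracy"]
    return {**student, "accuracy": student["accuracy"] + 0.5 * gap}

def iterative_distillation(student, teachers):
    """Train the student with progressively deeper teachers, as in the
    first-network-then-third-network scheme described above."""
    for teacher in sorted(teachers, key=lambda n: n["depth"]):
        student = distill(teacher, student)
    return student
```

Sorting the teachers by depth mirrors the order in the text: the shallower first network trains the second network first, and only then does the deepest third network refine the result.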
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the region proposal set.
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
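A sketch of the sharing arrangement: a single RPN produces one proposal set, and both classification heads consume it, so their predictions are directly comparable. All callables here are illustrative stand-ins for real network modules:

```python
def shared_rpn_forward(rpn, first_head, second_head, image_feats):
    """Run the shared region proposal network once and feed the *same*
    proposal set to the classification heads of both networks."""
    proposals = rpn(image_feats)
    return first_head(image_feats, proposals), second_head(image_feats, proposals)
```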
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
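The published formula itself is not reproduced in this text, so the sketch below only illustrates how the listed quantities (K, N_p, y_m, and the weight balance factor) could enter such a loss; the exact combination in the patent may differ, and the function is a hypothetical stand-in.

```python
import math

def weighted_classification_loss(preds, labels, balance=1.0):
    """Illustrative cross-entropy-style loss over K region proposals in
    which the N_p normal (positive) proposals are re-weighted by a
    balance factor, mirroring the quantities listed above."""
    K = len(preds)
    Np = sum(1 for y in labels if y == 1)
    loss = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clamp for log stability
        term = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # positives are averaged over N_p and scaled by the balance
        # factor; negatives are averaged over all K proposals
        loss += balance * term / max(Np, 1) if y == 1 else term / K
    return loss
```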
  • the parameters used for training the second network further include the regression loss and the RPN loss of the second network, wherein the regression loss and the RPN loss of the second network are determined based on the ground-truth label of the region proposal in the target image and the prediction value of the second network for the region proposal in the target image.
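Putting the pieces together, the second network's training objective combines the two distillation terms with its own detection losses. The weights are illustrative hyper-parameters, not values from the patent:

```python
def total_training_loss(feature_loss, cls_distill_loss,
                        regression_loss, rpn_loss,
                        w_feat=1.0, w_cls=1.0):
    """Total objective for the second (student) network: the feature
    and classification distillation terms plus the student's own
    regression and RPN losses."""
    return (w_feat * feature_loss + w_cls * cls_distill_loss
            + regression_loss + rpn_loss)
```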
  • the acquiring unit is specifically configured to:
  • FIG. 16 is a schematic structural diagram of an image detection device 160 provided by an embodiment of the present application.
  • the image detection device 160 may be the model-using device in the foregoing method embodiment or a device in the model-using device.
  • the image detection device 160 may include an acquisition unit 1601 and an identification unit 1602, wherein the detailed description of each unit is as follows.
  • the acquiring unit 1601 is configured to acquire a target network, where the target network is a network obtained by iteratively training a second network through multiple networks; the multiple networks are all classification networks and include at least a first network and a third network; the first network trains the second network to obtain an intermediate network, and the third network is then used to train the intermediate network, wherein the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network;
  • the recognition unit 1602 is configured to recognize the content in the image through the target network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the parameters used when the first network trains the second network include feature loss, where the feature loss is determined according to the first local feature and the second local feature;
  • the first local feature is the feature about the target object extracted from the first feature information through a Gaussian mask;
  • the second local feature is the feature about the target object extracted from the second feature information through a Gaussian mask;
  • the first feature information is the feature information in the target image extracted through the feature extraction layer of the first network;
  • the second feature information is the feature information in the target image extracted through the feature extraction layer of the second network.
  • the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (both the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis better reflects the difference between the second network and the first network. Training the second network based on this feature loss therefore brings the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used when the first network trains the second network include classification loss, where the classification loss is determined according to the first classification prediction value and the second classification prediction value.
  • the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the first network
  • the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network adopt a shared region proposal network (RPN).
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the parameters used for training the second network further include the regression loss and the RPN loss of the second network, wherein the regression loss and the RPN loss of the second network are determined based on the ground-truth label of the region proposal in the target image and the prediction value of the second network for the region proposal in the target image.
  • the acquiring unit is specifically configured to:
  • FIG. 17 is a schematic structural diagram of a model training device 170 provided by an embodiment of the present application.
  • the model training device 170 includes a processor 1701, a memory 1702, and a communication interface 1703.
  • the processor 1701, the memory 1702, and the communication interface 1703 are connected to each other through a bus.
  • the memory 1702 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 1702 is used to store related computer programs and data.
  • the communication interface 1703 is used to receive and send data.
  • the processor 1701 may be one or more central processing units (CPUs).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 1701 in the model training device 170 is configured to read the computer program code stored in the memory 1702, and perform the following operations:
  • the Gaussian mask highlights the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined from the local features of the target object in the two networks, and the second network is trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (both the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis better reflects the difference between the second network and the first network. Training the second network based on this feature loss therefore brings the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the processor is further configured to:
  • the training the second network according to the feature loss to obtain the target network includes:
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the processor is specifically configured to:
  • the second network after training is trained by a third network to obtain a target network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the region proposal set.
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the processor is further configured to:
  • Training the second network according to the feature loss and the classification loss to obtain a target network includes:
  • the processor is further configured to:
  • the target network is sent to the model-using device through the communication interface 1703, where the target network is used to predict the content in the image.
  • each operation may also correspond to the corresponding description of the method embodiment shown in FIG. 10.
  • FIG. 18 is a schematic structural diagram of a model training device 180 provided by an embodiment of the present application.
  • the model training device 180 includes a processor 1801, a memory 1802, and a communication interface 1803.
  • the processor 1801, the memory 1802, and the communication interface 1803 are connected to each other through a bus.
  • the memory 1802 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 1802 is used to store related computer programs and data.
  • the communication interface 1803 is used to receive and send data.
  • the processor 1801 may be one or more central processing units (CPUs).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 1801 in the model training device 180 is configured to read the computer program code stored in the memory 1802, and perform the following operations:
  • in each embodiment of this application, training the second network based on the first network to obtain the intermediate network is essentially distilling the second network through the first network, and training, based on the third network, the second network that has been trained by the first network is essentially distilling, through the third network, the second network that has already been distilled by the first network; this is explained here.
  • the depth of the third network is greater than the depth of the first network, and the depth of the first network is greater than the depth of the second network.
  • the third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the processor is specifically configured to:
  • the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (both the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis better reflects the difference between the second network and the first network. Training the second network based on this feature loss therefore brings the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the processor 1801 is further configured to:
  • the processor is specifically configured to:
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the region proposal set.
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the method further includes:
  • Training the second network according to the feature loss and the classification loss to obtain a target network includes:
  • the method further includes:
  • the target network is sent to the model-using device through the communication interface 1803, where the target network is used to predict the content in the image.
  • each operation may also correspond to the corresponding description of the method embodiment shown in FIG. 10.
  • FIG. 19 is a schematic structural diagram of a model-using device 190 provided by an embodiment of the present application.
  • the model-using device 190 may also be called an image detection device or other names.
  • the model-using device 190 includes a processor 1901, a memory 1902, and a communication interface 1903.
  • the processor 1901, the memory 1902 and the communication interface 1903 are connected to each other through a bus.
  • the memory 1902 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 1902 is used to store related computer programs and data.
  • the communication interface 1903 is used to receive and send data.
  • the processor 1901 may be one or more central processing units (CPUs). When the processor 1901 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 1901 in the model-using device 190 is configured to read the computer program code stored in the memory 1902 and perform the following operations:
  • the target network is a network obtained by training a second network through the first network
  • the parameters used to train the second network through the first network include feature loss
  • the feature loss is determined based on the first local feature and the second local feature
  • the first local feature is a feature about the target object extracted from the first feature information through a Gaussian mask
  • the second local feature is the feature about the target object extracted from the second feature information through a Gaussian mask, where the first feature information is the feature information in the target image extracted by the feature extraction layer of the first network, and the second feature information is the feature information in the target image extracted by the feature extraction layer of the second network; the first network and the second network are both classification networks, and the depth of the first network is greater than the depth of the second network.
  • the Gaussian mask highlights the local features of the target object in the feature information extracted by the first network and in the feature information extracted by the second network; the feature loss is then determined from the local features of the target object in the two networks, and the second network is trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (both the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis better reflects the difference between the second network and the first network. Training the second network based on this feature loss therefore brings the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used to train the second network further include classification loss, where the classification loss is determined according to the first classification prediction value and the second classification prediction value; the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the first network, and the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the target network is specifically a network obtained after the second network is trained through the first network and the trained network is further trained through a third network, wherein the depth of the third network is greater than the depth of the first network.
  • a third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the region proposal set.
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the parameters used for training the second network further include the regression loss and the RPN loss of the second network, wherein the regression loss and the RPN loss of the second network are determined based on the ground-truth label of the region proposal in the target image and the prediction value of the second network for the region proposal in the target image.
  • the processor is specifically configured to:
  • the target network sent by the model training device is received through the communication interface 1903, where the model training device is used for training to obtain the target network.
  • each operation may also correspond to the corresponding description of the method embodiment shown in FIG. 10.
  • FIG. 20 is a schematic structural diagram of a model-using device 200 provided by an embodiment of the present application.
  • the model-using device 200 may also be called an image detection device or other names.
  • the model-using device 200 includes a processor 2001, a memory 2002, and a communication interface 2003.
  • the processor 2001, the memory 2002, and the communication interface 2003 are connected to each other through a bus.
  • the memory 2002 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 2002 is used to store related computer programs and data.
  • the communication interface 2003 is used to receive and send data.
  • the processor 2001 may be one or more central processing units (CPU).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 2001 in the model-using device 200 is configured to read the computer program code stored in the memory 2002 and perform the following operations:
  • the target network is a network obtained by training a second network through multiple network iterations
  • the multiple networks are all classification networks
  • the multiple networks include at least a first network and a third network
  • the third network is used to train the intermediate network after the first network trains the second network to obtain the intermediate network, wherein the depth of the third network is greater than the depth of the first network, The depth of the first network is greater than the depth of the second network;
  • the third network with more layers is further used to further train the trained second network, which can stably improve the performance of the second network.
  • the parameters used when the first network trains the second network include feature loss, where the feature loss is determined according to the first local feature and the second local feature;
  • the first local feature is the feature about the target object extracted from the first feature information through a Gaussian mask;
  • the second local feature is the feature about the target object extracted from the second feature information through a Gaussian mask;
  • the first feature information is the feature information in the target image extracted through the feature extraction layer of the first network;
  • the second feature information is the feature information in the target image extracted through the feature extraction layer of the second network.
  • the Gaussian mask is used to highlight the local features of the target object in the feature information extracted by the first network and the local features of the target object in the feature information extracted by the second network; the feature loss is then determined according to the local features of the target object in the two networks, and the second network is subsequently trained based on the feature loss.
  • the Gaussian mask filters out the background noise of the image (both the background noise outside the target object's box and the background noise inside the box), so the feature loss obtained on this basis better reflects the difference between the second network and the first network. Training the second network based on this feature loss therefore brings the feature expression of the second network closer to that of the first network, and the model distillation effect is very good.
  • the parameters used when the first network trains the second network include classification loss, where the classification loss is determined according to the first classification prediction value and the second classification prediction value.
  • the first classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the first network
  • the second classification prediction value is the classification prediction value of the target region proposal in the region proposal set, generated by the classification layer of the second network.
  • the classification layer of the first network and the classification layer of the second network generate classification prediction values based on the same region proposal set.
  • the difference between the prediction values generated by the two networks is generally caused by the difference in the model parameters of the two networks. Therefore, the embodiment of the present application determines, from the difference between the first classification prediction value and the second classification prediction value, the classification loss used to train the second network; in this way, the loss of the second network relative to the first network can be fully reflected, so training the second network based on the classification loss brings the classification results of the second network closer to those of the first network, and the model distillation effect is good.
  • the first network and the second network share a region proposal network (RPN), so that both the first network and the second network have the region proposal set.
  • the RPN may be shared by the second network with the first network, or shared by the first network with the second network.
  • the target region proposal is all of the region proposals in the region proposal set, or the normal region proposals belonging to the target object in the region proposal set.
  • the classification loss L_cls satisfies the following relationship:
  • K is the total number of region proposals in the region proposal set
  • N_p is the total number of normal region proposals belonging to the target object in the region proposal set
  • y_m is the ground-truth label corresponding to the m-th region proposal in the region proposal set
  • is the preset weight balance factor.
  • the parameters used for training the second network further include the regression loss and the RPN loss of the second network, wherein the regression loss and the RPN loss of the second network are determined based on the ground-truth label of the region proposal in the target image and the prediction value of the second network for the region proposal in the target image.
  • the acquiring the target network includes:
  • each operation may also correspond to the corresponding description of the method embodiment shown in FIG. 10.
  • An embodiment of the present application also provides a chip system. The chip system includes at least one processor, a memory, and an interface circuit; the memory, the interface circuit, and the at least one processor are interconnected by wires, and a computer program is stored in the at least one memory. When the computer program is executed by the processor, the method flow shown in FIG. 10 is realized.
  • An embodiment of the present application also provides a computer-readable storage medium in which a computer program is stored; when the computer program runs on a processor, the method flow shown in FIG. 10 is realized.
  • An embodiment of the present application also provides a computer program product; when the computer program product runs on a processor, the method flow shown in FIG. 10 is realized.
  • The computer program can be stored in a computer-readable storage medium; when the computer program is executed, the procedures of the foregoing method embodiments may be included. The aforementioned storage media include ROM, random access memory (RAM), magnetic disks, optical disks, and other media that can store computer program code.


Abstract

A model training method and related apparatus, applicable to fields such as artificial intelligence and computer vision for image detection. The method includes: extracting feature information from a target image through the feature extraction layer of a first network and the feature extraction layer of a second network respectively; further extracting, through Gaussian masks, the features concerning the target object from the feature information, obtaining a first local feature and a second local feature; determining a feature loss from the first local feature and the second local feature; obtaining a first classification prediction value and a second classification prediction value by having the first network and the second network make predictions based on the same region proposal set, and determining a classification loss from the first and second classification prediction values; and then training the second network according to the classification loss and the feature loss to obtain a target network. With the above method, the target network used for image detection achieves faster prediction speed and higher prediction accuracy.

Description

一种模型训练方法及相关设备
本申请要求于2020年05月15日提交中国专利局、申请号为202010412910.6、申请名称为“一种模型训练方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理领域,尤其涉及一种模型训练方法及相关设备。
背景技术
[根据细则91更正 27.04.2021] 
目标检测指的是在图像中对目标物体进行分类和定位,如图1所示,通过目标检测可以对图像中的伞101进行分类和定位,也可以对图像中的人102进行分类和定位。对图像进行目标检测的应用非常广泛,如自动驾驶、平安城市、以及手机终端等,因此对检测准确度和速度都有很高的要求。对图像进行目标检测通常是通过神经网络来实现的,然而,大神经网络检测的准确度虽高,速度却很慢;小神经网络检测的速度快,准确度很低。
如何训练出一个检测速度更快、检测结果更准确的神经网络是本领域的技术人员正在研究的技术问题。
发明内容
本申请实施例公开了一种模型训练方法及相关设备,可以应用于人工智能、计算机视觉等领域中,用来进行图像检测,该方法及相关设备能够提高网络的预测效率和准确度。
第一方面,本申请实施例提供一种模型训练方法,该方法包括:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息,其中,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到目标网络。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第一方面,在第一方面的第一种可能的实现方式中,所述方法还包括:
通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述根据所述特征损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第一方面,或者第一方面的第一种可能的实现方式,在第一方面的第二种可能的实现方式中,所述根据所述特征损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失训练所述第二网络;
通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
第二方面,本申请实施例提供一种模型训练方法,该方法包括:
基于第一网络训练第二网络得到中间网络;
基于第三网络训练所述中间网络,得到目标网络,其中,所述第一网络、所述第二网络和所述第三网络均为分类网络,且所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
结合第二方面,在第二方面的第一种可能的实现方式中,所述基于第一网络训练第二网络包括:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。 通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第二方面的第一种可能的实现方式,在第二方面的第二种可能的实现方式中,所述方法还包括:
通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述根据所述特征损失训练所述第二网络,得到所述中间网络,包括:
根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式,在又一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
结合第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式,在又一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
结合第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式,在又一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
结合第一方面，或者第二方面，或者第一方面的任一种可能的实现方式，或者第二方面的任一种可能的实现方式，在又一种可能的实现方式中，所述分类损失L_cls满足如下关系：

L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)

其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
结合第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式,在又一种可能的实现方式中,所述方法还包括:
根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
结合第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式,在又一种可能的实现方式中,所述根据所述特征损失训练所述第二网络,得到目标网络之后,还包括:
向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
第三方面,本申请实施例提供一种图像检测方法,该方法包括:
获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
通过所述目标网络识别图像中的内容。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第三方面,在第三方面的第一种可能的实现方式中,训练所述第二网络用到的参数还包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第三方面,或者第三方面的第一种可能的实现方式,在第二种可能的实现方式中,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
第四方面,本申请实施例提供一种图像检测方法,该方法包括:
获取目标网络,其中,所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络,所述多个网络均为分类网络,所述多个网络至少包括第一网络和第三网络,所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度;
通过所述目标网络识别图像中的内容。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
结合第四方面,在第四方面的第一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括特征损失,其中,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
在该可能的实现方式中，通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征，以及突出第二网络提取的特征信息中关于目标物体的局部特征，然后根据两网络中关于目标物体的局部特征确定特征损失，后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声（包括目标物体的方框外的背景噪声和方框内的背景噪声），在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异，因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络，模型蒸馏效果很好。
结合第四方面的第一种可能的实现方式,在第四方面的第二种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式,在又一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
结合第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式,在又一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
结合第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式,在又一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
结合第三方面，或者第四方面，或者第三方面的任一种可能的实现方式，或者第四方面的任一种可能的实现方式，在又一种可能的实现方式中，所述分类损失L_cls满足如下关系：

L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)

其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
结合第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式,在又一种可能的实现方式中,训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
结合第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式,在又一种可能的实现方式中,所述获取目标网络,包括:
接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
第五方面,本申请实施例提供一种模型训练装置,该装置包括:
特征提取单元,用于通过第一网络的特征提取层提取目标图像中的第一特征信息;
所述特征提取单元,还用于通过第二网络的特征提取层提取目标图像中的第二特征信息,其中,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
第一优化单元,用于通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
第二优化单元,用于通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
第一确定单元,用于通过所述第一局部特征和所述第二局部特征确定特征损失;
权重调整单元,用于根据所述特征损失训练所述第二网络,得到目标网络。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第五方面,在第五方面的第一种可能的实现方式中,所述装置还包括:
第一生成单元,用于通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
第二生成单元,用于通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述权重调整单元具体用于：根据所述特征损失和所述分类损失训练所述第二网络，得到目标网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第五方面，或者第五方面的任一种可能的实现方式，在第五方面的第二种可能的实现方式中，在根据所述特征损失训练所述第二网络，得到目标网络时，所述权重调整单元具体用于：
根据所述特征损失训练所述第二网络;
通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
第六方面,本申请实施例提供一种模型训练装置,该装置包括:
第一训练单元,用于基于第一网络训练第二网络得到中间网络;
第二训练单元,用于基于第三网络训练所述中间网络,得到目标网络,其中,所述第一网络、所述第二网络和所述第三网络均为分类网络,且所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
结合第六方面,在第六方面的第一种可能的实现方式中,所述基于第一网络训练第二网络得到中间网络包括:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第六方面,或者第六方面的第一种可能的实现方式,在第六方面的第二种可能的实现方式中,所述装置还包括:
第一生成单元,用于通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
第二生成单元,用于通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述根据所述特征损失训练所述第二网络,得到所述中间网络,具体为:
根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第五方面,或者第六方面,或者第五方面的任一种可能的实现方式,或者第六方面的任一种可能的实现方式,在又一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
结合第五方面,或者第六方面,或者第五方面的任一种可能的实现方式,或者第六方面的任一种可能的实现方式,在又一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
结合第五方面,或者第六方面,或者第五方面的任一种可能的实现方式,或者第六方面的任一种可能的实现方式,在又一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
结合第五方面，或者第六方面，或者第五方面的任一种可能的实现方式，或者第六方面的任一种可能的实现方式，在又一种可能的实现方式中，所述分类损失L_cls满足如下关系：

L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)

其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
结合第五方面,或者第六方面,或者第五方面的任一种可能的实现方式,或者第六方面的任一种可能的实现方式,在又一种可能的实现方式中,所述装置还包括:
第三确定单元,用于根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
所述权重调整单元具体用于:根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
结合第五方面,或者第六方面,或者第五方面的任一种可能的实现方式,或者第六方面的任一种可能的实现方式,在又一种可能的实现方式中,还包括:
发送单元,用于在所述权重调整单元根据所述特征损失训练所述第二网络,得到目标网络之后,向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
第七方面,本申请实施例提供一种图像检测装置,该装置包括:
获取单元,用于获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
识别单元,用于通过所述目标网络识别图像中的内容。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第七方面,在第七方面的第一种可能的实现方式中,训练所述第二网络用到的参数还包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议 的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第七方面,或者第七方面的第一种可能的实现方式,在第七方面的第二种可能的实现方式中,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
第八方面,本申请实施例提供一种图像检测装置,该装置包括:
获取单元,用于获取目标网络,其中,所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络,所述多个网络均为分类网络,所述多个网络至少包括第一网络和第三网络,所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度;
识别单元,用于通过所述目标网络识别图像中的内容。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
结合第八方面,在第八方面的第一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括特征损失,其中,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
结合第八方面的第一种可能的实现方式，在第八方面的第二种可能的实现方式中，所述第一网络对第二网络进行训练时用到的参数包括分类损失，其中，所述分类损失为根据第一分类预测值和第二分类预测值确定的，所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值，所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
结合第七方面,或者第八方面,或者第七方面的任一种可能的实现方式,或者第八方面的任一种可能的实现方式,在又一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
结合第七方面,或者第八方面,或者第七方面的任一种可能的实现方式,或者第八方面的任一种可能的实现方式,在又一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
结合第七方面,或者第八方面,或者第七方面的任一种可能的实现方式,或者第八方面的任一种可能的实现方式,在又一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
结合第七方面，或者第八方面，或者第七方面的任一种可能的实现方式，或者第八方面的任一种可能的实现方式，在又一种可能的实现方式中，所述分类损失L_cls满足如下关系：

L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)

其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
结合第七方面,或者第八方面,或者第七方面的任一种可能的实现方式,或者第八方面的任一种可能的实现方式,在又一种可能的实现方式中:训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
结合第七方面,或者第八方面,或者第七方面的任一种可能的实现方式,或者第八方面的任一种可能的实现方式,在又一种可能的实现方式中,所述获取单元具体用于:
接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
第九方面,本申请实施例提供一种模型训练设备,该模型训练设备包括存储器和处理器,该存储器用于存储计算机程序,该处理器用于调用该计算机程序来实现上述第一方面,或者第二方面,或者第一方面的任一种可能的实现方式,或者第二方面的任一种可能的实现方式所描述的方法。
第十方面,本申请实施例提供一种模型使用设备,该模型使用设备包括存储器和处理器,该存储器用于存储计算机程序,该处理器用于调用该计算机程序来实现上述第三方面,或者第四方面,或者第三方面的任一种可能的实现方式,或者第四方面的任一种可能的实现方式所描述的方法。
第十一方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,当所述计算机程序在处理器上运行时,实现上述第一方面,或者第二方面,或者第三方面,或者第四方面,或者其中任一个方面的任一种可能的实现方式所描述的方法。
附图说明
以下对本申请实施例用到的附图进行介绍。
图1是本申请实施例提供的一种图像检测的场景示意图;
图2是本申请实施例提供的一种模型蒸馏的场景示意图;
图3是本申请实施例提供的又一种模型蒸馏的场景示意图;
图4是本申请实施例提供的又一种模型蒸馏的场景示意图;
图5是本申请实施例提供的又一种模型蒸馏的场景示意图;
图6是本申请实施例提供的一种模型训练的架构示意图;
图7是本申请实施例提供的又一种模型蒸馏的场景示意图;
图8是本申请实施例提供的又一种图像检测的场景示意图;
图9是本申请实施例提供的又一种图像检测的场景示意图;
图10是本申请实施例提供的一种模型训练方法的流程示意图;
图11是本申请实施例提供的一种高斯掩膜的原理示意图;
图12是本申请实施例提供的又一种模型蒸馏的场景示意图;
图13是本申请实施例提供的一种模型训练装置的结构示意图;
图14是本申请实施例提供的又一种模型训练装置的结构示意图;
图15是本申请实施例提供的一种图像检测装置的结构示意图;
图16是本申请实施例提供的又一种图像检测装置的结构示意图;
图17是本申请实施例提供的一种模型训练设备的结构示意图;
图18是本申请实施例提供的又一种模型训练设备的结构示意图;
图19是本申请实施例提供的一种模型使用设备的结构示意图;
图20是本申请实施例提供的又一种模型使用设备的结构示意图。
具体实施方式
下面结合本申请实施例中的附图对本申请实施例进行描述。
在人工智能、计算机视觉等领域中,通常涉及图像检测,即对图像中的物体(或者说目标)进行识别,图像检测通常是通过神经网络来实现的,在面临如何训练出一个检测速度更快、检测结果更准确的神经网络来进行目标检测的技术问题时,可以采用模型蒸馏的方式来解决,如图2所示,具体是让大神经网络指导小神经网络的训练,或者说小神经网络模拟大神经网络的表现,从而将大神经网络检测结果更准确的性能赋予小神经网络,使得小神经网络既具备检测速度快,又具备检测结果更准确的优点。
下面例举几种通过大神经网络指导小神经网络训练的方案。
例如,如图3所示,通过大神经网络的特征提取层提取图像的特征信息,然后送到大神经网络的分类层产生分类软标签;以及通过小神经网络的特征提取层提取图像的特征信息,然后送到小神经网络的分类层产生分类软标签;接着通过两个网络产生的分类软标签确定分类损失;然后通过分类损失指导小神经网络的分类层的训练。然而,这种方案中大神经网络对小神经网络的指导不够全面(如忽略了对特征提取层的指导),也不够精细化,因此指导之后的效果并不理想。
再如，如图4所示，将大神经网络的特征提取层的信息传递给小神经网络，然后对具有大神经网络的特征提取层的信息的小神经网络进行压缩，得到一个新的更瘦、更深的小神经网络。图4中的W_T为大神经网络（老师网络）的模型权重，不同层可以有不同的权重，例如W_T^1、W_T^2等；图4中的W_S为小神经网络（学生网络）的模型权重，不同层可以有不同的权重，例如W_S^1、W_S^2等。然而，这种方案中需要构建新的小神经网络，且构建的新的小神经网络的层数变得更多，具有一定的复杂性。并且，这种方案中通过特征提取层提取的是整幅图中的特征，存在较多背景噪声，从而导致检测结果不理想。
再如,如图5所示,选择图像中目标区域附近的区域作为蒸馏区域,让小神经网络学习大神经网络在该蒸馏区域的特征提取层表达,然而蒸馏区域中依然存在很多无用的背景噪声,因此学习的特征提取层表达依旧不理想;并且,小神经网络仅对特征提取层的表达进行了学习,因此性能提升有限。
鉴于大神经网络指导小神经网络训练依旧存在较多局限性,本申请实施例下面进一步提供相关的架构、设备、及方法来进一步提升大神经网络指导小神经网络训练的效果。
请参见图6，图6是本申请实施例提供的一种模型训练的架构示意图，该架构包括模型训练设备601和一个或多个模型使用设备602，该模型训练设备601与该模型使用设备602之间通过有线或者无线的方式进行通信，因此该模型训练设备601可以将训练出的用于预测图像中的目标物体的模型（或者网络）发送给模型使用设备602；相应地，模型使用设备602通过接收到的模型来预测待预测图像中的目标物体。
可选的,该模型使用设备602可以将基于该模型预测的结果反馈给上述模型训练设备601,使得模型训练设备601可以进一步基于该模型使用设备602的预测结果对模型做进一步的训练;重新训练好的模型可以发送给模型使用设备602对原来的模型进行更新。
该模型训练设备601 可以为具有较强计算能力的设备,例如,一个服务器,或者由多个服务器组成的一个服务器集群。该模型训练设备601中可以包括很多神经网络,其中层比较多的神经网络相对于层比较少的神经网络而言可以称为大神经网络,层比较少的神经网络相对于层比较多的神经网络而言可以称为小神经网络,即第一网络的深度大于第二网络的深度。
如图7所示,该模型训练设备601中包括第一网络701和第二网络702,该第一网络701可以为比第二网络702大的神经网络,第一网络701和第二网络702均包括特征提取层(也可以称为特征层)和分类层(也可以称分类器,或者分类头),其中,特征提取层用于提取图像中的特征信息,分类层用于基于提取的特征信息对图像中的目标物体进行分类。
本申请实施例中,可以将第一网络701作为老师网络,将第二网络702作为学生网络,由第一网络701来指导第二网络702的训练,该过程可以称为蒸馏。本申请实施例中,该第一网络701指导第二网络702的思路包括如下三个技术点:
1、通过第一网络701的特征提取层和第二网络702的特征提取层分别提取特征信息，然后通过高斯掩膜突出这两个网络的特征提取层提取的特征信息中关于目标物体的特征信息；再通过第一网络提取的关于目标物体的特征信息和第二网络提取的关于目标物体的特征信息确定特征损失，之后通过该特征损失指导第二网络702的特征提取层的训练。
2、第一网络701与第二网络702选取同样的区域提议集合,例如,通过共享区域提议网络(Region proposal network,RPN)的方式来使得该第一网络和第二网络均具有区域提议集合,因此第一网络701和第二网络702可以基于同样的区域提议生成软标签,然后基于第一网络701生成的软标签与第二网络702生成的软标签得到二值交叉熵损失(BCEloss),之后通过二值交叉熵损失(BCEloss)来指导第二网络702的分类层的训练。
3、采用渐进式蒸馏的方式指导第二网络702的训练。例如,假若上述第一网络701为一个101层(res101)的神经网络,第二网络702为一个50层(res50)的神经网络,那么基于上述第1和/或第2项的技术点通过第一网络701对第二网络702进行训练得到目标神经网络(可以标记为res101-50)之后,进一步通过第三神经网络对目标神经网络(res101-50)进行训练,通过第三神经网络对目标神经网络进行训练的原理,与通过第一网络701对第二网络702进行训练的原理相同,此处不再赘述。这里的第三神经网络是一个比第一网络701更大的神经网络,例如,该第三神经网络为一个152层(res152)的神经网络。
以上第1项、第2项、第3项的实现将在后续的方法实施例中做更详细的阐述。
该模型使用设备602 为需要对图像进行识别(或者说检测)的设备,例如,手持设备 (例如,手机、平板电脑、掌上电脑等)、车载设备(例如,汽车、自行车、电动车、飞机、船舶等)、可穿戴设备(例如智能手表(如iWatch等)、智能手环、计步器等)、智能家居设备(例如,冰箱、电视、空调、电表等)、智能机器人、车间设备,等等。
下面分别以模型使用设备602为汽车和手机为例进行举例说明。
汽车实现无人驾驶或者电脑驾驶,是目前非常流行的课题。随着经济的发展,全球汽车数量的不断增加,道路拥挤以及驾驶事故给人民财产和社会财产造成了很大的损失。人为因素是造成交通事故的主要因素,如何降低人为的失误,智能的避障以及合理的规划是提高驾驶安全系数的重要课题。自动驾驶的出现,给这一切带来了可能,它不需要人类操作即能感知周围的环境并导航。目前,全球各大公司都开始关注和开发自动驾驶系统,如谷歌、特斯拉、百度等。自动驾驶技术已成为各国争抢的战略制高点。由于相机设备具有价格便宜、使用方便等优势,构建以视觉感知为主的感知系统是很多公司的研发方向。如图8所示,是自动驾驶过程中车载相机采集的路面场景,汽车上的视觉感知系统充当人类的眼睛(即计算机视觉),通过检测网络(例如,后续方法实施例中提及的目标网络、第一网络、第二网络、第三网络、第四网络等,该检测网络也可以称作分类网络、或者检测模型,或者检测模块等)自动对相机采集的图像进行检测,从而确定汽车周围的物体和位置(即检测图像中的目标物体)。
例如,汽车通过检测网络识别图像中的物体,如果发现图像中有人,且距离汽车较近,那么汽车可以控制减速,或者控制停车,以避免造成人员伤亡;如果发现图像中有其他汽车,那么可以适当控制汽车的行驶速度,避免追尾;如果发现图像中有物体正在快速撞向汽车,那么可以控制器汽车通过移位或者变道等方式进行避让。
再如,汽车通过检测网络识别图像中的物体,如果发现路面有交通路线(如双黄线、单黄线、车道分界线等),那么可以对汽车的行驶状态进行预判,如果预判发现汽车可能会压线,那么可对汽车进行相应控制以避免压线;当然也可以在识别车道分界线及其位置的情况下,据此信息来决策如何进行变道,其余交通线路的控制依次类推。
再如,汽车通过检测网络识别图像中的物体,然后将识别出的某个物体作为标的来测算汽车的行驶速度、加速度、转弯角度等信息。
手机原本只是被作为一种通讯工具,方便人们的沟通。随着全球经济的发展,人们生活质量的提高,大家对手机的体验感以及性能的追求也越来越高。除了娱乐、导航、购物、拍照之外,检测和识别功能也受到了很大的关注。目前检测图像中的目标物体的识别技术已经在很多手机APP中应用,包括美图秀秀、魔漫相机、神拍手、Camera360、支付宝扫脸支付等等。开发者只需要调用授权的人脸检测、人脸关键点检测和人脸分析的移动端SDK包,就可以自动识别出照片、视频中的人脸身份(即检测图像中的目标物体)。如图9所示,是手机相机采集到的人脸图片。手机通过检测网络(例如,后续方法实施例中提及的目标网络,该检测网络也可以称作检测模型,或者检测模块,或者检测识别系统等)对图片中的物体进行定位和识别,找到人脸(还会获取人脸位置),同时判断图像中的人脸是否是某个特定人的人脸(即计算机视角),如果是就可以进行下一步的操作,如移动支付、人脸登录等多种功能。
设备（如汽车、手机等）通过计算机视觉感知周围的环境信息，从而基于该环境信息进行相应的智能控制，基于计算机视觉完成的智能控制是人工智能的一种实现。
请参见图10,图10是本申请实施例提供的一种模型训练方法的流程示意图,该方法可以基于图6所示的模型训练系统来实现,该方法包括但不限于如下步骤:
步骤S1001:模型训练设备分别通过第一网络的特征提取层和第二网络的特征提取层提取目标图像中的特征信息。
具体地,第一网络用于作为老师网络,第二网络用于作为学生网络,在第一网络指导第二网络训练的过程中,即模型蒸馏过程中,第一网络的特征提取层和第二网络的特征提取层针对同样的图像提取特征信息,为了方便描述,可以称该同样的图像可以称之为目标图像。
可选的,第一网络的层大于第二网络的层,例如,该第一网络可以为一个101层(res101)的神经网络,第二网络可以为一个50层(res50)的神经网络。
本申请实施例中的特征信息可以通过向量来表示,或者其他机器可识别的方式来表示。
步骤S1002:模型训练设备通过高斯掩膜突出第一特征信息中关于目标物体的特征。
本申请发明人发现，在基于神经网络的目标物体检测（即识别）过程中，检测性能的增益很大程度上来自于特征提取层（backbone层）提取的特征信息，因此对特征提取层的模拟是模型训练的重要环节。本申请实施例中，引入高斯掩膜实际就是突出目标图像中目标物体的特征，抑制目标物体以外的背景的特征的过程；实际也是突出对目标物体的响应，弱化边缘信息的过程。需要说明的是，采用高斯掩膜的方式不仅可以抑制目标物体所在的矩形框（一般为能够框住目标物体的最小矩形框）外的背景的特征，还可以抑制矩形框内除目标物体以外的背景的特征，因此最大限度地突出了目标物体的特征。
如图11所示,通过高斯掩膜可以突出目标图像中的人物(即目标物体),弱化人物以外的背景的特征,表现在高斯掩膜的模型数据上就是目标物体对应的特征信息在坐标中形成一个较大的突起,背景的信息在坐标中的高度接近于零,近似于处在一个高度为零的平面上。
为了便于理解，下面例举针对目标图像的高斯掩膜定义，具体如公式1-1所示：

M(x,y) = exp( − [ (x−x_0)² / (σ_x·w/2)² + (y−y_0)² / (σ_y·h/2)² ] )，(x,y)∈B；M(x,y) = 0，(x,y)∉B    公式1-1

公式1-1中，(x,y)为目标图像中的像素点的坐标，B为该目标图像中的目标物体的一个正例区域提议，该正例区域提议B的几何规格为w×h，正例区域提议B的中心点坐标为(x_0,y_0)，σ_x和σ_y分别为x轴和y轴上的衰减因子，可选的，为了方便起见，可以设置σ_x=σ_y。这个高斯掩膜只对目标真值框有效，框外的背景全部滤掉。当存在多个关于目标物体的正例区域提议时，目标物体中的像素点(x,y)可能存在多个高斯掩膜值（分别对应不同的正例区域提议），这时可选择该多个高斯掩膜值中的最大值作为该像素点(x,y)的高斯掩膜值M(x,y)，高斯掩膜值M(x,y)可以通过公式1-2表示：

M(x,y) = max( M_1(x,y), M_2(x,y), …, M_{N_p}(x,y) )    公式1-2

公式1-2中，N_p为该目标图像中的目标物体的正例区域提议的数量，M_1(x,y)是其中第一个正例区域提议中点(x,y)的高斯掩膜的值，M_2(x,y)是其中第二个正例区域提议中点(x,y)的高斯掩膜的值，M_{N_p}(x,y)是其中第N_p个正例区域提议中点(x,y)的高斯掩膜的值，其余依次类推。从公式1-2可以看出，目标物体中某个像素点(x,y)的高斯掩膜的值取多个值中的最大值。
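为便于理解，下面给出公式1-1和公式1-2的一个最小示意实现（采用numpy；函数名`gaussian_mask`、衰减因子的默认取值等均为本文举例所作的假设，并非本申请限定的实现方式）：

```python
import numpy as np

def gaussian_mask(W, H, boxes, sigma_x=2.0, sigma_y=2.0):
    """为 W×H 的目标图像计算高斯掩膜（公式1-1、公式1-2的示意实现）。

    boxes: 目标物体的正例区域提议列表，每项为 (x0, y0, w, h)，
           即中心点坐标与宽高；sigma_x/sigma_y 为 x、y 轴上的衰减因子。
    """
    ys, xs = np.mgrid[0:H, 0:W]  # 每个像素点 (x, y) 的坐标网格
    mask = np.zeros((H, W))
    for (x0, y0, w, h) in boxes:
        m = np.exp(-(((xs - x0) / (sigma_x * w / 2)) ** 2
                     + ((ys - y0) / (sigma_y * h / 2)) ** 2))
        # 公式1-1：掩膜只对真值框 B 内的像素有效，框外的背景全部滤掉（置0）
        inside = (np.abs(xs - x0) <= w / 2) & (np.abs(ys - y0) <= h / 2)
        m = np.where(inside, m, 0.0)
        # 公式1-2：存在多个正例区域提议时，逐像素取各掩膜值中的最大值
        mask = np.maximum(mask, m)
    return mask
```

可以看到，框中心处掩膜值接近1，向边缘逐渐衰减，框外为0，从而突出目标物体、抑制背景。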
为了便于描述,称通过第一网络的特征提取层提取到的所述目标图像中的特征信息为第一特征信息,称通过高斯掩膜突出的第一特征信息中关于目标物体的特征为第一局部特征。
步骤S1003:模型训练设备通过高斯掩膜突出第二特征信息中关于所述目标物体的特征。
为了便于描述,称通过第二网络的特征提取层提取到的所述目标图像中的特征信息为第二特征信息,称通过高斯掩膜突出的第二特征信息中关于目标物体的特征为第二局部特征。
步骤S1004:模型训练设备通过所述第一局部特征和所述第二局部特征确定特征损失。
本申请实施例中,上述第一局部特征为第一网络得到的针对目标图像中目标物体的特征,第二局部特征为第二网络得到的针对目标图像中目标物体的特征,第一局部特征与第二局部特征之间的差异,能够反映第一网络的特征提取层与第二网络的特征提取层之间的差异。本申请实施例中的特征损失(也称蒸馏损失)能够体现第二局部特征相较第一局部特征的差异。
可选的，第一局部特征可以表示为f^t，第二局部特征可以表示为f^s，可以通过公式1-3来计算该特征损失L_b：

L_b = (1/(A·C))·∑_{i=1}^{W} ∑_{j=1}^{H} ∑_{c=1}^{C} M_{ij}·(f^s_{ijc} − f^t_{ijc})²    公式1-3

本申请实施例中，A = ∑_{i=1}^{W} ∑_{j=1}^{H} M_{ij}，引入A是为了实现归一化操作；其中目标图像的规格为W×H，M_{ij}表示目标图像中一个像素点的高斯掩膜的值，i可以依次从1到W取值，j可以依次从1到H取值，目标图像中的任意一个像素点(i,j)的高斯掩膜的值M_{ij}可以通过上述公式1-1和公式1-2计算得到，此处不再赘述。另外，f^s_{ij}表示第二网络提取的像素点(i,j)的特征，f^t_{ij}表示第一网络提取的像素点(i,j)的特征；C代表第一网络、第二网络提取目标图像中的特征信息时特征图的通道数。
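公式1-3的计算可以示意如下（numpy草案；特征图形状的组织方式、函数名等均为本文示例假设）：

```python
import numpy as np

def feature_loss(f_s, f_t, mask):
    """按公式1-3计算特征损失 L_b。

    f_s: 第二网络（学生）提取的特征图，形状 (C, H, W)；
    f_t: 第一网络（老师）提取的特征图，形状 (C, H, W)；
    mask: 高斯掩膜 M，形状 (H, W)。
    """
    C = f_s.shape[0]
    A = mask.sum()                 # 归一化因子 A = ΣΣ M_ij
    sq_diff = (f_s - f_t) ** 2     # 逐像素、逐通道的特征差平方
    # 掩膜在通道维上广播，只保留目标物体区域的特征差异
    return float((mask * sq_diff).sum() / (A * C))
```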
步骤S1005:模型训练设备通过第一网络的分类层生成区域提议集合中的目标区域提议的分类预测值。
具体地，可以采用相应的算法或者策略来保证第一网络和第二网络均具有该区域提议集合，例如，第一网络和第二网络共享RPN来使得第一网络和第二网络均具有该区域提议集合。例如，共享的RPN中可以包括2000个区域提议，而该区域提议集合包括该2000个区域提议中的512个区域提议，可以在第一网络和第二网络中配置相同的检测器，使得第一网络和第二网络能够从2000个共享的区域提议中提取出同样的512个区域提议，即区域提议集合。图12示意了一种通过共享RPN以及配置同样的检测器来使得第一网络和第二网络选出同样的区域提议（proposal）的流程示意图，选出的全部区域提议统称为区域提议集合。
所述RPN为所述第一网络和所述第二网络共享的RPN,既可以是第二网络共享给第一网络的,也可以是第一网络共享给第二网络的,或者其他方式共享的。
在一种可选的方案中，目标区域提议为所述区域提议集合中的全部区域提议，即包括目标物体的正例区域提议和目标物体的负例区域提议。可选的，区域提议集合中哪些是正例区域提议，哪些是负例区域提议，可以是人为预先标记好的，也可以是机器自动标记好的；通常的划分标准是，如果某个区域提议与目标物体所在的矩形框（通常为框住该目标物体的最小矩形框）的重合度超过设定的参考阈值（例如，可设置为50%，或其他值），则可以将该区域提议归类为目标物体的正例区域提议，否则将该区域提议归类为目标物体的负例区域提议。
在又一种可选的方案中,目标区域提议为区域提议集合中属于所述目标物体的正例区域提议。
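正例/负例区域提议的划分可以示意如下（这里将上文的“重合度”按交并比IoU理解，阈值0.5对应文中50%的示例设置；函数名等均为举例假设）：

```python
def iou(box_a, box_b):
    """计算两个矩形框 (x1, y1, x2, y2) 的交并比（IoU）。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def split_proposals(proposals, gt_box, thresh=0.5):
    """与目标物体真值框重合度超过阈值的区域提议归为正例，其余归为负例。"""
    pos = [p for p in proposals if iou(p, gt_box) > thresh]
    neg = [p for p in proposals if iou(p, gt_box) <= thresh]
    return pos, neg
```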
本申请实施例中,可以称通过第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值为第一分类预测值以方便后续描述。
步骤S1006:模型训练设备通过第二网络的分类层生成区域提议集合中的目标区域提议的分类预测值。
本申请实施例中,可以称通过第二网络的分类层生成的区域提议集合中的目标区域提议的分类预测值为第二分类预测值以方便后续描述。
无论是上述第一分类预测值，还是上述第二分类预测值，其均用于表征对相应区域提议的分类倾向，或者说概率，例如，第一网络的分类层生成的区域提议1的分类预测值表征了该区域提议1内的物体被分类为人的概率为0.8，被分类为树的概率为0.1，被分类为汽车的概率为0.1。需要说明的是，不同的网络中的分类层针对同一个区域提议分类得到的分类预测值可能不同，因为不同网络的模型参数一般不同，它们的预测能力一般存在差异。
步骤S1007:模型训练设备根据第一分类预测值和第二分类预测值确定分类损失。
具体地，本申请实施例正是通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失，实际上是将第一预测值作为软标签来确定第二网络的分类层的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此模型训练效果更好。
在一种可选的方案中，该分类损失满足公式1-4所示的关系：

L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)    公式1-4

在公式1-4中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
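公式1-4的计算可以示意如下（numpy草案；其中交叉熵与二值交叉熵采用常见定义，变量名均为本文示例假设）：

```python
import numpy as np

def cross_entropy(p, y):
    """多分类交叉熵：p 为归一化的概率向量，y 为真值类别下标。"""
    return float(-np.log(p[y] + 1e-12))

def binary_cross_entropy(p_s, p_t):
    """以老师预测 p_t 为软标签、对学生预测 p_s 逐类别求和的二值交叉熵。"""
    p_s = np.clip(p_s, 1e-12, 1 - 1e-12)
    return float(-(p_t * np.log(p_s) + (1 - p_t) * np.log(1 - p_s)).sum())

def classification_loss(p_all_s, y_all, p_pos_s, p_pos_t, beta=1.0):
    """按公式1-4计算分类损失 L_cls。

    p_all_s: 学生对全部 K 个区域提议的分类预测；y_all: 对应真值标签；
    p_pos_s / p_pos_t: 学生/老师对 N_p 个正例区域提议的分类预测。
    """
    K, Np = len(p_all_s), len(p_pos_s)
    ce = sum(cross_entropy(p, y) for p, y in zip(p_all_s, y_all)) / K
    bce = sum(binary_cross_entropy(ps, pt)
              for ps, pt in zip(p_pos_s, p_pos_t)) / Np
    return float(ce + beta * bce)
```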
步骤S1008:模型训练设备根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
本申请实施例中提到的根据所述特征损失和所述分类损失训练所述第二网络，得到目标网络的含义是：目标网络是通过对第二网络进行训练得到的，训练过程中用到的参数包括但不限于该特征损失和该分类损失，即还可能用到这两项以外的其他参数；并且训练过程中可能仅用到了基于第一网络得到的该特征损失和该分类损失（即基于第一网络蒸馏第二网络得到目标网络），也可能不仅用到了基于第一网络得到的该特征损失和该分类损失，还用到了基于其他网络（一个或多个）得到的信息（即基于第一网络和其他网络蒸馏第二网络得到目标网络）。
可选的,在根据所述特征损失和所述分类损失训练所述第二网络的过程中,可以基于所述特征损失L b和所述分类损失L cls确定总损失L,然后通过总损失来训练该第二网络。可选的,也可以通过特征损失训练第二网络中的一部分模型参数(例如,特征提取层的模型参数),通过分类损失训练第二网络中的又一部分模型参数(例如,分类层的模型参数)。当然,还可以通过其他方式来使用特征损失和分类损失训练所述第二网络。
针对通过总损失L来训练第二网络的情况,下面例举两种可选的计算总损失的案例。
案例1,通过公式1-5计算总损失L。
L=δL b+L cls   公式1-5
在公式1-5中,δ是预设或者预先训练出的权重平衡因子。
案例2,根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;即第二网络在不依赖第一网络的情况下,训练得到回归损失L reg和RPN损失L rpn,然后结合回归损失L reg、RPN损失L rpn、上述特征损失L b和分类损失L cls得到总损失L,可选的,总损失L的计算方式如公式1-6所示。
L=δL b+L cls+L reg+L rpn     公式1-6
本申请实施例中,确定分类损失和特征损失的先后顺序不做限定,可以同时执行,也可以确定分类损失的流程先执行,还可以确定特征损失的流程先执行。
在一种可选的方案中，模型训练设备根据所述特征损失和所述分类损失训练所述第二网络，得到的还不是最终的目标网络，而是一个中间网络，后续还要通过其他网络（如第三网络）再对该中间网络进行训练，这个过程可以看作是针对第二网络采用渐进式蒸馏。原理如下：在基于上述第一网络对第二网络进行蒸馏（即根据第一网络和第二网络得到特征损失和分类损失，然后根据特征损失和分类损失对第二网络进行训练）得到中间网络后，还可以进一步通过一个比第一网络的层更多的第三网络对该中间网络再次进行蒸馏，通过第三网络对中间网络进行蒸馏的原理，与通过第一网络对第二网络进行蒸馏的原理相同；后续还可以逐步使用层数更多的网络对新训练出的网络做进一步蒸馏，直至对第二网络的蒸馏达到预期目标，从而得到目标网络。
例如,假若上述第一网络701为一个101层(res101)的神经网络,第二网络702为一个50层(res50)的神经网络,那么基于上述第1和/或第2项的技术点通过第一网络701对第二网络702进行蒸馏得到中间神经网络(可以标记为res101-50)之后,进一步通过第三神经网络对中间神经网络(res101-50)进行蒸馏(即依次通过第一网络701和第三网络对第二网络702进行蒸馏),通过第三神经网络对中间神经网络进行蒸馏的原理,与通过第一网络701对第二网络702进行蒸馏的原理相同,此处不再赘述。第三神经网络是一个比第一网络701更大的神经网络,例如,一个152层(res152)的神经网络。
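上述渐进式蒸馏的流程可以示意如下（`distill_fn`代表基于特征损失、分类损失等完成一轮蒸馏的过程，此处仅为接口示意草案，并非本申请限定的实现）：

```python
def progressive_distill(student, teachers, distill_fn):
    """渐进式蒸馏：按老师网络层数由少到多的顺序依次蒸馏学生网络。

    teachers: 按层数从少到多排列的老师网络序列，如 [res101, res152]；
    distill_fn(teacher, student): 用 teacher 蒸馏 student，返回蒸馏后的网络。
    """
    for teacher in teachers:
        student = distill_fn(teacher, student)
    return student
```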
需要说明的是,在图10所示的方法实施例中,可以采用上述例举的三个技术点中的一个,或者两个,或者三个。例如,可以采用渐进式蒸馏的技术点(上述三个技术点中的第1个),但是针对具体如何提取特征,以及具体如何使用区域提议不做特殊限定(即对是否使用上述三个技术点中的第2个,第3个不做限定)。
步骤S1009:模型训练设备向模型使用设备发送目标网络。
步骤S1010:模型使用设备接收模型训练设备发送的目标网络。
具体地,该模型使用设备接收到该目标网络后,通过该目标网络来预测(或者说检测,或者说估计)图像中的内容(即识别图像中的目标),例如,识别图像中的是否存在人脸,当存在时人脸在图像中的位置具体是什么;或者识别图像中是否存在道路障碍物,当存在时障碍物在图像中的位置是什么,等等。具体使用场景可参照图6所示的架构中关于模型使用设备602的介绍。
为了验证上述实施例的效果,本申请发明人在两个标准的检测数据集上进行了验证,这两个检测数据集为:COCO2017数据集、BDD100k数据集。其中COCO数据集包含了80个物体类别,11万张训练图片和5000张验证图片,BDD100k数据集包含了10个类别,一共具有10万张图片。对这两个数据集,都采用coco的评估标准进行评估,即类别平均准确度(mAP)。
表1展示了不同的蒸馏策略方案,其中的层数分别为res18,res50,res101,res152的网络(或说模型)都已经在COCO数据集上进行了预训练,只需采用上述实施例进行蒸馏即可。
表1 不同的蒸馏策略
策略编号 老师网络 学生网络 蒸馏后的学生网络
1 res50 res18 res50-18
2 res101 res18 res101-18
3 res101 res50-18 res101-50-18
4 res152 res101-50-18 res152-101-50-18
在表1中,res50-18表示通过50层的网络对18层的网络进行蒸馏后得到的网络;res101-18表示通过101层的网络对18层的网络进行蒸馏后得到的网络;res101-50-18表示通过101层的网络对蒸馏后得到的网络res101-18进一步蒸馏所得到的网络;res152-101-50-18表示通过152层的网络对蒸馏后得到的网络res101-50-18进一步蒸馏所得到的网络。可选的,网络res50可以认为是上述第一网络,网络res18可以认为是上述第二网络,网络101可以认为是上述第三网络,网络res152可以认为是第四网络,该第四网络是一个比第三网络的层数更多的网络。在第二网络依次经过第一网络、第三网络、第四网络蒸馏之后得到的网络res152-101-50-18就可以发送给上述模型使用设备来对图像中的目标进行检测。
表2展示了不同网络在COCO数据集上的评估结果,其中,网络Res50-18检测的准确度相比于原始网络res18有了明显提高,提升了2.8个百分点,网络res101-18检测的准确度比网络res18的准确度提升了3.2个点,采用渐进式的蒸馏方法得到的网络res101-50-18比单次蒸馏得到的网络re50-18有了进一步的提升,值得一提的是,网络res152-101-50-18检测的准确度相比于网络res18提升得特别多,有4.4个百分点,并且蒸馏后的mAP达到了0.366,已经超越了网络res50检测的准确度0.364。换言之,虽然网络res18比网络res50具有更少的网络层数,且原始mAP相差的很多,有4.2个百分点,但是本申请实施例的方法通过对网络res18进行渐进式蒸馏,使得蒸馏后的网络res18超越了网络res50的性能。
表2 在COCO数据集上不同网络的性能评估结果
网络 MAP AP50 AP75 Aps(小) Apm(中) Apl(大)
res18 0.322 0.534 0.34 0.183 0.354 0.411
res50 0.364 0.581 0.393 0.216 0.398 0.468
res101 0.385 0.602 0.417 0.225 0.429 0.492
res152 0.409 0.622 0.448 0.241 0.453 0.532
res50-18 0.35 0.56 0.373 0.187 0.384 0.459
res101-18 0.354 0.563 0.38 0.186 0.387 0.473
res101-50-18 0.358 0.567 0.387 0.184 0.392 0.479
res152-101-50-18 0.366 0.574 0.396 0.184 0.399 0.5
在表2中,MAP为平均精度均值,AP50为检测评价函数(Intersection over Union,IOU)大于0.5时的精度均值,AP75为IOU大于0.75时的精度均值,Aps为小物体的精度均值,Apm为中物体的精度均值,Apl为大物体的精度均值。
如表3所示。分别采用网络res50和网络res101作为老师网络,网络res18作为学生网络,+1表示只采用上述第1项技术点(即采用高斯掩膜突出目标物体),+2表示采用上述第2项技术点(即选取同样的区域提议集合),+3表示采用上述第3项技术点(即渐进式蒸馏)。其中网络res18(+1)和网络res18(+1+2)中的学生网络res18是未在COCO数据集上进行预训练的,res18(+1+2+3)的学生网络res18是在COCO上进行了预训练,有利于拉近它与老师网络之间的差异距离,相当于渐进式蒸馏的一种方案。可以看到,随着不断地改进,蒸馏的效果在逐步提升,也证明了上述三项技术点的有效性。
表3 每个技术点的实验效果
  res18 res18(+1) res18(+1+2) res18(+1+2+3)
老师网络res50 0.322 0.342 0.344 0.35
老师网络res101 0.322 0.347 0.349 0.354
为了验证本申请在不同数据上的适用性,在BDD100k数据集上也进行了对比实验,结果如表4所示。原始网络res18(作为学生网络)与网络res50(作为老师网络)之间有2.1个百分点的mAP(准确度)差距,采用本申请实施例的方法进行蒸馏后,蒸馏后得到的网络res50-18比原始的网络res18的检测mAP(准确度)提升了1.5个百分点,与老师网络res50只差0.6个百分点,弥补了近75%的mAP(准确度)差距,效果非常明显。
表4 BDD100k数据集上的性能评估结果
网络 mAP AP50 AP75 Aps Apm Apl
res18 0.321 0.619 0.289 0.159 0.374 0.513
res50 0.342 0.644 0.314 0.171 0.399 0.546
res50-18 0.336 0.636 0.309 0.166 0.393 0.539
在图10所描述的方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
进一步地，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由于这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地得到第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
进一步地,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
上述详细阐述了本申请实施例的方法,下面提供了本申请实施例的装置。
请参见图13,图13是本申请实施例提供的一种模型训练装置130的结构示意图,该模型训练装置130可以为上述方法实施例中的模型训练设备或者该模型训练设备中的器件,该模型训练装置130可以包括特征提取单元1301、第一优化单元1302、第二优化单元1303、第一确定单元1304和权重调整单元1305,其中,各个单元的详细描述如下。
特征提取单元1301,用于通过第一网络的特征提取层提取目标图像中的第一特征信息;
所述特征提取单元1301,还用于通过第二网络的特征提取层提取目标图像中的第二特征信息,其中,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
第一优化单元1302,用于通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
第二优化单元1303,用于通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
第一确定单元1304,用于通过所述第一局部特征和所述第二局部特征确定特征损失;
权重调整单元1305,用于根据所述特征损失训练所述第二网络,得到目标网络。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,所述装置还包括:
第一生成单元,用于通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
第二生成单元,用于通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述权重调整单元具体用于:根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中，在根据所述特征损失训练所述第二网络，得到目标网络方面，所述权重调整单元具体用于：
根据所述特征损失训练所述第二网络;
通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中，所述目标区域提议为所述区域提议集合中的全部区域提议，或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
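作为理解上述分类损失的示意，下面给出一个极简的数值实现（仅为说明性草案：交叉熵与二值交叉熵取常见的离散定义，属于示例性假设，并非本申请限定的具体形式）：

```python
import numpy as np

def cross_entropy(p, y):
    # L_CE：p为第二网络对某区域提议预测的类别概率分布，y为该提议的真值类别下标
    return -float(np.log(p[y] + 1e-12))

def binary_cross_entropy(ps, pt):
    # L_BCE：以第一网络(老师)的预测pt为软目标，逐元素约束第二网络(学生)的预测ps
    ps = np.clip(ps, 1e-12, 1.0 - 1e-12)
    return float(np.mean(-(pt * np.log(ps) + (1.0 - pt) * np.log(1.0 - ps))))

def classification_loss(student_all, labels, student_pos, teacher_pos, beta=1.0):
    # L_cls = (1/K)·Σ L_CE(p_m^s, y_m) + β·(1/N_p)·Σ L_BCE(p_n^s, p_n^t)
    K, Np = len(student_all), len(student_pos)
    ce = sum(cross_entropy(p, y) for p, y in zip(student_all, labels)) / K
    bce = sum(binary_cross_entropy(ps, pt) for ps, pt in zip(student_pos, teacher_pos)) / Np
    return ce + beta * bce

# 示例：K=2个区域提议（其中N_p=1个正例区域提议），2个类别
student_all = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
labels = [0, 1]
student_pos = [np.array([0.7, 0.3])]   # 学生网络对正例区域提议的第二分类预测值
teacher_pos = [np.array([0.9, 0.1])]   # 老师网络对同一正例区域提议的第一分类预测值
loss = classification_loss(student_all, labels, student_pos, teacher_pos, beta=0.5)
```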
在一种可能的实现方式中,所述装置还包括:
第三确定单元,用于根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
所述权重调整单元具体用于:根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
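作为示意，特征损失、分类损失、回归损失与RPN损失的组合方式可写成如下极简草案（等权相加以及weights参数均为示例性假设，本申请未限定具体的加权方式）：

```python
def total_loss(l_feat, l_cls, l_reg, l_rpn, weights=(1.0, 1.0, 1.0, 1.0)):
    # 将特征损失、分类损失、回归损失与RPN损失加权求和，作为训练第二网络的总损失
    return sum(w * l for w, l in zip(weights, (l_feat, l_cls, l_reg, l_rpn)))
```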
在一种可能的实现方式中,所述装置还包括:
发送单元,用于在所述权重调整单元根据所述特征损失训练所述第二网络,得到目标网络之后,向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
需要说明的是，各个单元的实现及有益效果还可以对应参照图10所示的方法实施例的相应描述。
请参见图14,图14是本申请实施例提供的一种模型训练装置140的结构示意图,该模型训练装置140可以为上述方法实施例中的模型训练设备或者该模型训练设备中的器件,该模型训练装置140可以包括第一训练单元1401和第二训练单元1402,其中,各个单元的详细描述如下。
第一训练单元1401,用于基于第一网络训练第二网络得到中间网络;
第二训练单元1402,用于基于第三网络训练所述中间网络,得到目标网络,其中,所述第一网络、所述第二网络和所述第三网络均为分类网络,且所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述基于第一网络训练第二网络得到中间网络包括:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,所述装置还包括:
第一生成单元,用于通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
第二生成单元,用于通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述根据所述特征损失训练所述第二网络,得到所述中间网络,具体为:
根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中,所述装置还包括:
第三确定单元,用于根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
所述权重调整单元具体用于:根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
在一种可能的实现方式中,该装置还包括:
发送单元,用于在所述权重调整单元根据所述特征损失训练所述第二网络,得到目标网络之后,向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
需要说明的是,各个单元的实现及有益效果还可以对应参照图10所示的方法实施例的相应描述。
请参见图15，图15是本申请实施例提供的一种图像检测装置150的结构示意图，该图像检测装置150可以为上述方法实施例中的模型使用设备或者该模型使用设备中的器件，该图像检测装置150可以包括获取单元1501和识别单元1502，其中，各个单元的详细描述如下。
获取单元1501,用于获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
识别单元1502,用于通过所述目标网络识别图像中的内容。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,训练所述第二网络用到的参数还包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中:训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
在一种可能的实现方式中,所述获取单元具体用于:
接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
需要说明的是,各个单元的实现及有益效果还可以对应参照图10所示的方法实施例的相应描述。
请参见图16,图16是本申请实施例提供的一种图像检测装置160的结构示意图,该图像检测装置160可以为上述方法实施例中的模型使用设备或者该模型使用设备中的器件,该图像检测装置160可以包括获取单元1601和识别单元1602,其中,各个单元的详细描述如下。
获取单元1601,用于获取目标网络,其中,所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络,所述多个网络均为分类网络,所述多个网络至少包括第一网络和第三网络,所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度;
识别单元1602,用于通过所述目标网络识别图像中的内容。
在上述方法中，在通过第一网络对第二网络进行训练之后，进一步使用层更多的第三网络对已训练的第二网络做进一步训练，能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括特征损失,其中,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中，所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中:训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
在一种可能的实现方式中,所述获取单元具体用于:
接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
需要说明的是,各个单元的实现及有益效果还可以对应参照图10所示的方法实施例的相应描述。
请参见图17,图17是本申请实施例提供的一种模型训练设备170的结构示意图,该模型训练设备170包括处理器1701、存储器1702和通信接口1703,所述处理器1701、存储器1702和通信接口1703通过总线相互连接。
存储器1702包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM)，该存储器1702用于存储相关计算机程序及数据。通信接口1703用于接收和发送数据。
处理器1701可以是一个或多个中央处理器(central processing unit,CPU),在处理器1701是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该模型训练设备170中的处理器1701用于读取所述存储器1702中存储的计算机程序代码,执行以下操作:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到目标网络。
在上述方法中，通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征，以及突出第二网络提取的特征信息中关于目标物体的局部特征，然后根据两网络中关于目标物体的局部特征确定特征损失，后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声)，在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异，因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述处理器还用于:
通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
根据所述第一分类预测值和所述第二分类预测值确定分类损失;
所述根据所述特征损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,在根据所述特征损失训练所述第二网络,得到目标网络方面,所述处理器具体用于:
根据所述特征损失训练所述第二网络;
通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中,所述处理器还用于:
根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
在一种可能的实现方式中,在根据所述特征损失训练所述第二网络,得到目标网络之后,所述处理器还用于:
通过通信接口1703向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
需要说明的是,各个操作的实现还可以对应参照图10所示的方法实施例的相应描述。
请参见图18,图18是本申请实施例提供的一种模型训练设备180的结构示意图,该模型训练设备180包括处理器1801、存储器1802和通信接口1803,所述处理器1801、存储器1802和通信接口1803通过总线相互连接。
存储器1802包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM)，该存储器1802用于存储相关计算机程序及数据。通信接口1803用于接收和发送数据。
处理器1801可以是一个或多个中央处理器(central processing unit,CPU),在处理器1801是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该模型训练设备180中的处理器1801用于读取所述存储器1802中存储的计算机程序代码,执行以下操作:
基于第一网络训练第二网络得到中间网络;本申请各个实施例中,基于第一网络训练第二网络实质就是通过第一网络蒸馏第二网络,以及基于第三网络训练已被第一网络训练的第二网络实质就是通过第三网络蒸馏已被第一网络蒸馏的第二网络,此处统一说明。
基于第三网络训练所述中间网络,得到目标网络,其中,所述第一网络、所述第二网络和所述第三网络均为分类网络,且所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,在基于第一网络训练第二网络得到中间网络方面,所述处理器具体用于:
通过第一网络的特征提取层提取目标图像中的第一特征信息;
通过第二网络的特征提取层提取目标图像中的第二特征信息;
通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
通过所述第一局部特征和所述第二局部特征确定特征损失;
根据所述特征损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,所述处理器1801还用于:
通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
根据所述第一分类预测值和所述第二分类预测值确定分类损失;
在根据所述特征损失训练所述第二网络,得到所述中间网络方面,所述处理器具体用于:
根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中,所述方法还包括:
根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络,包括:
根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
在一种可能的实现方式中,所述根据所述特征损失训练所述第二网络,得到目标网络之后,还包括:
通过通信接口1803向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
需要说明的是,各个操作的实现还可以对应参照图10所示的方法实施例的相应描述。
请参见图19,图19是本申请实施例提供的一种模型使用设备190的结构示意图,该模型使用设备190也可以称为图像检测设备或者其他名称,该模型使用设备190包括处理器1901、存储器1902和通信接口1903,所述处理器1901、存储器1902和通信接口1903通过总线相互连接。
存储器1902包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM)，该存储器1902用于存储相关计算机程序及数据。通信接口1903用于接收和发送数据。
处理器1901可以是一个或多个中央处理器(central processing unit,CPU),在处理器1901是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该模型使用设备190中的处理器1901用于读取所述存储器1902中存储的计算机程序代码，执行以下操作：
获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
通过所述目标网络识别图像中的内容。
在上述方法中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,训练所述第二网络用到的参数还包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
在该可能的实现方式中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中，所述目标区域提议为所述区域提议集合中的全部区域提议，或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中,训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
在一种可能的实现方式中,在获取目标网络方面,所述处理器具体用于:
通过通信接口1903接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
需要说明的是,各个操作的实现还可以对应参照图10所示的方法实施例的相应描述。
请参见图20,图20是本申请实施例提供的一种模型使用设备200的结构示意图,该模型使用设备200也可以称为图像检测设备或者其他名称,该模型使用设备200包括处理器2001、存储器2002和通信接口2003,所述处理器2001、存储器2002和通信接口2003通过总线相互连接。
存储器2002包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM)，该存储器2002用于存储相关计算机程序及数据。通信接口2003用于接收和发送数据。
处理器2001可以是一个或多个中央处理器(central processing unit,CPU),在处理器2001是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该模型使用设备200中的处理器2001用于读取所述存储器2002中存储的计算机程序代码,执行以下操作:
获取目标网络，其中，所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络，所述多个网络均为分类网络，所述多个网络至少包括第一网络和第三网络，所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练，其中，所述第三网络的深度大于所述第一网络的深度，所述第一网络的深度大于所述第二网络的深度；
通过所述目标网络识别图像中的内容。
在上述方法中,在通过第一网络对第二网络进行训练之后,进一步使用层更多的第三网络对已训练的第二网络做进一步训练,能够稳定提升第二网络的性能。
在一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括特征损失,其中,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
在该可能的实现方式中,通过高斯掩膜突出第一网络提取的特征信息中关于目标物体的局部特征,以及突出第二网络提取的特征信息中关于目标物体的局部特征,然后根据两网络中关于目标物体的局部特征确定特征损失,后续基于该特征损失对第二网络进行训练。通过高斯掩膜滤掉了图像的背景噪声(包括目标物体的方框外的背景噪声和方框内的背景噪声),在此基础上得到的特征损失更能够反映出第二网络与第一网络的差异,因此基于该特征损失对第二网络进行训练能够使得第二网络对特征的表达更趋近于第一网络,模型蒸馏效果很好。
在一种可能的实现方式中,所述第一网络对第二网络进行训练时用到的参数包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
在该可能的实现方式中，通过选取同样的区域提议集合的方式，使得第一网络的分类层和第二网络的分类层基于同样的区域提议来生成分类预测值，在所基于的区域提议相同的情况下，两个网络生成的预测值的差异一般就是由这两个网络的模型参数的差异导致的，因此本申请实施例基于第一预测值与第二预测值的差异确定用于训练第二网络的分类损失；通过这种方式可以最大程度地获得第二网络相对于第一网络的损失，因此基于分类损失对第二网络进行训练能够使得第二网络的分类结果更趋近于第一网络，模型蒸馏效果很好。
在一种可能的实现方式中,所述第一网络和所述第二网络通过共享区域提议网络RPN的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
在一种可能的实现方式中,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
在一种可能的实现方式中,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
在一种可能的实现方式中，所述分类损失L_cls满足如下关系：
L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
在一种可能的实现方式中,训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
在一种可能的实现方式中,所述获取目标网络,包括:
接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
需要说明的是,各个操作的实现还可以对应参照图10所示的方法实施例的相应描述。
本申请实施例还提供一种芯片系统，所述芯片系统包括至少一个处理器、存储器和接口电路，所述存储器、所述接口电路和所述至少一个处理器通过线路互联，所述存储器中存储有计算机程序；所述计算机程序被所述处理器执行时，图10所示的方法流程得以实现。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在处理器上运行时,图10所示的方法流程得以实现。
本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在处理器上运行时,图10所示的方法流程得以实现。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，该流程可以通过计算机程序来指令相关的硬件完成，该计算机程序可存储于计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法实施例的流程。而前述的存储介质包括：ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储计算机程序代码的介质。

Claims (51)

  1. 一种模型训练方法,其特征在于,包括:
    通过第一网络的特征提取层提取目标图像中的第一特征信息;
    通过第二网络的特征提取层提取目标图像中的第二特征信息,其中,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
    通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
    通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
    通过所述第一局部特征和所述第二局部特征确定特征损失;
    根据所述特征损失训练所述第二网络,得到目标网络。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
    通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
    根据所述第一分类预测值和所述第二分类预测值确定分类损失;
    所述根据所述特征损失训练所述第二网络,得到目标网络,包括:
    根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
  3. 根据权利要求2所述的方法,其特征在于,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
  4. 根据权利要求3所述的方法,其特征在于,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
  5. 根据权利要求2-4任一项所述的方法,其特征在于,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
  6. 根据权利要求2-5任一项所述的方法，其特征在于，所述分类损失L_cls满足如下关系：
    L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
    其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
  7. 根据权利要求2-6任一项所述的方法,其特征在于,所述方法还包括:
    根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
    根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络,包括:
    根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
  8. 根据权利要求1所述的方法,其特征在于,所述根据所述特征损失训练所述第二网络,得到目标网络,包括:
    根据所述特征损失训练所述第二网络;
    通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,所述根据所述特征损失训练所述第二网络,得到目标网络之后,还包括:
    向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
  10. 一种模型训练方法,其特征在于,包括:
    基于第一网络训练第二网络得到中间网络;
    基于第三网络训练所述中间网络,得到目标网络,其中所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
  11. 根据权利要求10所述的方法,其特征在于,所述基于第一网络训练第二网络得到中间网络包括:
    通过第一网络的特征提取层提取目标图像中的第一特征信息;
    通过第二网络的特征提取层提取目标图像中的第二特征信息;
    通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
    通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
    通过所述第一局部特征和所述第二局部特征确定特征损失;
    根据所述特征损失训练所述第二网络,得到所述中间网络。
  12. 根据权利要求11所述的方法，其特征在于，所述第一网络、所述第二网络和所述第三网络均为分类网络，所述方法还包括：
    通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
    通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
    根据所述第一分类预测值和所述第二分类预测值确定分类损失;
    所述根据所述特征损失训练所述第二网络,得到所述中间网络,包括:
    根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
  13. 一种图像检测方法,其特征在于,包括:
    获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
    通过所述目标网络识别图像中的内容。
  14. 根据权利要求13所述的方法,其特征在于,训练所述第二网络用到的参数还包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
  15. 根据权利要求14所述的方法,其特征在于,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
  16. 根据权利要求15所述的方法,其特征在于,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
  17. 根据权利要求14-16任一项所述的方法,其特征在于,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
  18. 根据权利要求14-17任一项所述的方法，其特征在于，所述分类损失L_cls满足如下关系：
    L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
    其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
  19. 根据权利要求14-18任一项所述的方法,其特征在于:
    训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
  20. 根据权利要求13-19任一项所述的方法,其特征在于,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
  21. 根据权利要求13-20任一项所述的方法,其特征在于,所述获取目标网络,包括:
    接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
  22. 一种图像检测方法,其特征在于,包括:
    获取目标网络,其中,所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络,所述多个网络均为分类网络,所述多个网络至少包括第一网络和第三网络,所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度;
    通过所述目标网络识别图像中的内容。
  23. 根据权利要求22所述的方法，其特征在于，所述第一网络对第二网络进行训练时用到的参数包括特征损失，其中，所述特征损失为根据第一局部特征和第二局部特征确定的，所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征，所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征，所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息，所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
  24. 根据权利要求22或23所述的方法,其特征在于,所述第一网络对第二网络进行训练时用到的参数包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
  25. 一种模型训练装置,其特征在于,包括:
    特征提取单元,用于通过第一网络的特征提取层提取目标图像中的第一特征信息;
    所述特征提取单元,还用于通过第二网络的特征提取层提取目标图像中的第二特征信息,其中,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
    第一优化单元,用于通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
    第二优化单元，用于通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征，得到第二局部特征；
    第一确定单元,用于通过所述第一局部特征和所述第二局部特征确定特征损失;
    权重调整单元,用于根据所述特征损失训练所述第二网络,得到目标网络。
  26. 根据权利要求25所述的装置,其特征在于,所述装置还包括:
    第一生成单元,用于通过所述第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
    第二生成单元,用于通过所述第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
    第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
    所述权重调整单元具体用于:根据所述特征损失和所述分类损失训练所述第二网络,得到目标网络。
  27. 根据权利要求26所述的装置,其特征在于,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
  28. 根据权利要求27所述的装置,其特征在于,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
  29. 根据权利要求26-28任一项所述的装置,其特征在于,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
  30. 根据权利要求26-29任一项所述的装置，其特征在于，所述分类损失L_cls满足如下关系：
    L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
    其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
  31. 根据权利要求26-30任一项所述的装置,其特征在于,所述装置还包括:
    第三确定单元,用于根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值,确定所述第二网络的回归损失和RPN损失;
    所述权重调整单元具体用于:根据所述特征损失、所述分类损失、所述回归损失和所述RPN损失训练所述第二网络,得到目标网络。
  32. 根据权利要求25所述的装置，其特征在于，在根据所述特征损失训练所述第二网络，得到目标网络方面，所述权重调整单元具体用于：
    根据所述特征损失训练所述第二网络;
    通过第三网络对经过训练后的所述第二网络进行训练,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度。
  33. 根据权利要求25-32任一项所述的装置,其特征在于,还包括:
    发送单元,用于在所述权重调整单元根据所述特征损失训练所述第二网络,得到目标网络之后,向模型使用设备发送所述目标网络,其中所述目标网络用于预测图像中的内容。
  34. 一种模型训练装置,其特征在于,包括:
    第一训练单元,用于基于第一网络训练第二网络得到中间网络;
    第二训练单元,用于基于第三网络训练所述中间网络,得到目标网络,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度。
  35. 根据权利要求34所述的装置,其特征在于,所述基于第一网络训练第二网络得到中间网络包括:
    通过第一网络的特征提取层提取目标图像中的第一特征信息;
    通过第二网络的特征提取层提取目标图像中的第二特征信息;
    通过高斯掩膜提取所述第一特征信息中关于目标物体的特征,得到第一局部特征;
    通过高斯掩膜提取所述第二特征信息中关于所述目标物体的特征,得到第二局部特征;
    通过所述第一局部特征和所述第二局部特征确定特征损失;
    根据所述特征损失训练所述第二网络,得到所述中间网络。
  36. 根据权利要求35所述的装置,其特征在于,所述第一网络、所述第二网络和所述第三网络均为分类网络,所述装置还包括:
    第一生成单元,用于通过第一网络的分类层生成区域提议集合中的目标区域提议的第一分类预测值;
    第二生成单元,用于通过第二网络的分类层生成所述区域提议集合中的所述目标区域提议的第二分类预测值;
    第二确定单元,用于根据所述第一分类预测值和所述第二分类预测值确定分类损失;
    所述根据所述特征损失训练所述第二网络,得到所述中间网络,具体为:
    根据所述特征损失和所述分类损失训练所述第二网络,得到所述中间网络。
  37. 一种图像检测装置,其特征在于,包括:
    获取单元,用于获取目标网络,其中,所述目标网络为通过第一网络对第二网络进行训练后得到的网络,通过所述第一网络训练所述第二网络用到的参数包括特征损失,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息,所述第一网络和所述第二网络均为分类网络,且所述第一网络的深度大于所述第二网络的深度;
    识别单元,用于通过所述目标网络识别图像中的内容。
  38. 根据权利要求37所述的装置，其特征在于，训练所述第二网络用到的参数还包括分类损失，其中，所述分类损失为根据第一分类预测值和第二分类预测值确定的，所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值，所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
  39. 根据权利要求38所述的装置,其特征在于,所述第一网络和所述第二网络通过共享区域提议网络(RPN)的方式使得所述第一网络和所述第二网络均具有所述区域提议集合。
  40. 根据权利要求39所述的装置,其特征在于,所述RPN为所述第二网络共享给所述第一网络的,或者为所述第一网络共享给所述第二网络的。
  41. 根据权利要求38-40任一项所述的装置,其特征在于,所述目标区域提议为所述区域提议集合中的全部区域提议,或者为所述区域提议集合中属于所述目标物体的正例区域提议。
  42. 根据权利要求38-41任一项所述的装置，其特征在于，所述分类损失L_cls满足如下关系：
    L_cls = (1/K)·∑_{m=1}^{K} L_CE(p_m^s, y_m) + β·(1/N_p)·∑_{n=1}^{N_p} L_BCE(p_n^s, p_n^t)
    其中，K为所述区域提议集合中区域提议的总数，N_p为所述区域提议集合中属于所述目标物体的正例区域提议的总数，p_m^s为所述第二网络的分类层对所述区域提议集合中第m个区域提议预测的分类预测值，y_m为所述区域提议集合中第m个区域提议对应的真值标签，p_n^s为所述第二网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第二分类预测值，p_n^t为所述第一网络的分类层对所述区域提议集合中第n个属于所述目标物体的正例区域提议预测的第一分类预测值，L_CE(p_m^s, y_m)表示基于p_m^s和y_m得到的交叉熵损失，L_BCE(p_n^s, p_n^t)表示基于p_n^s和p_n^t得到的二值交叉熵损失，β为预设的权重平衡因子。
  43. 根据权利要求38-42任一项所述的装置,其特征在于:
    训练所述第二网络用到的参数还包括所述第二网络的回归损失和RPN损失,其中,所述第二网络的回归损失和RPN损失为根据所述目标图像中的区域提议的真值标签和所述第二网络对所述目标图像中的区域提议预测的预测值确定的。
  44. 根据权利要求37-43任一项所述的装置,其特征在于,所述目标网络具体为通过所述第一网络对第二网络进行训练,并通过第三网络对训练得到的网络进一步进行训练之后的网络,其中,所述第三网络的深度大于所述第一网络的深度。
  45. 根据权利要求37-44任一项所述的装置,其特征在于,所述获取单元具体用于:
    接收模型训练设备发送的目标网络,其中所述模型训练设备用于训练得到所述目标网络。
  46. 一种图像检测装置,其特征在于,包括:
    获取单元,用于获取目标网络,其中,所述目标网络为通过多个网络迭代对第二网络进行训练得到的网络,所述多个网络均为分类网络,所述多个网络至少包括第一网络和第三网络,所述第三网络用于在所述第一网络对第二网络进行训练得到中间网络后对所述中间网络进行训练,其中,所述第三网络的深度大于所述第一网络的深度,所述第一网络的深度大于所述第二网络的深度;
    识别单元,用于通过所述目标网络识别图像中的内容。
  47. 根据权利要求46所述的装置,其特征在于,所述第一网络对第二网络进行训练时用到的参数包括特征损失,其中,所述特征损失为根据第一局部特征和第二局部特征确定的,所述第一局部特征为通过高斯掩膜从第一特征信息中提取的关于目标物体的特征,所述第二局部特征为通过高斯掩膜从第二特征信息中提取的关于所述目标物体的特征,所述第一特征信息为通过所述第一网络的特征提取层提取到的目标图像中的特征信息,所述第二特征信息为通过所述第二网络的特征提取层提取到的所述目标图像中的特征信息。
  48. 根据权利要求46或47所述的装置,其特征在于,所述第一网络对第二网络进行训练时用到的参数包括分类损失,其中,所述分类损失为根据第一分类预测值和第二分类预测值确定的,所述第一分类预测值为通过所述第一网络的分类层生成的区域提议集合中的目标区域提议的分类预测值,所述第二分类预测值为通过所述第二网络的分类层生成的所述区域提议集合中的所述目标区域提议的分类预测值。
  49. 一种模型训练设备,其特征在于,包括处理器和存储器,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序来执行权利要求1-12任一项所述的方法。
  50. 一种模型使用设备,其特征在于,包括处理器和存储器,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序来执行权利要求13-24任一项所述的方法。
  51. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质用于存储计算机程序,当所述计算机程序在处理器上运行时,实现权利要求1-24任一项所述的方法。
PCT/CN2021/088787 2020-05-15 2021-04-21 一种模型训练方法及相关设备 WO2021227804A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/986,081 US20230075836A1 (en) 2020-05-15 2022-11-14 Model training method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010412910.6A CN113673533A (zh) 2020-05-15 2020-05-15 一种模型训练方法及相关设备
CN202010412910.6 2020-05-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/986,081 Continuation US20230075836A1 (en) 2020-05-15 2022-11-14 Model training method and related device

Publications (1)

Publication Number Publication Date
WO2021227804A1 true WO2021227804A1 (zh) 2021-11-18

Family

ID=78526362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088787 WO2021227804A1 (zh) 2020-05-15 2021-04-21 一种模型训练方法及相关设备

Country Status (3)

Country Link
US (1) US20230075836A1 (zh)
CN (1) CN113673533A (zh)
WO (1) WO2021227804A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131357A (zh) * 2022-09-01 2022-09-30 合肥中科类脑智能技术有限公司 一种输电通道挂空悬浮物检测方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444558A (zh) * 2020-11-05 2022-05-06 佳能株式会社 Training method and training apparatus for a neural network for object recognition
JP2022090491A (ja) * 2020-12-07 2022-06-17 キヤノン株式会社 Image processing apparatus, image processing method, and program
CN114581751B (zh) * 2022-03-08 2024-05-10 北京百度网讯科技有限公司 Training method for an image recognition model, and image recognition method and apparatus
CN117542085B (zh) * 2024-01-10 2024-05-03 湖南工商大学 Knowledge distillation-based pedestrian detection method, apparatus and device for campus scenes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921294A (zh) * 2018-07-11 2018-11-30 浙江大学 Progressive block-wise knowledge distillation method for neural network acceleration
US20180365564A1 (en) * 2017-06-15 2018-12-20 TuSimple Method and device for training neural network
CN109961442A (zh) * 2019-03-25 2019-07-02 腾讯科技(深圳)有限公司 Training method and apparatus for a neural network model, and electronic device
CN110472730A (zh) * 2019-08-07 2019-11-19 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method for convolutional neural networks and scalable dynamic prediction method


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131357A (zh) * 2022-09-01 2022-09-30 合肥中科类脑智能技术有限公司 Method for detecting suspended foreign objects in a power transmission corridor
CN115131357B (zh) * 2022-09-01 2022-11-08 合肥中科类脑智能技术有限公司 Method for detecting suspended foreign objects in a power transmission corridor

Also Published As

Publication number Publication date
CN113673533A (zh) 2021-11-19
US20230075836A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
WO2021227804A1 (zh) Model training method and related device
US11586992B2 (en) Travel plan recommendation method, apparatus, device and computer readable storage medium
WO2021249071A1 (zh) Lane line detection method and related device
US9760806B1 (en) Method and system for vision-centric deep-learning-based road situation analysis
Yang et al. Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment
CN108805016B (zh) Head and shoulder region detection method and apparatus
WO2019209583A1 (en) System and method of object-based navigation
Matzka et al. Efficient resource allocation for attentive automotive vision systems
CN108960074B (zh) Deep learning-based small-size pedestrian target detection method
US11756309B2 (en) Contrastive learning for object detection
US20240149906A1 (en) Agent trajectory prediction using target locations
CN115797736B (zh) Training of a target detection model, and target detection method, apparatus, device and medium
CN113807399A (zh) Neural network training method, detection method, and apparatus
Fang et al. Traffic police gesture recognition by pose graph convolutional networks
Katare et al. Bias detection and generalization in AI algorithms on edge for autonomous driving
Nejad et al. Vehicle trajectory prediction in top-view image sequences based on deep learning method
WO2023179593A1 (zh) Data processing method and apparatus
JP2023036795A (ja) Image processing method, model training method, apparatus, electronic device, storage medium, computer program, and autonomous driving vehicle
Pan et al. A Hybrid Deep Learning Algorithm for the License Plate Detection and Recognition in Vehicle-to-Vehicle Communications
Venkatesh et al. An intelligent traffic management system based on the Internet of Things for detecting rule violations
Roncancio et al. Ceiling analysis of pedestrian recognition pipeline for an autonomous car application
US11879744B2 (en) Inferring left-turn information from mobile crowdsensing
CN114863685B (zh) Risk acceptance-based traffic participant trajectory prediction method and system
CN116541715B (zh) Target detection method, model training method, target detection system and apparatus
CN113963027B (zh) Training of an uncertainty detection model, and uncertainty detection method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21803747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21803747

Country of ref document: EP

Kind code of ref document: A1