WO2021258481A1 - Beauty prediction method and device based on multi-task and weak supervision, and storage medium - Google Patents

Beauty prediction method and device based on multi-task and weak supervision, and storage medium

Info

Publication number
WO2021258481A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
noise
value
tasks
task
Prior art date
Application number
PCT/CN2020/104568
Other languages
English (en)
French (fr)
Inventor
甘俊英
白振峰
翟懿奎
何国辉
Original Assignee
五邑大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 五邑大学
Priority to US17/424,407 (granted as US11721128B2)
Publication of WO2021258481A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/08Learning methods
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V10/87Arrangements for image or video recognition or understanding using pattern recognition or machine learning using selection of the recognition techniques, e.g. of a classifier in a multiple classifier system
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • G06V40/172Classification, e.g. identification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Definitions

  • the present invention relates to the field of image processing, in particular to a beauty prediction method, device and storage medium based on multi-task and weak supervision.
  • the face beauty prediction technology uses image processing and artificial intelligence to intelligently judge the face beauty level.
  • the face beauty prediction technology is mainly implemented through deep learning, but deep learning networks require a large number of training samples, the trained models easily overfit, the correlation and difference between multiple tasks are ignored, the cost of data labeling in strongly supervised learning is high, and the practical difficulty of obtaining all truth labels for a database is ignored.
  • at present, most work trains models on single-task, strongly labeled data: single-task training ignores the correlation between tasks, even though real-life tasks are often inextricably linked; strongly labeled data is difficult to obtain in full in real life, and obtaining all the truth labels is expensive.
  • the purpose of the present invention is to solve at least one of the technical problems existing in the prior art, and to provide a beauty prediction method, device and storage medium based on multitasking and weak supervision.
  • the beauty prediction method of the first aspect of the present invention includes the following steps:
  • Preprocessing the input face image to obtain a preprocessed image, wherein the preprocessed image includes a true value image marked with a truth value label and a noise image marked with a noise label;
  • the preprocessed image is assigned to multiple tasks, where each task includes multiple truth-value images and multiple noise images, and the multiple tasks include a main task specifically for predicting the beauty of human faces and multiple auxiliary tasks related to face beauty prediction.
  • the image feature is processed by the residual network, the mapping from the image feature to the residual value between the truth label and the noise label is learned, and the first predicted value is obtained;
  • the standard neural network learns the mapping from the image feature to the truth label and obtains a second predicted value; the classifier obtains the classification result according to the first predicted value and the second predicted value.
  • the preprocessing of the input face image to obtain the preprocessed image specifically includes: sequentially performing image enhancement processing, image correction processing, image cropping processing, image deduplication processing, and image normalization processing on the face image to obtain the preprocessed image.
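The cropping and normalization steps of this chain can be sketched as follows; this is a minimal illustration assuming 8-bit grayscale images stored as nested Python lists, and it omits the enhancement, correction, deduplication, and face/key-point detection steps the patent also mentions:

```python
def center_crop(img, size):
    """Crop a square region of side `size` from the image center."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def normalize(img):
    """Scale 8-bit pixel values to [0, 1] so all images share one range."""
    return [[p / 255.0 for p in row] for row in img]

def preprocess(img, size=2):
    # Crop + normalize are the steps that make image sizes consistent
    # for the later feature-extraction stage.
    return normalize(center_crop(img, size))

# A 4x4 dummy "image" cropped to 2x2 and normalized.
dummy = [[0, 51, 102, 153],
         [51, 102, 153, 204],
         [102, 153, 204, 255],
         [153, 204, 255, 255]]
out = preprocess(dummy, size=2)
```

After this step every image has the same side length and a common value range, which is what the subsequent shared feature extraction assumes.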
  • the feature extraction layer is one of VGG16, ResNet50, Google Inception V3, or DenseNet.
  • the overall loss function of the multiple tasks is: $L = \sum_{n=1}^{N} \omega_n L_n$, where $L_n$ is the loss of a single said task, and $\omega_n$ is the weight corresponding to each said task.
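This weighted multi-task total loss can be sketched directly; the task losses and weights below are made-up numbers for illustration, with the main (face beauty) task weighted highest:

```python
def total_loss(task_losses, task_weights):
    """Overall loss: the weighted sum of each task's individual loss L_n,
    where the weight w_n controls that task's influence on training."""
    assert len(task_losses) == len(task_weights)
    return sum(w * l for l, w in zip(task_losses, task_weights))

# Main task first, two auxiliary tasks with lower weights.
loss = total_loss([0.8, 0.5, 0.3], [1.0, 0.5, 0.5])
```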
  • the loss function of the residual net is: $L_{noise} = -\frac{1}{N_n}\sum_{i=1}^{N_n} y_i \log h_i$, where $D_n$ is the image feature, $y_i$ is the noise label, $h_i$ is the first predicted value, $L_{noise}$ is the loss value of the residual net, and $N_n$ is the total number of image features.
  • the loss function of the standard neural network is: $L_{clean} = -\frac{1}{N_n}\sum_{j=1}^{N_n} v_j \log g_j$, where $v_j$ is the truth label, $g_j$ is the second predicted value, and $L_{clean}$ is the loss value of the standard neural network.
  • the overall goal of the multiple classification networks is: $\arg\min_W\big((\alpha L_{clean,1}+L_{noise,1})+\dots+(\alpha L_{clean,n}+L_{noise,n})\big)$, where $W$ is a hyperparameter, and $\alpha$ is a trade-off parameter between the loss value of the residual network and the loss value of the standard neural network.
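The value being minimized combines, per task, the clean-label loss scaled by the trade-off parameter α with the noise-label loss; a sketch with hypothetical per-task loss values:

```python
def overall_objective(clean_losses, noise_losses, alpha):
    """Sum over tasks of (alpha * L_clean_n + L_noise_n); training seeks the
    parameters that minimize this combined value."""
    assert len(clean_losses) == len(noise_losses)
    return sum(alpha * c + n for c, n in zip(clean_losses, noise_losses))

# Two tasks with hypothetical branch losses; alpha balances the branches.
obj = overall_objective([0.4, 0.6], [0.2, 0.1], alpha=0.5)
```

Raising α pushes training to trust the (scarcer) truth-labeled branch more; lowering it leans on the noise-labeled branch.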
  • the number of the noise images is greater than the number of the true value images.
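One simple way to honor this constraint — every task holding both kinds of images, with noise images in the majority — is a round-robin deal from pools in which noise images already outnumber truth images; the counts below are arbitrary illustrations:

```python
def allocate(truth_images, noise_images, n_tasks):
    """Deal truth-labeled and noise-labeled images round-robin across tasks;
    each task ends up with both kinds, and noise images stay in the majority
    whenever they are the majority of the pool."""
    tasks = [{"truth": [], "noise": []} for _ in range(n_tasks)]
    for i, img in enumerate(truth_images):
        tasks[i % n_tasks]["truth"].append(img)
    for i, img in enumerate(noise_images):
        tasks[i % n_tasks]["noise"].append(img)
    return tasks

# 6 truth images and 24 noise images split over 3 tasks: 2 truth + 8 noise each.
tasks = allocate(list(range(6)), list(range(24)), n_tasks=3)
```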
  • a beauty prediction device based on multitasking and weak supervision is characterized in that it includes:
  • a preprocessing module for preprocessing the input face image to obtain a preprocessed image, wherein the preprocessed image includes a true value image marked with a truth value label and a noise image marked with a noise label;
  • the multi-task module is used to allocate the preprocessed image to a plurality of tasks, wherein each of the tasks includes a plurality of the truth-value images and a plurality of the noise images, and the plurality of tasks includes a main task specifically for face beauty prediction and a plurality of auxiliary tasks related to face beauty prediction;
  • a feature extraction module configured to process the true value images and the noise images of the multiple tasks to obtain shared image features;
  • the classification module is used to process the image features to obtain a plurality of classification results.
  • the classification module includes a plurality of classification networks composed of a residual network, a standard neural network, and a classifier, wherein the plurality of classification networks correspond one-to-one with the plurality of tasks;
  • the image feature is processed by the residual network, the mapping from the image feature to the residual value between the truth label and the noise label is learned, and the first predicted value is obtained;
  • the standard neural network learns the mapping from the image feature to the truth label and obtains a second predicted value; the classifier obtains the classification result according to the first predicted value and the second predicted value.
  • a storage medium stores executable instructions, which can be executed by a computer, so that the computer executes the beauty prediction method based on multitasking and weak supervision as described in the first aspect of the present invention.
  • the above scheme has at least the following beneficial effects: the correlation and difference between multiple tasks are used to enhance the expression ability of the main task's face beauty prediction; the weakly supervised classification network reduces the dependence on truth labels, lowers the cost of data labeling, reduces the impact of noise labels on the face beauty prediction model, and improves the generalization ability of the face beauty prediction model.
  • FIG. 1 is a flowchart of a beauty prediction method based on multitasking and weak supervision according to an embodiment of the present invention
  • FIG. 2 is a structural diagram of a beauty prediction device based on multi-task and weak supervision according to an embodiment of the present invention
  • Figure 3 is a structural diagram of the face beauty prediction model.
  • the orientation descriptions involved, such as up, down, front, back, left, and right, indicate orientations or positional relationships based on those shown in the drawings; they are used only to facilitate and simplify the description of the present invention, do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present invention.
  • some embodiments of the present invention provide a beauty prediction method based on multi-tasking and weak supervision, including the following steps:
  • Step S100 Preprocess the input face image to obtain a preprocessed image, where the preprocessed image includes a true value image marked with a truth value label and a noise image marked with a noise label;
  • Step S200 Distribute the preprocessed image to multiple tasks, where each task contains multiple truth-value images and multiple noise images, and the multiple tasks include a main task specifically for predicting the beauty of a face and multiple auxiliary tasks related to face beauty prediction;
  • Step S300 processing the true value images and noise images of multiple tasks through the feature extraction layer to obtain shared image features
  • Step S400 processing image features through a plurality of classification networks 200 composed of a residual net 210, a standard neural network 220, and a classifier 230 to obtain a plurality of classification results, wherein the plurality of classification networks 200 correspond to a plurality of tasks one-to-one;
  • the image features are processed through the residual network 210, the mapping from the image features to the residual values between the ground truth labels and the noise labels is learned, and the first predicted value is obtained; the standard neural network 220 learns the mapping from the image features to the ground truth labels and obtains the second predicted value; the classifier 230 obtains the classification result according to the first predicted value and the second predicted value.
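The per-task flow just described — a residual branch supervised by noise labels, a standard branch supervised by truth labels, and a classifier fusing the two predictions as a weighted sum k = W1·a + W2·b (the fusion rule given in the description) — can be sketched with toy logistic branches. The two-feature input, the weight vectors, and the sigmoid scoring are all illustrative assumptions, not the patent's architecture:

```python
import math

def sigmoid(z):
    """Squash a score into (0, 1) so it can serve as a predicted value."""
    return 1.0 / (1.0 + math.exp(-z))

def standard_branch(features, w_clean):
    """Second predicted value b: maps shared features toward the truth label."""
    return sigmoid(sum(f * w for f, w in zip(features, w_clean)))

def residual_branch(features, w_clean, w_res):
    """First predicted value a: the clean score plus a learned residual term
    that absorbs the gap between the truth label and the noise label."""
    clean_score = sum(f * w for f, w in zip(features, w_clean))
    residual = sum(f * w for f, w in zip(features, w_res))
    return sigmoid(clean_score + residual)

def classify(features, w_clean, w_res, w1=0.5, w2=0.5):
    """Classifier fusion of the two predictions: k = W1*a + W2*b."""
    a = residual_branch(features, w_clean, w_res)
    b = standard_branch(features, w_clean)
    return w1 * a + w2 * b

score = classify([1.0, -0.5], w_clean=[0.8, 0.4], w_res=[0.1, 0.0])
```

In training, the residual branch would be fit against noise labels and the standard branch against truth labels; here the weights are simply fixed toy values.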
  • the correlation and difference between multiple tasks are used to enhance the expression ability of the main task's face beauty prediction; through the weakly supervised classification network 200, the dependence on truth labels is reduced, the cost of data labeling is reduced, the influence of noise labels on the face beauty prediction model is reduced, and the generalization ability of the face beauty prediction model is improved.
  • in addition, multiple related tasks are learned simultaneously, and the auxiliary tasks improve the accuracy of the main face beauty prediction task; the classification network 200 with weak supervision mode can effectively use the images with truth value labels, which solves the problems of poor model generalization ability, single-task-only training, and high data labeling cost.
  • the input face image is data from multiple databases, including LSFBD face beauty database, GENKI-4K smile recognition database, IMDB-WIKI500k+ database, and SCUT-FBP5500 database.
  • preprocessing the input face image to obtain the preprocessed image is specifically: sequentially performing image enhancement processing, image correction processing, image cropping processing, image deduplication processing, and image normalization processing on the face image to obtain the preprocessed image.
  • the preprocessing can efficiently perform area detection and key point detection on the face image, as well as alignment and cropping, so that the size of the face image is consistent, which is convenient for subsequent operations.
  • the preprocessed image is input to the face beauty prediction model to perform step S200, step S300, and step S400.
  • refer to Figure 3 for the structure of the face beauty prediction model.
  • for step S200, in each task, the number of noise images is greater than the number of true value images.
  • the overall loss function of multiple tasks is: $L = \sum_{n=1}^{N} \omega_n L_n$, where $L_n$ is the loss of a single task, and $\omega_n$ is the weight corresponding to each task.
  • the main task is the prediction of the beauty of the face; the auxiliary tasks are tasks related to the prediction of the beauty of the face, such as gender recognition and facial expression recognition.
  • the feature extraction layer is one of VGG16, ResNet50, Google Inception V3, or DenseNet.
  • the specific structure of the feature extraction layer is: the first layer is a 3×3 convolutional layer; the second layer is a 3×3 convolutional layer; the third layer is a 3×3 convolutional layer; the fourth layer is a pooling layer; the fifth layer is a 3×3 convolutional layer; the sixth layer is a 3×3 convolutional layer; the seventh layer is a pooling layer; the eighth layer is a 3×3 convolutional layer; the ninth layer is a 3×3 convolutional layer; the tenth layer is a 3×3 convolutional layer; the eleventh layer is a pooling layer; the twelfth layer is a 3×3 convolutional layer; the thirteenth layer is a 3×3 convolutional layer; the fourteenth layer is a pooling layer.
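The 14-layer stack just listed can be summarized programmatically. Assuming same-padded 3×3 convolutions and stride-2 pooling (the text states neither padding nor pooling stride, so both are assumptions), only the four pooling layers change the spatial size:

```python
# Layer sequence from the text: "conv" = 3x3 convolutional layer, "pool" = pooling layer.
LAYERS = ["conv", "conv", "conv", "pool",
          "conv", "conv", "pool",
          "conv", "conv", "conv", "pool",
          "conv", "conv", "pool"]

def output_side(input_side, layers):
    """Spatial side length after the stack: each stride-2 pooling halves it,
    while a same-padded 3x3 convolution leaves it unchanged."""
    side = input_side
    for layer in layers:
        if layer == "pool":
            side //= 2
    return side

side = output_side(224, LAYERS)  # e.g. for a hypothetical 224x224 input
```

With four halvings, a 224×224 input would leave a 14×14 feature map shared by all task branches.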
  • the images of multiple tasks are passed through the feature extraction layer to obtain shared image features, and multiple related tasks are learned in parallel through the shared image features, mining the relationships between the related tasks to obtain additional useful information.
  • the loss function of the residual net 210 is: $L_{noise} = -\frac{1}{N_n}\sum_{i=1}^{N_n} y_i \log h_i$, where $D_n$ is the image feature, $y_i$ is the noise label, $h_i$ is the first predicted value, $L_{noise}$ is the loss value of the residual net 210, and $N_n$ is the total number of image features.
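Taking the residual branch's loss to be an average per-sample loss between the first predicted values h_i and the noise labels y_i, a minimal sketch follows; the binary cross-entropy form is an assumption here, chosen only to make the averaged supervision concrete:

```python
import math

def noise_branch_loss(h, y, eps=1e-12):
    """L_noise: mean binary cross-entropy over the N_n prediction/label pairs,
    supervising the residual branch with the (possibly incorrect) noise labels."""
    assert len(h) == len(y)
    total = 0.0
    for h_i, y_i in zip(h, y):
        total -= y_i * math.log(h_i + eps) + (1 - y_i) * math.log(1 - h_i + eps)
    return total / len(h)

# Three toy predictions against binary noise labels.
l_noise = noise_branch_loss([0.9, 0.2, 0.7], [1, 0, 1])
```

The standard network's L_clean would take the same averaged form, but with the second predicted values g_j measured against the truth labels v_j.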
  • the loss function of the standard neural network 220 is: $L_{clean} = -\frac{1}{N_n}\sum_{j=1}^{N_n} v_j \log g_j$, where $v_j$ is the truth label, $g_j$ is the second predicted value, and $L_{clean}$ is the loss value of the standard neural network 220.
  • in the standard neural network 220, the mapping from the image feature to the ground truth label is learned, and the second predicted value is obtained; the ground truth labels are used to supervise all the image features entering the standard neural network 220.
  • in addition, the first predicted value and the second predicted value enter the classifier 230, and the classification result is calculated as $k = W_1 a + W_2 b$, where $k$ is the classification result, $a$ is the first predicted value, $b$ is the second predicted value, and $W_1$ and $W_2$ are the corresponding weights.
  • the overall goal of the multiple classification networks 200 is: $\arg\min_W\big((\alpha L_{clean,1}+L_{noise,1})+\dots+(\alpha L_{clean,n}+L_{noise,n})\big)$, where $W$ is a hyperparameter and $\alpha$ is a trade-off parameter between the loss value of the residual network 210 and the loss value of the standard neural network 220.
  • some embodiments of the present invention provide a beauty prediction device based on multitasking and weak supervision, which applies the beauty prediction method based on multitasking and weak supervision as described in the method embodiment; the beauty prediction device includes:
  • the preprocessing module 100 is used to preprocess the input face image to obtain a preprocessed image, where the preprocessed image includes a true value image marked with a true value label and a noise image marked with a noise label;
  • the multi-task module 200 is used to distribute the preprocessed image to multiple tasks, where each task contains multiple truth-value images and multiple noise images, and the multiple tasks include a main task specifically for predicting the beauty of human faces and multiple auxiliary tasks related to face beauty prediction;
  • the feature extraction module 300 is used to process true value images and noise images of multiple tasks to obtain shared image features
  • the classification module 400 is used to process image features to obtain multiple classification results.
  • the classification module 400 includes multiple classification networks 200 composed of a residual network 210, a standard neural network 220, and a classifier 230, wherein the multiple classification networks 200 correspond one-to-one with the multiple tasks;
  • the image features are processed through the residual network 210, the mapping from the image features to the residual values between the ground truth labels and the noise labels is learned, and the first predicted value is obtained; the standard neural network 220 learns the mapping from the image features to the ground truth labels and obtains the second predicted value; the classifier 230 obtains the classification result according to the first predicted value and the second predicted value.
  • the beauty prediction device based on multitasking and weak supervision applies the beauty prediction method based on multitasking and weak supervision as described in the method embodiment; with the cooperation of its modules, it can execute the various steps of that method and has the same technical effects, which will not be detailed here.
  • Some embodiments of the present invention provide a storage medium storing executable instructions, which can be executed by a computer, so that the computer executes the beauty prediction method based on multi-tasking and weak supervision as described in the method embodiment of the present invention.
  • Examples of storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A beauty prediction method, device, and storage medium based on multi-task and weak supervision. The method comprises: preprocessing an input face image to obtain a preprocessed image (S100); allocating the preprocessed image to multiple tasks (S200); processing the truth-value images and noise images of the multiple tasks through a feature extraction layer to obtain shared image features (S300); and obtaining multiple classification results through multiple classification networks each composed of a residual network, a standard neural network, and a classifier (S400). The correlation and difference between the multiple tasks are used to enhance the expressive ability of the main task, face beauty prediction; the weakly supervised classification networks reduce the dependence on truth labels, lower the cost of data labeling, reduce the influence of noise labels on the face beauty prediction model, and improve the generalization ability of the face beauty prediction model.

Description

Beauty prediction method, device, and storage medium based on multi-task and weak supervision

Technical Field

The present invention relates to the field of image processing, and in particular to a beauty prediction method, device, and storage medium based on multi-task and weak supervision.
Background Art

Face beauty prediction technology combines image processing with artificial intelligence to intelligently judge the beauty level of a face. At present, face beauty prediction is mainly implemented through deep learning, but deep learning networks require a large number of training samples, the trained models easily overfit, the correlation and difference between multiple tasks are ignored, the cost of data labeling in strongly supervised learning is high, and the practical difficulty of obtaining all truth labels for a database is ignored. Currently, most work trains models on single-task, strongly labeled data: single-task training ignores the relationships between tasks, even though real-world tasks are often closely linked; strongly labeled data is hard to obtain in full, and acquiring all truth labels is expensive.
Summary of the Invention

The purpose of the present invention is to solve at least one of the technical problems in the prior art by providing a beauty prediction method, device, and storage medium based on multi-task and weak supervision.

The technical solution adopted by the present invention to solve its problem is as follows:

In a first aspect of the present invention, a beauty prediction method based on multi-task and weak supervision includes the following steps:
preprocessing an input face image to obtain a preprocessed image, wherein the preprocessed image includes truth-value images marked with truth labels and noise images marked with noise labels;

allocating the preprocessed image to multiple tasks, wherein each task contains multiple truth-value images and multiple noise images, and the multiple tasks include one main task, namely face beauty prediction, and multiple auxiliary tasks related to face beauty prediction;

processing the truth-value images and noise images of the multiple tasks through a feature extraction layer to obtain shared image features;

processing the image features through multiple classification networks, each composed of a residual network, a standard neural network, and a classifier, to obtain multiple classification results, wherein the multiple classification networks correspond one-to-one with the multiple tasks;

wherein, in a classification network, the residual network processes the image features, learns the mapping from the image features to the residual value between the truth label and the noise label, and obtains a first predicted value; the standard neural network learns the mapping from the image features to the truth label and obtains a second predicted value; and the classifier obtains the classification result from the first predicted value and the second predicted value.
According to the first aspect of the present invention, preprocessing the input face image to obtain the preprocessed image specifically comprises: sequentially performing image enhancement, image correction, image cropping, image deduplication, and image normalization on the face image to obtain the preprocessed image.

According to the first aspect of the present invention, the feature extraction layer is one of VGG16, ResNet50, Google Inception V3, or DenseNet.
According to the first aspect of the present invention, the overall loss function of the multiple tasks is:

$L = \sum_{n=1}^{N} \omega_n L_n$

where $L_n$ is the loss of a single task and $\omega_n$ is the weight corresponding to each task.
According to the first aspect of the present invention, the loss function of the residual network is:

$L_{noise} = -\frac{1}{N_n} \sum_{i=1}^{N_n} y_i \log h_i$

where $D_n$ is the image feature, $y_i$ is the noise label, $h_i$ is the first predicted value, $L_{noise}$ is the loss value of the residual network, and $N_n$ is the total number of image features.
According to the first aspect of the present invention, the loss function of the standard neural network is:

$L_{clean} = -\frac{1}{N_n} \sum_{j=1}^{N_n} v_j \log g_j$

where $v_j$ is the truth label, $g_j$ is the second predicted value, and $L_{clean}$ is the loss value of the standard neural network.
According to the first aspect of the present invention, the overall objective of the multiple classification networks is:

$\arg\min_W \big( (\alpha L_{clean,1} + L_{noise,1}) + \dots + (\alpha L_{clean,n} + L_{noise,n}) \big)$

where $W$ is a hyperparameter and $\alpha$ is a trade-off parameter between the loss value of the residual network and the loss value of the standard neural network.
According to the first aspect of the present invention, in each task the number of noise images is greater than the number of truth-value images.
In a second aspect of the present invention, a beauty prediction device based on multi-task and weak supervision includes:

a preprocessing module for preprocessing an input face image to obtain a preprocessed image, wherein the preprocessed image includes truth-value images marked with truth labels and noise images marked with noise labels;

a multi-task module for allocating the preprocessed image to multiple tasks, wherein each task contains multiple truth-value images and multiple noise images, and the multiple tasks include one main task, namely face beauty prediction, and multiple auxiliary tasks related to face beauty prediction;

a feature extraction module for processing the truth-value images and noise images of the multiple tasks to obtain shared image features;

a classification module for processing the image features to obtain multiple classification results, the classification module including multiple classification networks each composed of a residual network, a standard neural network, and a classifier, wherein the multiple classification networks correspond one-to-one with the multiple tasks;

wherein, in a classification network, the residual network processes the image features, learns the mapping from the image features to the residual value between the truth label and the noise label, and obtains a first predicted value; the standard neural network learns the mapping from the image features to the truth label and obtains a second predicted value; and the classifier obtains the classification result from the first predicted value and the second predicted value.
In a third aspect of the present invention, a storage medium stores executable instructions that can be executed by a computer, causing the computer to execute the beauty prediction method based on multi-task and weak supervision according to the first aspect of the present invention.
The above solution has at least the following beneficial effects: the correlation and difference between the multiple tasks are used to enhance the expressive ability of the main task, face beauty prediction; the weakly supervised classification networks reduce the dependence on truth labels, lower the cost of data labeling, reduce the influence of noise labels on the face beauty prediction model, and improve the generalization ability of the face beauty prediction model.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief Description of the Drawings

The present invention is further described below with reference to the drawings and examples.

Figure 1 is a flowchart of a beauty prediction method based on multi-task and weak supervision according to an embodiment of the present invention;

Figure 2 is a structural diagram of a beauty prediction device based on multi-task and weak supervision according to an embodiment of the present invention;

Figure 3 is a structural diagram of the face beauty prediction model.
Detailed Description

This section describes specific embodiments of the present invention in detail. Preferred embodiments are shown in the drawings, whose role is to supplement the textual description graphically so that each technical feature and the overall technical solution can be understood intuitively and vividly; they shall not, however, be construed as limiting the scope of protection of the present invention.

In the description of the present invention, orientation descriptions such as up, down, front, back, left, and right indicate orientations or positional relationships based on those shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore shall not be construed as limiting the present invention.

In the description of the present invention, "several" means one or more, and "multiple" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Where "first" and "second" are used, they serve only to distinguish technical features and shall not be construed as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating their order.

In the description of the present invention, unless otherwise expressly defined, terms such as "arrange", "install", and "connect" shall be understood broadly; those skilled in the art can reasonably determine the specific meanings of these terms in the present invention in light of the specific content of the technical solution.
Referring to Figure 1, some embodiments of the present invention provide a beauty prediction method based on multi-task and weak supervision, including the following steps:

Step S100: preprocess an input face image to obtain a preprocessed image, where the preprocessed image includes truth-value images marked with truth labels and noise images marked with noise labels;

Step S200: allocate the preprocessed image to multiple tasks, where each task contains multiple truth-value images and multiple noise images, and the multiple tasks include one main task, namely face beauty prediction, and multiple auxiliary tasks related to face beauty prediction;

Step S300: process the truth-value images and noise images of the multiple tasks through a feature extraction layer to obtain shared image features;

Step S400: process the image features through multiple classification networks 200, each composed of a residual network 210, a standard neural network 220, and a classifier 230, to obtain multiple classification results, where the multiple classification networks 200 correspond one-to-one with the multiple tasks;

wherein, in a classification network 200, the residual network 210 processes the image features, learns the mapping from the image features to the residual value between the truth label and the noise label, and obtains a first predicted value; the standard neural network 220 learns the mapping from the image features to the truth label and obtains a second predicted value; and the classifier 230 obtains the classification result from the first predicted value and the second predicted value.
In this embodiment, the correlation and difference between the multiple tasks are used to enhance the expressive ability of the main task, face beauty prediction; the weakly supervised classification networks 200 reduce the dependence on truth labels, lower the cost of data labeling, reduce the influence of noise labels on the face beauty prediction model, and improve the generalization ability of the face beauty prediction model.

In addition, in this beauty prediction method based on multi-task and weak supervision, multiple related tasks are learned simultaneously, and the auxiliary tasks improve the accuracy of the main face beauty prediction task; the weakly supervised classification network 200 can effectively use the images with truth labels, solving the problems of poor model generalization, single-task-only training, and excessive data labeling cost.
It should be noted that the input face images are data merged from multiple databases, including the LSFBD face beauty database, the GENKI-4K smile recognition database, the IMDB-WIKI 500k+ database, and the SCUT-FBP5500 database.
Further, preprocessing the input face image to obtain the preprocessed image specifically comprises: sequentially performing image enhancement, image correction, image cropping, image deduplication, and image normalization on the face image to obtain the preprocessed image. Preprocessing efficiently performs region detection, key-point detection, alignment, and cropping on the face images, making the face images uniform in size and convenient for subsequent operations.
In practice, the preprocessed image is input into the face beauty prediction model to execute step S200, step S300, and step S400. Refer to Figure 3 for the structure of the face beauty prediction model.
Further, for step S200, in each task the number of noise images is greater than the number of truth-value images. The overall loss function of the multiple tasks is:

$L = \sum_{n=1}^{N} \omega_n L_n$

where $L_n$ is the loss of a single task and $\omega_n$ is the weight corresponding to each task. It should be noted that the main task is face beauty prediction; the auxiliary tasks are tasks related to face beauty prediction, such as gender recognition and facial expression recognition.
Further, the feature extraction layer is one of VGG16, ResNet50, Google Inception V3, or DenseNet. In this embodiment, the specific structure of the feature extraction layer is: the first layer is a 3×3 convolutional layer; the second layer is a 3×3 convolutional layer; the third layer is a 3×3 convolutional layer; the fourth layer is a pooling layer; the fifth layer is a 3×3 convolutional layer; the sixth layer is a 3×3 convolutional layer; the seventh layer is a pooling layer; the eighth layer is a 3×3 convolutional layer; the ninth layer is a 3×3 convolutional layer; the tenth layer is a 3×3 convolutional layer; the eleventh layer is a pooling layer; the twelfth layer is a 3×3 convolutional layer; the thirteenth layer is a 3×3 convolutional layer; the fourteenth layer is a pooling layer. The images of the multiple tasks are passed through the feature extraction layer to obtain shared image features, and multiple related tasks are learned in parallel through the shared image features, mining the relationships between the related tasks to obtain additional useful information.
Further, for step S400, the loss function of the residual network 210 is:

$L_{noise} = -\frac{1}{N_n} \sum_{i=1}^{N_n} y_i \log h_i$

where $D_n$ is the image feature, $y_i$ is the noise label, $h_i$ is the first predicted value, $L_{noise}$ is the loss value of the residual network 210, and $N_n$ is the total number of image features. In the residual network 210, the mapping from the image features to the residual value between the truth label and the noise label is learned, and the first predicted value is obtained; the noise labels supervise all image features entering the residual network 210.
Further, the loss function of the standard neural network 220 is:

$L_{clean} = -\frac{1}{N_n} \sum_{j=1}^{N_n} v_j \log g_j$

where $v_j$ is the truth label, $g_j$ is the second predicted value, and $L_{clean}$ is the loss value of the standard neural network 220. In the standard neural network 220, the mapping from the image features to the truth label is learned, and the second predicted value is obtained; the truth labels supervise all image features entering the standard neural network 220.
In addition, the first predicted value and the second predicted value enter the classifier 230, and the classification result is calculated as: $k = W_1 a + W_2 b$, where $k$ is the classification result, $a$ is the first predicted value, $b$ is the second predicted value, $W_1$ is the weight corresponding to the first predicted value, and $W_2$ is the weight corresponding to the second predicted value.
Further, the overall objective of the multiple classification networks 200 is:

$\arg\min_W \big( (\alpha L_{clean,1} + L_{noise,1}) + \dots + (\alpha L_{clean,n} + L_{noise,n}) \big)$

where $W$ is a hyperparameter and $\alpha$ is a trade-off parameter between the loss value of the residual network 210 and the loss value of the standard neural network 220.
Referring to Figure 2, some embodiments of the present invention provide a beauty prediction device based on multi-task and weak supervision, applying the beauty prediction method based on multi-task and weak supervision described in the method embodiments. The beauty prediction device includes:

a preprocessing module 100 for preprocessing an input face image to obtain a preprocessed image, where the preprocessed image includes truth-value images marked with truth labels and noise images marked with noise labels;

a multi-task module 200 for allocating the preprocessed image to multiple tasks, where each task contains multiple truth-value images and multiple noise images, and the multiple tasks include one main task, namely face beauty prediction, and multiple auxiliary tasks related to face beauty prediction;

a feature extraction module 300 for processing the truth-value images and noise images of the multiple tasks to obtain shared image features;

a classification module 400 for processing the image features to obtain multiple classification results, the classification module 400 including multiple classification networks 200 each composed of a residual network 210, a standard neural network 220, and a classifier 230, where the multiple classification networks 200 correspond one-to-one with the multiple tasks;

wherein, in a classification network 200, the residual network 210 processes the image features, learns the mapping from the image features to the residual value between the truth label and the noise label, and obtains a first predicted value; the standard neural network 220 learns the mapping from the image features to the truth label and obtains a second predicted value; and the classifier 230 obtains the classification result from the first predicted value and the second predicted value.
In this device embodiment, the beauty prediction device based on multi-task and weak supervision applies the beauty prediction method based on multi-task and weak supervision described in the method embodiments. Through the cooperation of its modules, it can execute each step of that method and has the same technical effects, which are not detailed again here.
Some embodiments of the present invention provide a storage medium storing executable instructions that can be executed by a computer, causing the computer to execute the beauty prediction method based on multi-task and weak supervision described in the method embodiments of the present invention.

Examples of storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The above are only preferred embodiments of the present invention; the present invention is not limited to the above implementations. Any implementation that achieves the technical effects of the present invention by the same means shall fall within the scope of protection of the present invention.

Claims (10)

  1. A beauty prediction method based on multi-task and weak supervision, characterized by including the following steps:
    preprocessing an input face image to obtain a preprocessed image, wherein the preprocessed image includes truth-value images marked with truth labels and noise images marked with noise labels;
    allocating the preprocessed image to multiple tasks, wherein each task contains multiple truth-value images and multiple noise images, and the multiple tasks include one main task, namely face beauty prediction, and multiple auxiliary tasks related to face beauty prediction;
    processing the truth-value images and noise images of the multiple tasks through a feature extraction layer to obtain shared image features;
    processing the image features through multiple classification networks, each composed of a residual network, a standard neural network, and a classifier, to obtain multiple classification results, wherein the multiple classification networks correspond one-to-one with the multiple tasks;
    wherein, in a classification network, the residual network processes the image features, learns the mapping from the image features to the residual value between the truth label and the noise label, and obtains a first predicted value; the standard neural network learns the mapping from the image features to the truth label and obtains a second predicted value; and the classifier obtains the classification result from the first predicted value and the second predicted value.
  2. The beauty prediction method based on multitasking and weak supervision according to claim 1, wherein preprocessing the input face image to obtain the preprocessed image specifically comprises:
    sequentially performing image enhancement, image rectification, image cropping, image deduplication, and image normalization on the face image to obtain the preprocessed image.
  3. The beauty prediction method based on multitasking and weak supervision according to claim 1, wherein the feature extraction layer is one of VGG16, ResNet50, Google Inception V3, or DenseNet.
  4. The beauty prediction method based on multitasking and weak supervision according to claim 1, wherein the overall loss function of the multiple tasks is:
    L = Σ_n ω_n · L_n
    where L_n is the loss of a single one of the tasks and ω_n is the weight corresponding to each of the tasks.
  5. The beauty prediction method based on multitasking and weak supervision according to claim 1, wherein the loss function of the residual network is:
    L_noise = -(1/N_n) · Σ_{i∈D_n} y_i · log(h_i)
    where D_n is the image features, y_i is the noise label, h_i is the first prediction value, L_noise is the loss value of the residual network, and N_n is the total number of image features.
  6. The beauty prediction method based on multitasking and weak supervision according to claim 5, wherein the loss function of the standard neural network is:
    L_clean = -Σ_j v_j · log(g_j)
    where v_j is the truth label, g_j is the second prediction value, and L_clean is the loss value of the standard neural network.
  7. The beauty prediction method based on multitasking and weak supervision according to claim 6, wherein the overall target of the multiple classification networks is: arg min_W ((α·L_clean,1 + L_noise,1) + ... + (α·L_clean,n + L_noise,n)), where W is a hyperparameter and α is a trade-off parameter between the loss value of the residual network and the loss value of the standard neural network.
  8. The beauty prediction method based on multitasking and weak supervision according to claim 1, wherein, in each of the tasks, the number of noise images is greater than the number of truth images.
  9. A beauty prediction device based on multitasking and weak supervision, applying the beauty prediction method based on multitasking and weak supervision according to any one of claims 1 to 8, comprising:
    a preprocessing module, configured to preprocess an input face image to obtain a preprocessed image, wherein the preprocessed image comprises truth images marked with truth labels and noise images marked with noise labels;
    a multitask module, configured to distribute the preprocessed images to multiple tasks, wherein each of the tasks contains multiple truth images and multiple noise images, and the multiple tasks comprise a main task, namely facial beauty prediction, and multiple auxiliary tasks related to facial beauty prediction;
    a feature extraction module, configured to process the truth images and the noise images of the multiple tasks to obtain shared image features; and
    a classification module, configured to process the image features to obtain multiple classification results, wherein the classification module comprises multiple classification networks each composed of a residual network, a standard neural network, and a classifier, and the multiple classification networks correspond one-to-one to the multiple tasks;
    wherein, in a classification network, the residual network processes the image features and learns the mapping from the image features to the residual values between the truth labels and the noise labels to obtain a first prediction value; the standard neural network learns the mapping from the image features to the truth labels to obtain a second prediction value; and the classifier obtains the classification result from the first prediction value and the second prediction value.
  10. A storage medium, wherein the storage medium stores executable instructions that can be executed by a computer to cause the computer to perform the beauty prediction method based on multitasking and weak supervision according to any one of claims 1 to 8.
PCT/CN2020/104568 2020-06-24 2020-07-24 Beauty prediction method and device based on multitasking and weak supervision, and storage medium WO2021258481A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/424,407 US11721128B2 (en) 2020-06-24 2020-07-24 Beauty prediction method and device based on multitasking and weak supervision, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010587525.5A CN111832436B (zh) Beauty prediction method and device based on multitasking and weak supervision, and storage medium
CN202010587525.5 2020-06-24

Publications (1)

Publication Number Publication Date
WO2021258481A1 true WO2021258481A1 (zh) 2021-12-30

Family

ID=72898839

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104568 WO2021258481A1 (zh) Beauty prediction method and device based on multitasking and weak supervision, and storage medium

Country Status (3)

Country Link
US (1) US11721128B2 (zh)
CN (1) CN111832436B (zh)
WO (1) WO2021258481A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399406A (zh) * 2018-01-15 2018-08-14 中山大学 基于深度学习的弱监督显著性物体检测的方法及系统
CN110119689A (zh) * 2019-04-18 2019-08-13 五邑大学 一种基于多任务迁移学习的人脸美丽预测方法
CN110147456A (zh) * 2019-04-12 2019-08-20 中国科学院深圳先进技术研究院 一种图像分类方法、装置、可读存储介质及终端设备
CN110414489A (zh) * 2019-08-21 2019-11-05 五邑大学 一种基于多任务学习的人脸美丽预测方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792532B2 (en) * 2013-06-28 2017-10-17 President And Fellows Of Harvard College Systems and methods for machine learning enhanced by human measurements
US20170337682A1 (en) * 2016-05-18 2017-11-23 Siemens Healthcare Gmbh Method and System for Image Registration Using an Intelligent Artificial Agent
US20200395117A1 (en) * 2019-06-14 2020-12-17 Cycle Clarity, LLC Adaptive image processing method and system in assisted reproductive technologies


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BLOG.CSDN: "Loss Balancing Issues in Multi-Task Learning", CSDN, 24 August 2019 (2019-08-24), XP055882362, Retrieved from the Internet <URL:https://blog.csdn.net/qq_34527082/article/details/100048864> [retrieved on 20220121] *
HU MENGYING; HAN HU; SHAN SHIGUANG; CHEN XILIN: "Weakly Supervised Image Classification Through Noise Regularization", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 11509 - 11517, XP033687242, DOI: 10.1109/CVPR.2019.01178 *

Also Published As

Publication number Publication date
CN111832436A (zh) 2020-10-27
US20220309828A1 (en) 2022-09-29
US11721128B2 (en) 2023-08-08
CN111832436B (zh) 2023-06-16

Similar Documents

Publication Publication Date Title
EP3757905A1 (en) Deep neural network training method and apparatus
US10546209B2 (en) Machine learning method and apparatus
CN110414344B (zh) Video-based person classification method, intelligent terminal and storage medium
CN110516095B (zh) Weakly supervised deep hashing social image retrieval method and system based on semantic transfer
EP3798917A1 (en) Generative adversarial network (gan) for generating images
US20090290802A1 (en) Concurrent multiple-instance learning for image categorization
US11508173B2 (en) Machine learning prediction and document rendering improvement based on content order
CN108427740B (zh) Image emotion classification and retrieval algorithm based on deep metric learning
CN109063743B (zh) Method for constructing a medical data classification model based on semi-supervised multitask learning
WO2021233031A1 (zh) Image processing method, apparatus, device and storage medium, and image segmentation method
CN110942011A (zh) Video event recognition method, system, electronic device and medium
CN113326390A (zh) Image retrieval method based on a deep-feature-consistent hashing algorithm
CN112200031A (zh) Network model training method and device for generating textual descriptions of images
US11250299B2 (en) Learning representations of generalized cross-modal entailment tasks
US20200257984A1 (en) Systems and methods for domain adaptation
WO2021258482A1 (zh) Beauty prediction method and device based on transfer learning and weak supervision, and storage medium
WO2021258481A1 (zh) Beauty prediction method and device based on multitasking and weak supervision, and storage medium
CN116975743A (zh) Industry information classification method and apparatus, computer device, and storage medium
CN113516118B (zh) Multimodal cultural resource processing method with joint image-text embedding
CN115204301A (zh) Video-text matching model training, and video-text matching method and apparatus
CN113435206A (zh) Image-text retrieval method and apparatus, and electronic device
Sun et al. Robust ensembling network for unsupervised domain adaptation
CN117115564B (zh) Image classification method based on cross-modal concept discovery and reasoning, and intelligent terminal
Grzeszick Partially supervised learning of models for visual scene and object recognition
US11908080B2 (en) Generating surfaces with arbitrary topologies using signed distance fields

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20941548

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20941548

Country of ref document: EP

Kind code of ref document: A1