CN110674483A - Identity recognition method based on multi-mode information - Google Patents

Identity recognition method based on multi-mode information

Info

Publication number
CN110674483A
Authority
CN
China
Prior art keywords
model
score
data set
face
identity recognition
Prior art date
Legal status
Granted
Application number
CN201910749103.0A
Other languages
Chinese (zh)
Other versions
CN110674483B (en)
Inventor
管贻生
叶家杰
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910749103.0A
Publication of CN110674483A
Application granted
Publication of CN110674483B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses an identity recognition method based on multi-modal information, which comprises the following steps: step one, making a labelled multi-modal video data set; step two, constructing and training a face detection model and a head detection model; step three, constructing and training feature extraction models for the face, the head and the voice; step four, extracting features of the face, head and voice information with the trained feature extraction models; step five, constructing and training classification models to classify the three extracted features separately; step six, predicting results with the classification models using the three features; step seven, fusing the classification results according to a formulated multi-modal information fusion strategy; step eight, sorting the fused results and outputting an identity recognition result. The invention provides an identity recognition network model based on multi-modal information, which has wide application prospects in fields such as human-computer interaction, information security and security monitoring.

Description

Identity recognition method based on multi-mode information
Technical Field
The invention relates to the technical field of pattern recognition and biometric recognition, and in particular to an identity recognition method based on multi-modal information.
Background
With economic development and accumulated experience, technological innovation has advanced greatly. In recent decades in particular, a series of new technologies represented by biometric identification has progressed rapidly, and among identity recognition methods, face recognition has attracted the most attention. Face recognition identifies a target by collecting and analysing facial features; it is easy to sample, convenient to operate in the background, and requires no contact with the subject. Compared with other recognition modes it therefore has clear advantages in practical applications, plays a prominent role in identity recognition and intelligent human-computer interaction, and exerts considerable influence on fields such as security monitoring and multimedia entertainment.
Driven by the recent interest in deep learning, research on identity recognition has grown rapidly; in face recognition and speaker recognition in particular, performance on public data sets has surpassed human recognition ability. Building on the continuous optimisation of single-modality identity recognition algorithms, researchers have gradually shifted from constrained to unconstrained environments, which greatly increases the difficulty of identity recognition; improving recognition in unconstrained environments remains a hard problem in current research. In many unconstrained environments, single-modality information alone cannot complete the identity recognition task, and multiple modalities must be considered together to improve recognition. Identity recognition methods based on multi-modal information are therefore an important research direction.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an identity recognition method based on multi-modal information.
The purpose of the invention is realized by the following technical scheme:
an identity recognition method based on multi-modal information comprises the following steps:
step one, collecting video clips of movie stars and well-known people, making a person data set containing multi-modal information, and adding identity labels to the data set;
step two, constructing detection models for the face and the head, training them respectively with different open-source data sets, and detecting the face and the head in the person data set from step one;
step three, constructing feature extraction models for the face, head and voice modalities from the face and head information detected in step two, and training the models with open-source data sets;
step four, extracting features of the face, head and voice information with the feature extraction models from step three;
step five, constructing a classification model and training it with the training set and validation set of the person data set from step one;
step six, using the classification models from step five to predict results on the test set of the person data set from step one;
step seven, performing information fusion on the prediction results from step six according to a formulated fusion strategy;
and step eight, sorting according to the fusion result from step seven and outputting the final identity recognition result.
Preferably, the specific process in step one of creating a person data set containing multi-modal information and adding identity labels to the data set is as follows:
constructing and training a face detection-score and quality-score evaluation model, and evaluating the face detection score (range 0-1) and quality score (range 0-200) of a large number of collected videos; screening the videos with this model and randomly cutting them into video segments of 3-30 seconds, such that 80% of the video data in the whole data set are high-score segments and 20% are low-score segments, with a further 5% of unknown-label segments added to the data set.
Preferably, in step two a face detection model is constructed according to the PyramidBox algorithm and trained with the open-source data sets MegaFace and MS-Celeb-1M; the head detection model is YOLOv3, which uses open-source pre-trained weights and detects only the position of a person's head.
Preferably, the feature extraction model for the face in step three is a neural-network feature extractor based on the VGG16 structure and the ArcFace loss function, trained with the open-source data sets MegaFace and MS-Celeb-1M; the ArcFace loss function is shown in the following formula (1):

$$L_{1}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}\tag{1}$$

in the above formula, N denotes the batch size of the input data, s denotes the radius of the hypersphere, m denotes the additive angular margin penalty, θ_{y_i} denotes the angle between the i-th sample feature and the weight of its true class, and θ_j denotes the included angle between the j-th column weight and the i-th sample feature;
the feature extraction models of the human face and the head have the same neural network structure and the same loss function, but the network parameters are not shared;
the feature extraction model for the voice is a neural network based on ResNet50 with a softmax loss in the last layer, trained with the open-source data set VoxCeleb2.
Preferably, in step four feature extraction is performed on the person data set from step one with the face, head and voice feature extraction models, and the output of the penultimate fully-connected layer, which has 512 nodes, is taken as the extracted feature.
Preferably, the classification model in step five is a multilayer perceptron with three fully-connected layers: the first and second layers have 1024 nodes each and the third layer has as many nodes as there are classes; the classification models are trained only with the three kinds of modal information extracted from the training set and the validation set, one classifier per modality.
Preferably, in step six the classification models predict results on the test set of the person data set; three sets of prediction results are produced, one each from the face, head and voice classification models.
Preferably, the fusion strategy in step seven performs information fusion at the decision layer, obtaining the fusion result by weighted averaging; the choice of weights is split into two cases: when the face detection score and quality score are high, the detection score and quality score of the face are used as weights, and in all other cases the ranking score of the prediction results is used as the weight;
specifically, the weight selection is divided into two parts according to the face detection score and quality score: high-score videos are classified through the first part, and low-score videos through the second part;
the fusion strategy of the first part uses the detection score and the quality score as weights to compute a weighted average, as shown in the following formula (2):

$$F=\frac{\sum_{i=1}^{n}\left(qua\_score_i\cdot det\_score_i\right)f_i}{\sum_{i=1}^{n}qua\_score_i\cdot det\_score_i}\tag{2}$$

in the above formula, qua_score_i denotes the quality score of the i-th frame image, det_score_i denotes the detection score of the i-th frame image, n denotes the number of frames contained in the currently input video, f_i denotes the feature of the i-th frame in the current video, and F denotes the composite feature obtained by the weighted average;
the fusion strategy of the second part uses the three prediction results for decision fusion: video IDs with the same prediction result are accumulated under each label, and a weighted average is obtained through the ranking scores, as shown in formulas (3) and (4):

[formulas (3) and (4), defining the ranking score rank_score_j and the weight score W, appear only as equation images in the original document]

in the above formulas, label_i denotes the i-th label, result_score_j denotes the j-th prediction result, rank_score_j denotes the ranking score of the j-th prediction result, m denotes the number of prediction results with the same label and the same video ID among all prediction results, W denotes the weight score of the same label and the same video ID, N denotes the number of classification categories in the data set, and k denotes the number of video IDs contained under the same label.
Preferably, in step eight the fusion results from step seven are sorted by label category in the data set, the Top-K method is used to select from the ranking, and the identity recognition result is output according to the sorted result.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention provides a method for making a multi-modal information data set, solving the technical problem of screening multi-modal data that meets requirements from a large volume of data;
(2) the invention provides an effective multi-modal information fusion model, solving the problem that identity recognition cannot be performed from single-modality information in real unconstrained environments, for example when accurate face recognition is impossible because an image is overexposed, shows only a profile, or the face is occluded;
(3) the invention provides a method for fusing multiple prediction results based on a weighted mean, combined with K-fold stratified sampling of the data set; this strengthens the prediction results, improves prediction accuracy, and alleviates the drop in accuracy that easily occurs when results are fused at the decision layer.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic flow chart of the multi-modal information dataset production of the present invention;
FIG. 3 is a schematic flow chart of the multi-modal feature extraction of the present invention;
FIG. 4 is a schematic diagram of a model structure of the fusion policy model according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1 to 4, an identity recognition method based on multi-modal information includes the following steps:
Step one, making a labelled person video data set with multi-modal information, where the multi-modal information comprises the face, the head, the voice and the like;
as shown in fig. 2, a face detection score evaluation and quality evaluation model is constructed and trained, a large number of videos obtained from a network are subjected to face detection score evaluation and quality score evaluation, the detection score range is 0-1, the quality score range is 0-200, the videos are screened through the face detection score evaluation and quality evaluation model, the videos are randomly cut into video segments of 3-30 seconds, the detection score is greater than 0.8, the quality score is greater than 80, the video segments are high-score video segments, the other video segments are low-score video segments, 80% of video data in the whole data set are high-score video segments, 20% of the video segments are low-score video segments, and 5% of unknown label video segments are added in the data set.
Step two, constructing and training the face detection model and the head detection model respectively; the two models have different neural network structures, the face detection model is trained with an open-source data set, and the head detection model uses open-source pre-trained weights;
(1) construct the face detection model according to the PyramidBox algorithm and train it with the open-source data sets MegaFace and MS-Celeb-1M.
(2) the head detection model is YOLOv3; it uses open-source pre-trained weights and detects only the position of the person's head.
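A minimal sketch of how the detectors' raw outputs might feed the later steps; the detection tuple format and the "head" class name are assumptions, not the patent's interface:

```python
def keep_heads(detections, conf_thresh=0.5):
    """Keep only head detections above a confidence threshold.
    detections: list of (class_name, confidence, (x1, y1, x2, y2))."""
    return [d for d in detections if d[0] == "head" and d[1] >= conf_thresh]

def crop_region(frame, box):
    """Crop a detected face/head box out of a frame array (H x W x C)."""
    x1, y1, x2, y2 = (int(v) for v in box)
    return frame[y1:y2, x1:x2]
```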
Step three, constructing and training the feature extraction models for the face, the head and the voice; the face and head feature extractors are neural networks with the VGG16 structure and the ArcFace loss function, trained with the open-source data sets MegaFace and MS-Celeb-1M; the voice extraction model is based on the ResNet50 neural network, its penultimate layer has 512 nodes, its last-layer loss function is softmax, and it is trained with the open-source data set VoxCeleb2; the ArcFace loss function is shown in formula (1):

$$L_{1}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}\tag{1}$$

in the above formula, N denotes the batch size of the input data, s denotes the radius of the hypersphere, m denotes the additive angular margin penalty, θ_{y_i} denotes the angle between the i-th sample feature and the weight of its true class, and θ_j denotes the included angle between the j-th column weight and the i-th sample feature.
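A sketch of the ArcFace head corresponding to formula (1), written in PyTorch since the patent names no framework; the defaults s=64, m=0.5 follow the ArcFace paper, not values stated here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Scaled cosine logits with an additive angular margin m on the true class."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features, labels):
        # cos(theta_j): cosine between each sample feature and each class weight
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only at the ground-truth class y_i
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        # cross-entropy over the re-scaled logits realises formula (1)
        return F.cross_entropy(logits, labels)
```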
Step four, extracting features of the face, head and voice information with the trained feature extraction models: the face and head detection models from step two and the three feature extraction models from step three are applied to the person data set from step one, and the output of the penultimate layer of each feature extraction model is taken as the extracted feature; the specific feature extraction process is shown in fig. 3.
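One way to read out the penultimate-layer features, sketched with a PyTorch forward hook; the layer handle passed in and the batch format are assumptions:

```python
import torch

def extract_penultimate(model, layer, batch):
    """Capture the 512-d output of the penultimate layer via a forward hook.
    `layer` is that nn.Module inside `model` (e.g. the second-to-last Linear)."""
    captured = {}
    def hook(_module, _inputs, output):
        captured["feat"] = output.detach()
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(batch)          # run a normal forward pass; the hook fires inside
    handle.remove()
    return captured["feat"]
```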
Step five, constructing and training the classification models to classify the three extracted features separately; each classification model is a multilayer perceptron with three fully-connected layers, where the first and second layers have 1024 nodes each, the number of nodes in the third (output) layer equals the number of classes in the data set, and the loss function of the last layer is softmax; the classification models are trained only with the three kinds of modal information extracted from the training set and the validation set, one classifier per modality.
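A minimal PyTorch sketch of this per-modality classifier; feat_dim=512 follows the extracted features, while num_classes is a placeholder for the data set's class count:

```python
import torch.nn as nn

class ModalityClassifier(nn.Module):
    """Three fully-connected layers: 1024, 1024, then num_classes outputs.
    Softmax is applied by the training loss (nn.CrossEntropyLoss)."""
    def __init__(self, feat_dim=512, num_classes=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```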
Step six, predicting results with the classification models using the three features, specifically: the person data set is split into K folds by stratified sampling with the K-fold method, and each of the three classification models predicts results on the K folds, yielding 3 × K prediction results.
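A sketch of this stratified K-fold prediction pass using scikit-learn; the sklearn-style .predict interface on the three classifiers is an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_predictions(models, features, labels, k=5):
    """Stratified K-fold split of the person data set; each of the three
    modality models predicts on every fold, giving K x 3 result arrays."""
    features, labels = np.asarray(features), np.asarray(labels)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    all_preds = []
    for _, test_idx in skf.split(features, labels):
        fold_preds = [m.predict(features[test_idx]) for m in models]
        all_preds.append(fold_preds)
    return all_preds  # K folds, each with 3 per-model prediction arrays
```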
Step seven, performing information fusion on the classification results according to the formulated multi-modal information fusion strategy; the specific structure is shown in fig. 4;
the fusion strategy is divided into two parts according to the face detection score and quality score: high-score videos are classified through the first part, and low-score videos through the second part.
The fusion strategy of the first part uses the detection score and the quality score as weights to compute a weighted average, as shown in the following formula (2):

$$F=\frac{\sum_{i=1}^{n}\left(qua\_score_i\cdot det\_score_i\right)f_i}{\sum_{i=1}^{n}qua\_score_i\cdot det\_score_i}\tag{2}$$

in the above formula, qua_score_i denotes the quality score of the i-th frame image, det_score_i denotes the detection score of the i-th frame image, n denotes the number of frames contained in the currently input video, f_i denotes the feature of the i-th frame in the current video, and F denotes the composite feature obtained by the weighted average.
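A NumPy sketch of formula (2); reading the per-frame weight as the product of detection and quality scores is our interpretation of "detection score and quality score as weights":

```python
import numpy as np

def fuse_frame_features(feats, det_scores, qua_scores):
    """Weighted average of per-frame features: each frame's weight is its
    detection score times its quality score, normalised over all n frames."""
    w = np.asarray(det_scores) * np.asarray(qua_scores)  # shape (n,)
    feats = np.asarray(feats)                            # shape (n, d)
    return (w[:, None] * feats).sum(axis=0) / w.sum()    # composite feature F
```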
The fusion strategy of the second part uses the three prediction results for decision fusion: video IDs with the same prediction result are accumulated under each label, and a weighted average is obtained through the ranking scores, as shown in formulas (3) and (4):

[formulas (3) and (4), defining the ranking score rank_score_j and the weight score W, appear only as equation images in the original document]

in the above formulas, label_i denotes the i-th label, result_score_j denotes the j-th prediction result, rank_score_j denotes the ranking score of the j-th prediction result, m denotes the number of prediction results with the same label and the same video ID among all prediction results, W denotes the weight score of the same label and the same video ID, N denotes the number of classification categories in the data set, and k denotes the number of video IDs contained under the same label.
Step eight, using the fusion result obtained in step seven, sorting by the weight score under each label, and finally outputting the identity recognition result according to the Top-K method.
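The final step reduces to sorting the fused weight scores and keeping the Top-K entries; a one-function sketch over the fused scores from the previous step:

```python
def top_k_identities(scores, k=5):
    """Sort fused (label, video_id) weight scores and keep the Top-K."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]
```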
The invention provides a method for making a multi-modal information data set, solving the technical problem of screening multi-modal data that meets requirements from a large volume of data; an effective multi-modal information fusion model, solving the problem that identity recognition cannot be performed from single-modality information in real unconstrained environments, for example when an image is overexposed, shows only a profile, or the face is occluded; and a method for fusing multiple prediction results based on a weighted mean, combined with K-fold stratified sampling of the data set, which strengthens the prediction results, improves prediction accuracy, and alleviates the drop in accuracy that easily occurs when results are fused at the decision layer.
The present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents and are included in the scope of the present invention.

Claims (9)

1. An identity recognition method based on multi-modal information, characterized by comprising the following steps:
step one, collecting video clips of movie stars and well-known people, making a person data set containing multi-modal information, and adding identity labels to the data set;
step two, constructing detection models for the face and the head, training them respectively with different open-source data sets, and detecting the face and the head in the person data set from step one;
step three, constructing feature extraction models for the face, head and voice modalities from the face and head information detected in step two, and training the models with open-source data sets;
step four, extracting features of the face, head and voice information with the feature extraction models from step three;
step five, constructing a classification model and training it with the training set and validation set of the person data set from step one;
step six, using the classification models from step five to predict results on the test set of the person data set from step one;
step seven, performing information fusion on the prediction results from step six according to a formulated fusion strategy;
and step eight, sorting according to the fusion result from step seven and outputting the final identity recognition result.
2. The multi-modal information-based identity recognition method according to claim 1, wherein the specific process in step one of creating a person data set containing multi-modal information and adding identity labels to the data set is as follows:
constructing and training a face detection-score and quality-score evaluation model, and evaluating the face detection score (range 0-1) and quality score (range 0-200) of a large number of collected videos; screening the videos with this model and randomly cutting them into video segments of 3-30 seconds, such that 80% of the video data in the whole data set are high-score segments and 20% are low-score segments, with a further 5% of unknown-label segments added to the data set.
3. The identity recognition method based on multi-modal information according to claim 1, wherein in step two a face detection model is constructed according to the PyramidBox algorithm and trained with the open-source data sets MegaFace and MS-Celeb-1M; the head detection model is YOLOv3, which uses open-source pre-trained weights and detects only the position of a person's head.
4. The identity recognition method based on multi-modal information according to claim 1, wherein the feature extraction model for the face in step three is a neural-network feature extractor based on the VGG16 structure and the ArcFace loss function, trained with the open-source data sets MegaFace and MS-Celeb-1M; wherein the ArcFace loss function is shown in the following formula (1):

$$L_{1}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}\tag{1}$$

in the above formula, N denotes the batch size of the input data, s denotes the radius of the hypersphere, m denotes the additive angular margin penalty, θ_{y_i} denotes the angle between the i-th sample feature and the weight of its true class, and θ_j denotes the included angle between the j-th column weight and the i-th sample feature;
the feature extraction models of the human face and the head have the same neural network structure and the same loss function, but the network parameters are not shared;
the feature extraction model for the voice is a neural network based on ResNet50 with a softmax loss in the last layer, trained with the open-source data set VoxCeleb2.
5. The method of claim 1, wherein in step four the person data set from step one is processed with the face, head and voice feature extraction models, and the output of the penultimate fully-connected layer, which has 512 nodes, is taken as the extracted feature.
6. The identity recognition method based on multi-modal information according to claim 1, wherein the classification model in step five is a multilayer perceptron with three fully-connected layers: the first and second layers have 1024 nodes each and the third layer has as many nodes as there are classes; the classification models are trained only with the three kinds of modal information extracted from the training set and the validation set, one classifier per modality.
7. The identity recognition method based on multi-modal information as claimed in claim 1, wherein in step six the classification models predict results on the test set of the person data set; three sets of prediction results are produced, one each from the face, head and voice classification models.
8. The identity recognition method based on multi-modal information according to claim 1, wherein the fusion strategy in step seven performs information fusion at the decision layer, obtaining the fusion result by weighted averaging; the choice of weights is split into two cases: when the face detection score and quality score are high, the detection score and quality score of the face are used as weights, and in all other cases the ranking score of the prediction results is used as the weight;
specifically, the weight selection is divided into two parts according to the face detection score and quality score: high-score videos are classified through the first part, and low-score videos through the second part;
the fusion strategy of the first part uses the detection score and the quality score as weights to compute a weighted average, as shown in the following formula (2):

$$F=\frac{\sum_{i=1}^{n}\left(qua\_score_i\cdot det\_score_i\right)f_i}{\sum_{i=1}^{n}qua\_score_i\cdot det\_score_i}\tag{2}$$

in the above formula, qua_score_i denotes the quality score of the i-th frame image, det_score_i denotes the detection score of the i-th frame image, n denotes the number of frames contained in the currently input video, f_i denotes the feature of the i-th frame in the current video, and F denotes the composite feature obtained by the weighted average;
the fusion strategy of the second part uses the three prediction results for decision fusion: video IDs with the same prediction result are accumulated under each label, and a weighted average is obtained through the ranking scores, as shown in formulas (3) and (4):

[formulas (3) and (4), defining the ranking score rank_score_j and the weight score W, appear only as equation images in the original document]

in the above formulas, label_i denotes the i-th label, result_score_j denotes the j-th prediction result, rank_score_j denotes the ranking score of the j-th prediction result, m denotes the number of prediction results with the same label and the same video ID among all prediction results, W denotes the weight score of the same label and the same video ID, N denotes the number of classification categories in the data set, and k denotes the number of video IDs contained under the same label.
9. The multi-modal information based identity recognition method according to claim 1, wherein in step eight the fusion results from step seven are sorted by label category in the data set, the Top-K method is used to select from the ranking, and the identity recognition result is output according to the sorted result.
CN201910749103.0A 2019-08-14 2019-08-14 Identity recognition method based on multi-mode information Active CN110674483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910749103.0A CN110674483B (en) 2019-08-14 2019-08-14 Identity recognition method based on multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910749103.0A CN110674483B (en) 2019-08-14 2019-08-14 Identity recognition method based on multi-mode information

Publications (2)

Publication Number Publication Date
CN110674483A true CN110674483A (en) 2020-01-10
CN110674483B CN110674483B (en) 2022-05-13

Family

ID=69068584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910749103.0A Active CN110674483B (en) 2019-08-14 2019-08-14 Identity recognition method based on multi-mode information

Country Status (1)

Country Link
CN (1) CN110674483B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507311A (en) * 2020-05-22 2020-08-07 南京大学 Video character recognition method based on multi-mode feature fusion depth network
CN111862990A (en) * 2020-07-21 2020-10-30 苏州思必驰信息科技有限公司 Speaker identity verification method and system
CN112215257A (en) * 2020-09-14 2021-01-12 德清阿尔法创新研究院 Multi-person multi-modal perception data automatic marking and mutual learning method
CN112818175A (en) * 2021-02-07 2021-05-18 中国矿业大学 Factory worker searching method and training method of worker recognition model
CN112989967A (en) * 2021-02-25 2021-06-18 复旦大学 Personnel identity identification method based on audio and video information fusion
WO2021204086A1 (en) * 2020-04-06 2021-10-14 华为技术有限公司 Identity authentication method, and method and device for training identity authentication model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219640B1 (en) * 1999-08-06 2001-04-17 International Business Machines Corporation Methods and apparatus for audio-visual speaker recognition and utterance verification
US20030110038A1 (en) * 2001-10-16 2003-06-12 Rajeev Sharma Multi-modal gender classification using support vector machines (SVMs)
US20170039357A1 (en) * 2015-08-03 2017-02-09 Samsung Electronics Co., Ltd. Multi-modal fusion method for user authentication and user authentication method
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219640B1 (en) * 1999-08-06 2001-04-17 International Business Machines Corporation Methods and apparatus for audio-visual speaker recognition and utterance verification
US20030110038A1 (en) * 2001-10-16 2003-06-12 Rajeev Sharma Multi-modal gender classification using support vector machines (SVMs)
US20170039357A1 (en) * 2015-08-03 2017-02-09 Samsung Electronics Co., Ltd. Multi-modal fusion method for user authentication and user authentication method
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Y. Liu et al., "iQIYI-VID: A large dataset for multi-modal person identification", arXiv preprint arXiv:1811.07548 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021204086A1 (en) * 2020-04-06 2021-10-14 华为技术有限公司 Identity authentication method, and method and device for training identity authentication model
CN111507311A (en) * 2020-05-22 2020-08-07 南京大学 Video character recognition method based on multi-mode feature fusion depth network
CN111507311B (en) * 2020-05-22 2024-02-20 南京大学 Video character recognition method based on multi-mode feature fusion depth network
CN111862990A (en) * 2020-07-21 2020-10-30 苏州思必驰信息科技有限公司 Speaker identity verification method and system
CN112215257A (en) * 2020-09-14 2021-01-12 德清阿尔法创新研究院 Multi-person multi-modal perception data automatic marking and mutual learning method
CN112818175A (en) * 2021-02-07 2021-05-18 中国矿业大学 Factory worker searching method and training method of worker recognition model
CN112818175B (en) * 2021-02-07 2023-09-01 中国矿业大学 Factory staff searching method and training method of staff identification model
CN112989967A (en) * 2021-02-25 2021-06-18 复旦大学 Personnel identity identification method based on audio and video information fusion

Also Published As

Publication number Publication date
CN110674483B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110674483B (en) Identity recognition method based on multi-mode information
CN107330362B (en) Video classification method based on space-time attention
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN111275085A (en) Online short video multi-modal emotion recognition method based on attention fusion
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN106650806A (en) Cooperative type deep network model method for pedestrian detection
CN108681712A (en) A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN108171184A (en) Method for distinguishing is known based on Siamese networks again for pedestrian
CN105787458A (en) Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN111581385A (en) Chinese text type identification system and method for unbalanced data sampling
Jing et al. Yarn-dyed fabric defect classification based on convolutional neural network
CN110348416A (en) Multi-task face recognition method based on multi-scale feature fusion convolutional neural network
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
CN113536922A (en) Video behavior identification method for weighting fusion of multiple image tasks
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
Islam et al. A review on video classification with methods, findings, performance, challenges, limitations and future work
CN108256307A (en) A kind of mixing enhancing intelligent cognition method of intelligent business Sojourn house car
CN105930792A (en) Human action classification method based on video local feature dictionary
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN112329438A (en) Automatic lie detection method and system based on domain confrontation training
CN112732921A (en) False user comment detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant