Disclosure of Invention
Embodiments of the invention disclose a multi-modal feature fusion method, device and system based on a convolutional neural network, which overcome the limitations of single-modality recognition in the prior art and improve the accuracy of biometric recognition.
An embodiment of the invention discloses a multi-modal feature fusion method based on a convolutional neural network, comprising the following steps:
extracting features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality;
screening, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet a preset condition from the first feature set of each modality to obtain a second feature set of each modality;
and determining, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and fusing the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
Optionally, the heterogeneous image includes:
a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, wherein each image corresponds to one modality.
Optionally, when the heterogeneous image is a near-infrared face image or a visible light face image, the extracting features of multiple modalities from different heterogeneous images includes:
detecting the input visible light face image or near infrared face image to obtain position information of a face and position information of key points;
preprocessing the input visible light face image or near-infrared face image;
and inputting the preprocessed near-infrared face image or visible light face image into a trained face image feature extraction model, and extracting face features under near-infrared light or visible light.
Optionally, when the heterogeneous image is a visible light iris image or a near-infrared iris image, the extracting features of multiple modalities from different heterogeneous images includes:
extracting, in a first manner and a second manner respectively, correlation features of the two eyes in the visible light iris image or the near-infrared iris image to obtain a first target feature set and a second target feature set;
and extracting the depth features of the iris from the first target feature set and the second target feature set according to the complementarity of the first target feature set and the second target feature set.
Optionally, the screening out, according to the correlation between different modalities, features that meet a preset condition from the first feature set of each modality includes:
respectively screening out features that maximize the inter-class difference and minimize the intra-class difference from the first feature set of each modality to obtain a third feature set of each modality;
and analyzing the third feature set of each modality through a multivariate regression model to obtain the second feature set of each modality.
An embodiment of the invention further discloses a multi-modal feature fusion device based on a convolutional neural network, comprising:
a multi-modal feature extraction unit, configured to extract features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality;
a screening unit, configured to screen, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet a preset condition from the first feature set of each modality to obtain a second feature set of each modality;
and a fusion unit, configured to determine, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and to fuse the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
Optionally, the heterogeneous image includes:
a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, wherein each image corresponds to one modality.
Optionally, the screening unit includes:
a screening subunit, configured to respectively screen out the features that maximize the inter-class difference and minimize the intra-class difference from the first feature set of each modality to obtain a third feature set of each modality;
and an analysis subunit, configured to analyze the third feature set of each modality through the multivariate regression model to obtain the second feature set of each modality.
An embodiment of the invention further discloses a multi-modal feature fusion system based on a convolutional neural network, comprising:
an acquisition end and a data processing end;
the acquisition end is configured to acquire heterogeneous images representing different modalities;
the data processing end is configured to extract features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality;
to screen, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet a preset condition from the first feature set of each modality to obtain a second feature set of each modality;
and to determine, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and to fuse the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
Optionally, the heterogeneous image includes:
a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, wherein each image corresponds to one modality.
Embodiments of the invention disclose a multi-modal feature fusion method, device and system based on a convolutional neural network, which include: extracting features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality; screening, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet a preset condition from the first feature set of each modality to obtain a second feature set of each modality; and determining, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and fusing the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition. In this way, the multi-modal features are fused and the multi-modal convolutional neural network is trained on the fused features to obtain a multi-modal convolutional neural network for feature recognition, which overcomes the limitation of single-modality recognition in the prior art and improves the accuracy of biometric recognition.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flow diagram of a multi-modal feature fusion method based on a convolutional neural network according to an embodiment of the present invention is shown. In this embodiment, the method includes:
S101: extracting features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality;
In this embodiment, the heterogeneous images are images captured under different scene conditions, such as different illumination, different shooting angles, different lens settings (near and far distances), and different shooting sites (offices, banks, residential communities, and the like).
In this embodiment, the following four images are taken as examples to explain the scheme, including: visible light face image, near-infrared face image, visible light iris image and near-infrared iris image.
In this embodiment, feature extraction may be performed on images of different modalities (for example, a face or an iris) in multiple ways, which is not limited here. However, to clearly explain a specific implementation of this embodiment, two ways of performing feature extraction are described below, one for face images and one for iris images.
In one embodiment, as shown in fig. 2, for a near-infrared face image or a visible-light face image, S101 includes:
S201: detecting the input visible light face image or near-infrared face image to obtain position information of a face and position information of key points;
S202: preprocessing the input visible light face image or near-infrared face image;
Because the illumination conditions and shooting angles differ between captures, there is a certain difference between the face image in a sample and the standard face image. To eliminate the error introduced by this difference, the visible light face image or near-infrared face image is preprocessed, which specifically includes:
acquiring key point position information and illumination conditions of a standard face;
the standard face may be preset, or the key point position information and the illumination condition of the average face calculated on the training set may be used as the standard face.
aligning the key point positions of the visible light face image or near-infrared face image with the key point positions of the standard face according to the acquired face position information and key point position information of the visible light face image or near-infrared face image;
Acquiring illumination of a visible light face image or a near-infrared face image;
and converting the illumination of the visible light face image or the near-infrared face image into the illumination condition of a standard face through an image processing algorithm.
The number of operations used to align the key points of the sample with the key points of the standard face and to convert the illumination of the face image to the illumination condition of the standard face is not limited. In addition, the order in which the key point alignment and the illumination conversion are carried out may be adjusted arbitrarily and is not limited in this embodiment.
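For illustration only, the following is a minimal Python sketch of these two preprocessing steps (key point alignment and illumination conversion), assuming OpenCV is available. The 5-point standard-face template and the use of CLAHE for illumination normalization are illustrative assumptions; this embodiment does not fix the concrete algorithms.

import cv2
import numpy as np

# Hypothetical standard-face key points (eyes, nose tip, mouth corners) for a 112x112 crop.
STANDARD_KEYPOINTS = np.float32([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2]
])

def align_to_standard_face(image, keypoints, size=(112, 112)):
    """Estimate a similarity transform from the detected key points to the standard
    key points and warp the face image accordingly."""
    src = np.float32(keypoints)
    M, _ = cv2.estimateAffinePartial2D(src, STANDARD_KEYPOINTS)
    return cv2.warpAffine(image, M, size)

def normalize_illumination(image):
    """Approximate the 'convert to standard illumination' step with CLAHE on the
    luminance channel; the embodiment leaves the exact image processing algorithm open."""
    ycrcb = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)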
S203: and inputting the preprocessed visible light face image or near-infrared face image into a trained face image feature extraction model, and extracting face features under near-infrared light or visible light.
In this embodiment, the face image feature extraction model is obtained by training on standard face images, and the features that can be extracted by the model include an identity feature vector and feature vectors of different attributes, for example a gender feature vector and an age feature vector.
Specifically, the face image feature extraction model may be a multi-task neural network model, in which each task corresponds to extracting a different kind of facial feature, for example: identity features, gender features, age features, and the like.
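For illustration only, a minimal PyTorch sketch of such a multi-task network is given below: a shared trunk with one head per subtask (identity, gender, age). The backbone layers, feature dimension and head shapes are illustrative assumptions, not the patented architecture.

import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Shared CNN trunk with one head per subtask (identity, gender, age)."""
    def __init__(self, num_identities, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(                     # stand-in for the shared trunk
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.identity_head = nn.Linear(feat_dim, num_identities)  # softmax classifier
        self.gender_head = nn.Linear(feat_dim, 1)                  # score used with a hinge loss
        self.age_head = nn.Linear(feat_dim, 1)                     # age regression output

    def forward(self, x):
        f = self.backbone(x)    # f is the shared face feature vector
        return f, self.identity_head(f), self.gender_head(f), self.age_head(f)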
The objective of the multi-task neural network is to minimize the weighted sum of the losses of its subtasks. To this end, different loss functions may be used for the different subtasks when optimizing the multi-task neural network model, specifically including:
1. For the identity recognition task
The multi-task neural network model may be optimized, for example, using a softmax loss function as the optimization objective, where the softmax loss function is as follows:
L_Identity(x) = -Σ_{i=1}^{N} y_Identity,i · log( exp(ŷ_i(x)) / Σ_{j=1}^{N} exp(ŷ_j(x)) )
where N is the number of classes, x is the input face image, y_Identity ∈ R^{N×1} is a class vector representing the class of the face image, and ŷ_i(x) represents the output of the i-th node of the face identity classifier learned by the neural network.
2. For the gender recognition task
The face gender estimation task divides face images into two categories according to gender, and this task may use a binary classification loss function, represented by the hinge loss, as the optimization objective. The hinge loss function is as follows:
L_Gender(x) = max(0, 1 − y_Gender · ŷ_Gender(x))
where y_Gender ∈ {−1, +1} is a label representing the gender of the face image, and ŷ_Gender(x) is the prediction output of the neural network for the gender of the input face image.
3. For the age estimation task
The face age estimation task predicts the age of a face from the face image and is a regression task. This task may use a regression loss function, represented by the squared loss, as the optimization objective. The squared loss is as follows:
L_Age(x) = (y_Age − ŷ_Age(x))²
where y_Age is the true age of the face image and ŷ_Age(x) is the prediction output of the neural network for the age of the input face image.
It should be noted that identity classification, gender classification and age estimation are not the only possible combination of subtasks for the multi-task neural network; subtasks may be replaced by ethnicity classification, hairstyle recognition, and the like. The number of subtasks is also not limited to three and may be any number and combination. The optimization objective of the overall multi-task neural network is the weighted sum of the subtask losses, as follows:
L = λ_I·L_I + λ_G·L_G + λ_A·L_A + …
where each λ is the loss weight of the corresponding subtask.
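For illustration only, the following is a minimal PyTorch sketch of this weighted multi-task objective, combining the softmax, hinge and squared losses defined above. The loss weights and tensor shapes are illustrative assumptions; the inputs correspond to the head outputs of the model sketch given earlier.

import torch
import torch.nn.functional as F

def multitask_loss(id_logits, gender_score, age_pred, y_id, y_gender, y_age,
                   lam=(1.0, 0.5, 0.5)):
    """Weighted sum L = λ_I·L_I + λ_G·L_G + λ_A·L_A of the three subtask losses."""
    l_identity = F.cross_entropy(id_logits, y_id)                 # softmax loss for identity
    l_gender = torch.clamp(                                       # hinge loss, y_gender in {-1, +1}
        1.0 - y_gender * gender_score.squeeze(1), min=0.0).mean()
    l_age = F.mse_loss(age_pred.squeeze(1), y_age)                # squared loss for age regression
    return lam[0] * l_identity + lam[1] * l_gender + lam[2] * l_age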
In this embodiment, after the face feature extraction model is obtained, the preprocessed visible light image or near-infrared image is input into the face feature extraction model, and the face features under visible light or near-infrared light are extracted.
In a second embodiment, as shown in fig. 3, when the heterogeneous image is a visible light iris image or a near infrared iris image, S101 includes:
S301: extracting, in a first manner and a second manner respectively, correlation features of the two eyes in the visible light iris image or the near-infrared iris image to obtain a first target feature set and a second target feature set;
The first manner and the second manner are two different feature extraction methods. For example, the first manner may be a preset convolutional algorithm, such as a Pairwise CNNs algorithm, and the second manner may be a conventional feature extraction method, such as an ordinal measure filter.
S302: extracting the depth features of the iris from the first target feature set and the second target feature set according to the complementarity of the first target feature set and the second target feature set.
Different feature extraction methods have their own advantages and disadvantages, and the features they extract are complementary to a certain extent; by exploiting the complementarity of features extracted in different manners, depth features with better robustness can be extracted.
Preferably, a convolutional neural network model based on a maxout activation unit can be adopted to extract the depth features of the iris. In this embodiment, the extracted depth features may express the similarities and differences between the iris textures more robustly.
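For illustration only, a minimal PyTorch sketch of a maxout activation unit that could serve as the building block of such a network is given below; the channel sizes and number of linear pieces are illustrative assumptions.

import torch
import torch.nn as nn

class MaxoutConv2d(nn.Module):
    """Convolution that produces k linear 'pieces' per output channel and keeps
    the element-wise maximum over the pieces (maxout activation unit)."""
    def __init__(self, in_channels, out_channels, kernel_size, pieces=2, **kw):
        super().__init__()
        self.pieces = pieces
        self.out_channels = out_channels
        self.conv = nn.Conv2d(in_channels, out_channels * pieces, kernel_size, **kw)

    def forward(self, x):
        y = self.conv(x)                                     # (B, out*pieces, H, W)
        b, _, h, w = y.shape
        y = y.view(b, self.out_channels, self.pieces, h, w)
        return y.max(dim=2).values                           # element-wise max over pieces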
Further, assuming that the heterogeneous images include a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, the obtained first feature sets include: a face feature set under visible light, a face feature set under near-infrared light, an iris feature set under visible light, and an iris feature set under near-infrared light.
S102: screening, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet the preset condition from the first feature set of each modality to obtain a second feature set of each modality;
In this embodiment, for the same living body, the different modalities are correlated to a certain extent, and features having such correlation may be screened out from the different feature sets accordingly. Specifically, S102 includes:
respectively screening out features that maximize the inter-class difference and minimize the intra-class difference from the first feature set of each modality to obtain a third feature set of each modality;
and analyzing the third feature set of each modality through a multivariate regression model to obtain the second feature set of each modality.
In this embodiment, the intra-class difference refers to the similarity between different features in the same image, where the similarity between features may be represented by the distance between them; the inter-class difference refers to the distance between features in different images. Within the same image, the greater the similarity between features, the smaller the intra-class difference; the greater the distance between features in different images, the greater the inter-class difference.
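For illustration only, a minimal NumPy sketch of a screening step of this kind is given below. It uses a standard Fisher-style per-dimension ratio of between-class scatter to within-class scatter, computed from class labels, as a stand-in for the inter-class/intra-class criterion described above; the specific criterion and the number of retained features are illustrative assumptions.

import numpy as np

def select_discriminative_features(X, labels, keep=128):
    """X: (n_samples, n_features) first feature set of one modality; labels: class ids.
    Keeps the dimensions that maximize inter-class difference and minimize intra-class difference."""
    classes = np.unique(labels)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    score = between / (within + 1e-12)
    selected = np.argsort(score)[::-1][:keep]     # indices of retained feature dimensions
    return X[:, selected], selected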
In this embodiment, the multivariate regression model includes CCA (Canonical Correlation Analysis), PLS (Partial Least Squares), CSR (Coupled Spectral Regression), and the like.
The main idea is that the third feature set of each modality is processed according to the correlation among the features in the third feature sets of the modalities, so as to obtain a common feature space containing the second feature set of each modality.
In addition, to improve the accuracy of the subsequent multi-modal convolutional neural network training, the distance D between correlated feature points in the third feature sets of different modalities may be made approximately equal to the distance between correlated feature points of samples of the same type.
Here, samples of the same type refers to samples captured under the same modality, for example, two face images under visible light.
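For illustration only, a minimal Python sketch of projecting two modalities into a common feature space with CCA, one of the multivariate regression models listed above, using the scikit-learn implementation; the number of components is an illustrative choice.

from sklearn.cross_decomposition import CCA

def to_common_space(feats_modality_a, feats_modality_b, n_components=64):
    """Each input: (n_samples, n_features) third feature set of one modality.
    Returns the projections of both modalities into the shared (common) feature space,
    i.e. the second feature sets."""
    cca = CCA(n_components=n_components)
    proj_a, proj_b = cca.fit_transform(feats_modality_a, feats_modality_b)
    return proj_a, proj_b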
S103: determining, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and fusing the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
In this embodiment, as shown in fig. 4, assume that f_1 represents the face feature set under visible light, f_2 represents the face feature set under near-infrared light, f_3 represents the iris feature set under visible light, and f_4 represents the iris feature set under near-infrared light. The features in each feature set are input into the multi-modal feature convolution learning layer to obtain four corresponding feature matrices W_1, W_2, W_3 and W_4; then, according to the weights of the different modalities, W_1·f_1, W_2·f_2, W_3·f_3 and W_4·f_4 are connected in series.
The weights may be preset by a technician, or may be determined according to the training results during the training process.
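For illustration only, a minimal PyTorch sketch of this weighted fusion at the fully connected layer is given below: each modality's second feature set f_i is passed through its own learnable weight matrix W_i and the results are concatenated before the recognition layers. The feature dimensions, fused dimension and classifier head are illustrative assumptions.

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, in_dims=(64, 64, 64, 64), fused_dim=128, num_classes=1000):
        super().__init__()
        # One weight matrix W_i per modality (visible face, NIR face, visible iris, NIR iris).
        self.per_modality = nn.ModuleList(
            [nn.Linear(d, fused_dim, bias=False) for d in in_dims])
        self.classifier = nn.Linear(fused_dim * len(in_dims), num_classes)

    def forward(self, features):
        # features: list of four tensors f_1..f_4, each of shape (batch, in_dims[i])
        weighted = [w(f) for w, f in zip(self.per_modality, features)]
        fused = torch.cat(weighted, dim=1)          # series connection of W_1 f_1 ... W_4 f_4
        return self.classifier(fused)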
In this embodiment, the multi-modal features are fused and the multi-modal convolutional neural network is trained on the fused features to obtain a multi-modal convolutional neural network for feature recognition, which overcomes the limitation of single-modality recognition in the prior art and improves the accuracy of biometric recognition.
Referring to fig. 5, a schematic structural diagram of a multi-modal feature fusion apparatus based on a convolutional neural network according to an embodiment of the present invention is shown. In this embodiment, the apparatus includes:
the multi-modal feature extraction unit 501 is configured to extract features of multiple modalities from different heterogeneous images to obtain a first feature set of each modality;
the screening unit 502 is configured to screen out, in the multi-modal convolutional neural network, features meeting preset conditions from the first feature set of each modality according to correlations between different modalities to obtain a second feature set of each modality;
The fusion unit 503 is configured to determine, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and to fuse the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
Optionally, the heterogeneous image includes:
a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, wherein each image corresponds to one modality.
Optionally, the screening unit includes:
a screening subunit, configured to respectively screen out the features that maximize the inter-class difference and minimize the intra-class difference from the first feature set of each modality to obtain a third feature set of each modality;
and an analysis subunit, configured to analyze the third feature set of each modality through the multivariate regression model to obtain the second feature set of each modality.
Optionally, the multi-modal feature extraction unit is specifically configured to:
detecting the input visible light face image or near infrared face image to obtain position information of a face and position information of key points;
preprocessing the input visible light face image or near-infrared face image;
and inputting the preprocessed near-infrared face image or visible light face image into a trained face image feature extraction model, and extracting face features under near-infrared light or visible light.
And
extracting, in a first manner and a second manner respectively, correlation features of the two eyes in the visible light iris image or the near-infrared iris image to obtain a first target feature set and a second target feature set;
and extracting the depth features of the iris from the first target feature set and the second target feature set according to the complementarity of the first target feature set and the second target feature set.
With this device, the multi-modal features are fused and the multi-modal convolutional neural network is trained on the fused features to obtain a multi-modal convolutional neural network for feature recognition, which overcomes the limitation of single-modality recognition in the prior art and improves the accuracy of biometric recognition.
Referring to fig. 6, a schematic structural diagram of a multi-modal feature fusion system based on a convolutional neural network according to an embodiment of the present invention is shown, where the system includes:
an acquisition end 601 and a data processing end 602;
The acquisition end 601 is configured to acquire heterogeneous images representing different modalities;
the data processing end 602 is configured to extract features of a plurality of modalities from different heterogeneous images to obtain a first feature set of each modality;
to screen, in the multi-modal convolutional neural network and according to the correlation among different modalities, the features that meet a preset condition from the first feature set of each modality to obtain a second feature set of each modality;
and to determine, at a fully connected layer of the multi-modal convolutional neural network, the weight of the second feature set of each modality, and to fuse the second feature sets of the multiple modalities according to the weights, so that the fused second feature sets are used to train the multi-modal convolutional neural network for biometric recognition.
Optionally, the heterogeneous image includes:
a visible light face image, a near-infrared face image, a visible light iris image and a near-infrared iris image, wherein each image corresponds to one modality.
Optionally, when the heterogeneous image is a near-infrared face image or a visible light face image, the data processing end, in executing the step of extracting features of multiple modalities from different heterogeneous images, is specifically configured to perform:
detecting the input visible light face image or near infrared face image to obtain position information of a face and position information of key points;
preprocessing the input visible light face image or near-infrared face image;
and inputting the preprocessed near-infrared face image or visible light face image into a trained face image feature extraction model, and extracting face features under near-infrared light or visible light.
Optionally, when the heterogeneous image is a visible light iris image or a near-infrared iris image, the data processing end, in executing the step of extracting features of multiple modalities from different heterogeneous images, is specifically configured to perform:
extracting, in a first manner and a second manner respectively, correlation features of the two eyes in the visible light iris image or the near-infrared iris image to obtain a first target feature set and a second target feature set;
and extracting the depth features of the iris from the first target feature set and the second target feature set according to the complementarity of the first target feature set and the second target feature set.
Optionally, the data processing end, in executing the step of screening out, according to the correlation between different modalities, features that meet the preset condition from the first feature set of each modality, is specifically configured to perform:
respectively screening out features that maximize the inter-class difference and minimize the intra-class difference from the first feature set of each modality to obtain a third feature set of each modality;
and analyzing the third feature set of each modality through a multivariate regression model to obtain the second feature set of each modality.
By the system, heterogeneous images in a complex scene are collected, multi-modal features are fused, the multi-modal convolutional neural network is trained according to the fused features, and the multi-modal convolutional neural network for feature recognition is obtained.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.