WO2016110005A1 - Gray level and depth information based multi-layer fusion multi-modal face recognition device and method - Google Patents

Gray level and depth information based multi-layer fusion multi-modal face recognition device and method

Info

Publication number
WO2016110005A1
WO2016110005A1 (application PCT/CN2015/074868)
Authority
WO
WIPO (PCT)
Prior art keywords
data
face
face recognition
depth information
dimensional
Prior art date
Application number
PCT/CN2015/074868
Other languages
French (fr)
Chinese (zh)
Inventor
夏春秋 (Xia Chunqiu)
Original Assignee
深圳市唯特视科技有限公司 (Shenzhen Weiteshi Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市唯特视科技有限公司 (Shenzhen Weiteshi Technology Co., Ltd.)
Publication of WO2016110005A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification


Abstract

Disclosed in the present invention are a gray-level and depth information based multi-layer fusion multi-modal face recognition device and method. The method mainly comprises the steps of: recognizing the gray-level information of a human face; recognizing the depth information of the human face; and normalizing the gray-level and depth information and, on the basis of the normalized matching scores, obtaining a multi-modally fused matching score through a fusion strategy to achieve multi-modal face recognition. In the solution of the present invention, a multi-modal system collects two-dimensional gray-level information and three-dimensional depth information, exploits the advantages of each, and overcomes the inherent shortcomings of a single-modal system (such as sensitivity to illumination for gray-level images and to expressions for depth images) via the fusion strategy, thereby effectively improving the performance of the face recognition system and making face recognition more accurate and rapid.

Description

Multi-modal face recognition device and method based on multi-layer fusion of gray-level and depth information

Technical Field
The present invention relates to the field of face recognition technology, and in particular to a multi-modal face recognition device and method based on multi-layer fusion of gray-level and depth information.
Background Art
Compared with two-dimensional face recognition, three-dimensional face recognition is robust to illumination and less affected by pose and expression. With the rapid development of 3D data acquisition technology and the substantial improvement in the quality and precision of 3D data, many researchers have therefore turned to this field.
CN20101025690 proposes features based on three-dimensional bending invariants for describing facial characteristics. The method extracts bending-invariant correlation features by encoding the local bending invariants of adjacent nodes on the three-dimensional face surface, signs these features and reduces their dimensionality with spectral regression to obtain principal components, and then recognizes three-dimensional faces with a K-nearest-neighbor classifier. However, extracting these features is computationally expensive, which limits the method's further application.
CN200910197378 proposes a fully automatic three-dimensional face detection and pose correction method. Through multi-scale moment analysis of the three-dimensional facial surface, the method uses facial region features to coarsely detect the face surface and nose-tip region features to accurately locate the tip of the nose, and then precisely segments the complete face surface. After nose-root region features derived from the distance information of the face surface are used to detect the position of the nose root, a face coordinate system is established and the face pose is corrected automatically. The purpose of that patent is to estimate the pose of three-dimensional face data, which belongs to the data preprocessing stage of a three-dimensional face recognition system.
Face grayscale images are susceptible to illumination changes, while face depth images are susceptible to data acquisition accuracy and expression changes. These factors affect the stability and accuracy of face recognition systems to some extent.
Multi-modal fusion systems have therefore attracted increasing attention. By acquiring multi-modal data, such a system can exploit the advantages of each modality and use a fusion strategy to overcome some inherent weaknesses of single-modal systems (such as the illumination sensitivity of grayscale images and the expression sensitivity of depth images), effectively improving the performance of the face recognition system.
Summary of the Invention
To solve the above technical problems by exploiting the advantages of each modality and overcoming the inherent weaknesses of single-modal systems through a fusion strategy, the present invention adopts the following technical solutions:
A multi-modal face recognition device based on multi-layer fusion of gray-level and depth information, comprising: a calculation unit for face recognition on gray-level information; a calculation unit for face recognition on depth information; a calculation unit for fusing the multi-modal face recognition scores; and a classifier calculation unit for classifying the data.
Preferably, in the above device, the calculation unit for face recognition on gray-level information comprises: a human eye detection unit, a two-dimensional data registration calculation unit, a grayscale face feature extraction unit, and a grayscale face recognition score calculation unit.
Preferably, in the above device, the calculation unit for face recognition on depth information comprises: a nose-tip detector unit, a three-dimensional data registration calculation unit, a depth face feature extraction unit, and a depth face recognition score calculation unit.
The present invention also discloses a multi-modal face recognition method based on multi-layer fusion of gray-level and depth information, comprising the following steps:

A. recognizing the gray-level information of the face;

B. recognizing the depth information of the face;

C. normalizing the face gray-level information and depth information and, based on the normalized matching scores, obtaining the multi-modal fused matching score through a fusion strategy to realize multi-modal face recognition.
Preferably, in the above method, step A comprises the following steps:

A1. Feature region localization: a human eye detector is used to acquire the eye regions. The detector is a cascaded classifier H obtained by the following algorithm:

Given a training sample set $S=\{(x_1,y_1),\ldots,(x_m,y_m)\}$ and a weak classifier space $\mathcal{H}$, where $x_i\in\chi$ is a sample vector, $y_i=\pm 1$ is the class label, and $m$ is the total number of samples, initialize the sample probability distribution

$$D_1(i)=\frac{1}{m},\quad i=1,\ldots,m.$$

For $t=1,\ldots,T$, each weak classifier $h\in\mathcal{H}$ is processed as follows:

Partition the sample space $\chi$ into $X_1,X_2,\ldots,X_n$, and on each cell $X_j$ set

$$h(x)=\frac{1}{2}\ln\frac{W_{+}^{j}+\varepsilon}{W_{-}^{j}+\varepsilon},\quad x\in X_j,\qquad W_{\pm}^{j}=\sum_{i:\,x_i\in X_j,\;y_i=\pm 1}D_t(i),$$

where $\varepsilon$ is a small positive constant.

Compute the normalization factor

$$Z=2\sum_{j=1}^{n}\sqrt{W_{+}^{j}W_{-}^{j}}.$$

Select the $h_t$ in the weak classifier space that minimizes $Z$.

Update the training sample probability distribution

$$D_{t+1}(i)=\frac{D_t(i)\exp\!\big(-y_i h_t(x_i)\big)}{Z_t},\quad i=1,\ldots,m,$$

where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a probability distribution.

The final strong classifier H is

$$H(x)=\operatorname{sign}\!\left(\sum_{t=1}^{T}h_t(x)\right).$$

A2. The obtained eye region positions are used for registration, and the LBP algorithm is applied to the registered data to obtain the LBP histogram feature. The LBP value is computed as

$$LBP_{P,R}=\sum_{p=0}^{P-1}s(g_p-g_c)\,2^{p},\qquad s(x)=\begin{cases}1,&x\ge 0\\0,&x<0,\end{cases}$$

where $g_c$ is the gray value of the center pixel and $g_p$ are the gray values of its $P$ neighbors on a circle of radius $R$. The feature is fed to the grayscale image classifier to obtain the grayscale matching score.
Preferably, in the above method, step B comprises the following steps:

B1. Feature region localization: determine the position of the nose-tip region of the face.

B2. For three-dimensional data in different poses, after the registration reference region is obtained, register the data with the ICP algorithm; after registration, compute the Euclidean distance between the input data and the three-dimensional face model data in the enrollment gallery.

B3. Obtain the depth image from the depth information, compensate and denoise the noise points in the mapped depth image with a filter, and finally select the expression-robust region to obtain the final three-dimensional face depth image.

B4. Extract the visual dictionary histogram feature vector of the three-dimensional depth image: after a test face image is input and Gabor-filtered, each filter-response vector is compared with all primitive words in the visual sub-dictionary corresponding to its position and, by distance matching, mapped to the closest primitive. The visual dictionary histogram feature of the original depth image is thus extracted and fed to the depth image classifier to obtain the matching score.
Preferably, in the above method, step C specifically comprises:

Score normalization of the two-dimensional gray-level information and the three-dimensional depth information using the max-min linear normalization principle:

$$\hat{S}_k=\frac{S_k-\min}{\max-\min}.$$

After score normalization, the matching scores of the different modalities are fused with a robust weighted-addition rule:

$$S=\sum_{k}\omega_k\hat{S}_k,$$

yielding the matching score after multi-modal data fusion. A linear discriminant analysis (LDA) algorithm then maximizes the objective function built from the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$,

$$J(W)=\frac{\left|W^{T}S_{B}W\right|}{\left|W^{T}S_{W}W\right|},$$

and the resulting LDA mapping matrix $W$ gives the fusion weights.
Preferably, in the above method, step B1 specifically comprises:

Step 1: Determine the threshold of the regional average negative effective energy density, defined as thr.

Step 2: Use the depth information to select the data to be processed, extracting the face data within a certain depth range.

Step 3: Compute the normal vectors of the face data selected by depth.

Step 4: Compute the regional average negative effective energy density: following its definition, compute the average negative effective energy density of each connected domain in the data to be processed and select the connected domain with the largest density value.

Step 5: Decide whether the nose-tip region has been found: if the density of the current region exceeds the predefined thr, that region is the nose-tip region; otherwise return to step 1 and restart the loop.
Preferably, in the above method, the main steps of the ICP algorithm comprise:

Determining the pair of matching data sets: select a reference data point set P from the three-dimensional nose-tip data of the reference template, and then use the nearest point-to-point distances to select the data point set Q of the input three-dimensional face that matches the reference data.

Computing the rigid motion parameters, i.e. the rotation matrix R and the translation vector t: when the determinant of X equals 1, R = X; and $t=\bar{P}-R\bar{Q}$, where $\bar{P}$ and $\bar{Q}$ are the centroids of the two sets.

Judging whether the three-dimensional data sets are registered from the error between the rigidly transformed data set RQ + t and the reference data set P. After registration, the Euclidean distance between the input data and the three-dimensional face model data in the enrollment gallery is computed as

$$d(P,Q)=\frac{1}{N}\sum_{i=1}^{N}\left\|p_i-q_i\right\|,$$

where P and Q are the feature point sets to be matched, each containing N feature points.
Preferably, in the above method, step B4 is specifically:

segmenting the three-dimensional face depth image into local texture regions;

mapping each Gabor filter-response vector, according to its position, to a word of the corresponding visual sub-dictionary, and on this basis building the visual dictionary histogram vector as the feature representation of the three-dimensional face;

using a nearest-neighbor classifier for the final face recognition, with the L1 distance as the distance metric.
Compared with the prior art, the present invention has the following technical effects:

With the solution of the present invention, the multi-modal system acquires two-dimensional gray-level information and three-dimensional depth information, exploits the advantages of both, and uses a fusion strategy to overcome some inherent weaknesses of single-modal systems (such as the illumination sensitivity of grayscale images and the expression sensitivity of depth images), effectively improving the performance of the face recognition system and making face recognition more accurate and faster.
Brief Description of the Drawings

Figure 1 is a flow chart of the present invention;

Figure 2 is a block diagram of the system of the present invention;

Figure 3 is a schematic diagram of three-dimensional face nose-tip localization according to the present invention;

Figure 4 is a schematic diagram of three-dimensional face space mapping according to the present invention;

Figure 5 is a schematic diagram of three-dimensional face depth representation feature extraction according to the present invention;

Figure 6 is a schematic diagram of two-dimensional face eye detection according to the present invention;

Figure 7 is a schematic diagram of two-dimensional face LBP features according to the present invention;

Figure 8 is a schematic diagram of two-dimensional face grayscale representation feature extraction according to the present invention;

Figure 9 is a schematic diagram of the fusion algorithm for the scores of different modalities according to the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention discloses a multi-modal face recognition device based on multi-layer fusion of gray-level and depth information, comprising a calculation unit for face recognition on gray-level information, a calculation unit for face recognition on depth information, a calculation unit for fusing the multi-modal face recognition scores, and a classifier calculation unit for classifying the data.
The gray-level face recognition calculation unit specifically includes a human eye detection unit, a two-dimensional data registration calculation unit, a grayscale face feature extraction unit, and a grayscale face recognition score calculation unit.
The depth face recognition calculation unit specifically includes a nose-tip detector unit, a three-dimensional data registration calculation unit, a depth face feature extraction unit, and a depth face recognition score calculation unit.
The present invention also discloses a multi-modal face recognition method based on multi-layer fusion of gray-level and depth information. As shown in Figure 9, the disclosed multi-modal fusion system takes multiple data sources: a two-dimensional grayscale image and a three-dimensional depth image. For the grayscale image, feature points (the eyes) are detected first and the obtained feature point positions are used for registration; after registration, the LBP algorithm extracts an LBP histogram feature, which is fed to the grayscale image classifier to obtain a matching score. For the three-dimensional depth data, feature point detection (the nose tip) is performed first and the detected feature points are used for registration; the registered three-dimensional data are then mapped to a face depth image, from which the visual dictionary algorithm extracts a visual dictionary histogram feature that is fed to the depth image classifier to obtain a matching score. The multi-modal system uses a decision-level fusion strategy: after the matching scores of the data sources are obtained, they are normalized, and a fusion strategy applied to the normalized scores yields the multi-modal fused matching score, realizing multi-modal face recognition.
As shown in Figure 6, the eye regions are obtained with a human eye detector. The detector is a cascaded classifier in which every layer is a strong classifier (e.g., one trained with AdaBoost) and each layer filters out part of the non-eye regions; the image region that survives all layers is the eye region. The advantage of the cascade is that the first few layers contain few features and are therefore fast to evaluate, while the later, more complex layers only need to process the relatively small image region that remains. Through this mechanism the cascaded classifier achieves real-time detection performance. The AdaBoost algorithm can be summarized as follows:
Given a training sample set $S=\{(x_1,y_1),\ldots,(x_m,y_m)\}$ and a weak classifier space $\mathcal{H}$, where $x_i\in\chi$ is a sample vector, $y_i=\pm 1$ is the class label, and $m$ is the total number of samples, initialize the sample probability distribution

$$D_1(i)=\frac{1}{m},\quad i=1,\ldots,m.$$

For $t=1,\ldots,T$, each weak classifier $h\in\mathcal{H}$ is processed as follows:

Partition the sample space $\chi$ into $X_1,X_2,\ldots,X_n$, and on each cell $X_j$ set

$$h(x)=\frac{1}{2}\ln\frac{W_{+}^{j}+\varepsilon}{W_{-}^{j}+\varepsilon},\quad x\in X_j,\qquad W_{\pm}^{j}=\sum_{i:\,x_i\in X_j,\;y_i=\pm 1}D_t(i),$$

where $\varepsilon$ is a small positive constant.

Compute the normalization factor

$$Z=2\sum_{j=1}^{n}\sqrt{W_{+}^{j}W_{-}^{j}}.$$

Select the $h_t$ in the weak classifier space that minimizes $Z$.

Update the training sample probability distribution

$$D_{t+1}(i)=\frac{D_t(i)\exp\!\big(-y_i h_t(x_i)\big)}{Z_t},\quad i=1,\ldots,m,$$

where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a probability distribution.

The final strong classifier H is

$$H(x)=\operatorname{sign}\!\left(\sum_{t=1}^{T}h_t(x)\right).$$
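The update rule above matches the domain-partitioning (real) AdaBoost formulation, and the following Python sketch illustrates one boosting round under that reading. It is an illustration only, not the patented implementation; every name in it (partition_weak_classifier, adaboost_round, and so on) is an assumption of this sketch.

```python
import numpy as np

def partition_weak_classifier(bins, y, D, eps=1e-6):
    """One domain-partitioning weak classifier.

    bins : partition index X_j assigned to each sample
    y    : labels in {-1, +1}
    D    : current sample probability distribution D_t
    Returns the per-partition outputs h_j and the factor Z to be minimized.
    """
    n_bins = bins.max() + 1
    W_pos = np.zeros(n_bins)  # sum of D_t(i) over samples with y_i = +1 in X_j
    W_neg = np.zeros(n_bins)
    for j in range(n_bins):
        in_j = (bins == j)
        W_pos[j] = D[in_j & (y == 1)].sum()
        W_neg[j] = D[in_j & (y == -1)].sum()
    # h(x) = 0.5 * ln((W+ + eps) / (W- + eps)) on partition X_j;
    # eps is the small positive constant from the text.
    h = 0.5 * np.log((W_pos + eps) / (W_neg + eps))
    Z = 2.0 * np.sqrt(W_pos * W_neg).sum()     # normalization factor Z
    return h, Z

def adaboost_round(candidate_bins, y, D):
    """Pick the candidate partition minimizing Z and reweight the samples."""
    best = min((partition_weak_classifier(b, y, D) + (b,) for b in candidate_bins),
               key=lambda r: r[1])
    h, _, bins = best
    margin = y * h[bins]                       # y_i * h_t(x_i)
    D_new = D * np.exp(-margin)
    return h, bins, D_new / D_new.sum()        # renormalize so D_{t+1} is a distribution
```

Iterating adaboost_round for T rounds and summing the selected h_t gives the strong classifier H(x) = sign(sum_t h_t(x)) described above.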
As shown in Figures 7 and 8, the obtained eye region positions are used for registration, and the LBP algorithm is applied to the registered data to obtain the LBP histogram feature. The LBP operator compares each pixel with its neighborhood pixels, taking the value

$$LBP_{P,R}=\sum_{p=0}^{P-1}s(g_p-g_c)\,2^{p},\qquad s(x)=\begin{cases}1,&x\ge 0\\0,&x<0.\end{cases}$$

With P = 8 and R = 1, some LBP values carry texture meaning, as shown in panel (c) of Figure 7: the first pattern represents a texture bright spot, the second a texture edge, and the third a texture dark spot or a smooth texture region. Following the statistical distribution of textures, the resulting LBP values are grouped into 59 classes, and these 59 classes form the bins of the statistical feature vector (the LBP histogram feature). This combination of the descriptive power of local texture and the robustness of histograms achieves good recognition performance in face recognition.
For the input two-dimensional face data, the key points are first extracted by eye detection, and the face image is then adjusted to a frontal upright pose by a rigid transformation according to the eye positions. The LBP histogram feature is extracted from the registered grayscale image and fed to the grayscale image classifier to obtain the grayscale matching score.
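A minimal sketch of the LBP(8,1) histogram extraction described above, assuming the common 59-bin uniform-pattern mapping (58 uniform patterns plus one bin for all remaining codes). The function name and the exact binning are illustrative, not taken from the patent.

```python
import numpy as np

def lbp_histogram(gray):
    """LBP_{8,1} histogram with a 59-bin uniform mapping.

    gray : 2-D uint8 array (a registered face crop).
    Patterns with at most two circular 0/1 transitions are 'uniform'
    (58 of them); all others share one extra bin, giving 59 bins.
    """
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    # 8 neighbors at radius 1, in a fixed circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offs):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code += ((nb - c) >= 0).astype(np.int32) << p   # s(g_p - g_c) * 2^p

    def is_uniform(v):
        bits = [(v >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

    # map the 256 raw codes to 59 bins: 58 uniform patterns + 1 catch-all
    uniform_codes = [v for v in range(256) if is_uniform(v)]
    lut = np.full(256, 58)                  # non-uniform patterns share bin 58
    lut[uniform_codes] = np.arange(len(uniform_codes))
    hist = np.bincount(lut[code.ravel()], minlength=59)
    return hist / hist.sum()
```

In practice the face crop would be divided into blocks and the per-block histograms concatenated, as Figure 8 suggests; a single global histogram is shown here for brevity.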
As shown in Figure 3, for the three-dimensional depth data, the nose-tip region of the face is detected first through the following steps (a sketch of this search loop is given after the list):

determine the threshold of the regional average negative effective energy density, defined as thr;

use the depth information to select the data to be processed, extracting the face data within a certain depth range;

compute the normal vectors of the face data selected by depth;

compute the regional average negative effective energy density: following its definition, compute the average negative effective energy density of each connected domain in the data to be processed and select the connected domain with the largest density value;

decide whether the nose-tip region has been found: if the density of the current region exceeds the predefined thr, that region is the nose-tip region; otherwise restart the selection.
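The patent does not spell out the formula for the regional average negative effective energy density, so the sketch below only reproduces the control flow of steps 1 through 5 on a depth image. A curvature-based surrogate stands in for that density; treat the surrogate, the band-stepping scheme, and all names as assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage

def detect_nose_tip_region(depth, thr, band=0.05):
    """Sketch of the nose-tip search loop on a dense depth image.

    depth : 2-D float array of depth values (no holes assumed)
    thr   : density threshold from step 1
    Returns a boolean mask of the accepted region, or None.
    """
    gy, gx = np.gradient(depth)              # surface slopes (stand-in for step 3)
    gyy = np.gradient(gy, axis=0)
    gxx = np.gradient(gx, axis=1)
    energy = (gxx + gyy) ** 2                # assumed surrogate for the density
    z, z_max = float(depth.min()), float(depth.max())
    while z < z_max:
        mask = (depth >= z) & (depth < z + band)      # step 2: depth band
        labels, n = ndimage.label(mask)               # step 4: connected domains
        if n > 0:
            scores = ndimage.mean(energy, labels, index=np.arange(1, n + 1))
            best = int(np.argmax(scores))
            if scores[best] > thr:                    # step 5: threshold test
                return labels == best + 1             # nose-tip region mask
        z += band                                     # shift the band and retry
    return None
```

The point of the sketch is the loop structure: select a depth band, score connected domains, accept the densest domain only when it clears thr, and otherwise continue the search.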
As shown in Figure 4, the acquired nose-tip region is used for registration. The present invention uses the ICP algorithm for data registration: the reference data point set P is first selected from the three-dimensional nose-tip data of the reference template, and the nearest point-to-point distances are then used to select the data point set Q of the input three-dimensional face that matches the reference data. The 3×3 matrix

$$H=\sum_{i=1}^{N}q_i'\,p_i'^{T}$$

is computed first, where N is the capacity of the data set and $p_i'$, $q_i'$ denote the centered points. The SVD decomposition of the H matrix is then taken:

$$H=U\Lambda V^{T},\qquad X=VU^{T}.$$

The rotation matrix R and the translation vector t are computed as follows: when the determinant of X equals 1, R = X, and $t=\bar{P}-R\bar{Q}$, where $\bar{P}$ and $\bar{Q}$ are the centroids of the two sets.

Whether the two three-dimensional data sets are registered is judged from the error between the rigidly transformed data set RQ + t and the reference data set P: when the error falls below a threshold, the two sets are registered; otherwise the procedure restarts from the first step until the data set pair is registered.

With the adaptive feature point sampling and ICP registration above, the distance function is

$$d(P,Q)=\frac{1}{N}\sum_{i=1}^{N}\left\|p_i-q_i\right\|,$$

where P and Q are the feature point sets to be matched, each containing N feature points. Because the feature point sampling density varies, the Euclidean distance between the input data and the three-dimensional face model data in the enrollment gallery, computed after registration, is normalized by the number of effective feature points.
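A compact sketch of the rigid-alignment step above using the SVD solution (H = UΛVᵀ, X = VUᵀ), assuming centered point sets and an already-established nearest-neighbor correspondence. The iterative re-matching of ICP is omitted, and the reflection case det(X) = -1 is handled with the usual sign flip rather than the patent's restart.

```python
import numpy as np

def rigid_align(P, Q):
    """P, Q : N x 3 arrays of already-corresponded points.

    Returns the rotation R, translation t and the per-point mean
    Euclidean distance after alignment (the normalized distance above).
    """
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - q_bar).T @ (P - p_bar)     # 3x3 matrix H = sum q_i' p_i'^T
    U, _, Vt = np.linalg.svd(H)         # H = U Lambda V^T
    X = Vt.T @ U.T                      # X = V U^T
    if np.linalg.det(X) < 0:            # guard the reflection case
        Vt[-1] *= -1
        X = Vt.T @ U.T
    R = X
    t = p_bar - R @ q_bar               # t = mean(P) - R * mean(Q)
    aligned = Q @ R.T + t
    d = np.linalg.norm(P - aligned, axis=1).mean()   # normalized by N
    return R, t, d
```

A full ICP loop would alternate this solve with re-finding nearest-neighbor pairs until the error d stops decreasing.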
As shown in Figure 4, after registration the depth image is first obtained from the depth information; a filter then compensates and denoises the noise points (spike or hole points) in the mapped depth image; finally, the expression-robust region is selected to give the final three-dimensional face depth image.
As shown in Figure 5, after a test face image is input and Gabor-filtered, each filter-response vector is compared with all primitive words in the visual sub-dictionary corresponding to its position and, by distance matching, mapped to the closest primitive. In this way the visual dictionary histogram feature of the original depth image is extracted. The overall procedure is summarized as follows (see also the sketch after this list):

segment the three-dimensional face depth image into local texture regions;

map each Gabor filter-response vector, according to its position, to a word of the corresponding visual sub-dictionary, and on this basis build the visual dictionary histogram vector as the feature representation of the three-dimensional face;

use a nearest-neighbor classifier for the final face recognition, with the L1 distance as the distance metric.
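A sketch of the visual dictionary histogram extraction and the L1 nearest-neighbor match. Euclidean distance is assumed for the word assignment, since the text only says "distance matching"; the data layout and names are illustrative.

```python
import numpy as np

def visual_dictionary_histogram(responses, positions, dictionaries):
    """Map Gabor responses to position-specific dictionary words.

    responses    : list of Gabor filter-response vectors, one per image site
    positions    : index of the sub-dictionary for each site
    dictionaries : dictionaries[p] is a (K_p, D) array of primitive words
    Returns one histogram over all words, concatenated across positions.
    """
    offsets = np.cumsum([0] + [d.shape[0] for d in dictionaries])
    hist = np.zeros(offsets[-1])
    for r, p in zip(responses, positions):
        words = dictionaries[p]
        nearest = np.argmin(np.linalg.norm(words - r, axis=1))  # closest primitive
        hist[offsets[p] + nearest] += 1
    return hist / max(hist.sum(), 1)

def nn_classify_l1(probe_hist, gallery):
    """Nearest-neighbor identification with the L1 distance, as in the text.

    gallery : dict mapping subject id -> enrolled histogram
    """
    dists = {sid: np.abs(probe_hist - h).sum() for sid, h in gallery.items()}
    return min(dists, key=dists.get)
```

Histograms built this way are sparse and position-aware, which is why a simple L1 nearest-neighbor rule suffices as the final classifier.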
As shown in Figure 9, the present invention normalizes the scores of the two-dimensional gray-level information and the three-dimensional depth information with the max-min linear normalization principle:

$$\hat{S}_k=\frac{S_k-\min}{\max-\min}.$$

Unlike the conventional max-min normalization, here max represents a position far away in the distance space and is therefore easily affected by noise (such as hair occlusion in three-dimensional faces), so max is taken as the value at the 95% position of the single-modality score set $\{S_k\}$ sorted in ascending order. Since min represents a position close in the distance space, it is not affected by data noise (noise would only make an affected value larger), so min is taken as the minimum of the ascending-sorted single-modality score set $\{S_k\}$. $S_k$ is a matching score of the modality and $\hat{S}_k$ is the normalized matching score of that modality.
After score normalization, the matching scores of the different modalities are fused with a robust weighted-addition rule:

$$S=\sum_{k}\omega_k\hat{S}_k,$$

yielding the matching score after multi-modal data fusion. The weights are obtained with the linear discriminant analysis (LDA) algorithm, which uses the class information of the data and maximizes the objective function built from the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$:

$$J(W)=\frac{\left|W^{T}S_{B}W\right|}{\left|W^{T}S_{W}W\right|}.$$

The resulting LDA mapping matrix W gives the weights.
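A sketch of the score-level fusion chain, assuming two-class (genuine versus impostor) LDA for the weights, which is one common reading of the LDA step; the 95th-percentile max and the weighted addition follow the formulas above. Names and the training-data layout are assumptions of this sketch.

```python
import numpy as np

def robust_min_max(scores):
    """min = smallest score; max = the score at the 95% position of the
    ascending-sorted set, as described above (robust to occlusion noise)."""
    s = np.sort(np.asarray(scores, dtype=float))
    lo, hi = s[0], s[int(0.95 * (len(s) - 1))]
    return (np.asarray(scores, dtype=float) - lo) / (hi - lo)

def lda_fusion_weights(genuine, impostor):
    """Two-class Fisher LDA on score vectors.

    genuine, impostor : arrays with one row per training sample and one
    column per modality.  w is proportional to S_W^{-1} (mu_g - mu_i).
    """
    mu_g, mu_i = genuine.mean(axis=0), impostor.mean(axis=0)
    Sw = np.cov(genuine, rowvar=False) + np.cov(impostor, rowvar=False)
    w = np.linalg.solve(Sw, mu_g - mu_i)
    return w / np.abs(w).sum()

def fuse(score_sets, w):
    """Weighted addition: S = sum_k w_k * normalized(S_k)."""
    return sum(wk * robust_min_max(sk) for wk, sk in zip(w, score_sets))
```

For the two modalities of this patent, score_sets would hold the grayscale and the depth matching scores, and the fused score S drives the final accept or identify decision.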
With the solution of the present invention, the multi-modal system acquires two-dimensional gray-level information and three-dimensional depth information, exploits the advantages of both, and uses the fusion strategy to overcome some inherent weaknesses of single-modal systems (such as the illumination sensitivity of grayscale images and the expression sensitivity of depth images), effectively improving the performance of the face recognition system and making face recognition more accurate and faster.
It is apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalents of the claims are intended to be embraced therein. No reference sign in the claims shall be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is merely for clarity; the specification should be taken as a whole, and the technical solutions in the embodiments may be combined appropriately to form other implementations understandable to those skilled in the art.

Claims (10)

  1. A multi-modal face recognition device based on multi-layer fusion of gray-level and depth information, characterized by comprising: a calculation unit for face recognition on gray-level information; a calculation unit for face recognition on depth information; a calculation unit for fusing the multi-modal face recognition scores; and a classifier calculation unit for classifying the data.
  2. The multi-modal face recognition device based on multi-layer fusion of gray-level and depth information according to claim 1, characterized in that the calculation unit for face recognition on gray-level information comprises: a human eye detection unit, a two-dimensional data registration calculation unit, a grayscale face feature extraction unit, and a grayscale face recognition score calculation unit.
  3. The multi-modal face recognition device based on multi-layer fusion of gray-level and depth information according to claim 1, characterized in that the calculation unit for face recognition on depth information comprises: a nose-tip detector unit, a three-dimensional data registration calculation unit, a depth face feature extraction unit, and a depth face recognition score calculation unit.
  4. A multi-layer fusion multi-modal face recognition method based on gray level and depth information, characterized by comprising the following steps:
    A. recognizing the face gray information;
    B. recognizing the face depth information;
    C. normalizing the face gray information and depth information matching scores and, based on the normalized matching scores, applying a fusion strategy to obtain the multi-modal fused matching score, thereby realizing multi-modal face recognition.
  5. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 4, wherein step A comprises the following steps:
    A1. Feature region localization: the human eye regions are located with a human eye detector. The detector is a boosted classifier H obtained by the following algorithm (a Python sketch of this training loop follows the claim):
    Given a training sample set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ and a weak classifier space $\mathcal{H} = \{h : \chi \rightarrow \mathbb{R}\}$, where $x_i \in \chi$ is a sample vector, $y_i = \pm 1$ is its class label and $m$ is the total number of samples, initialize the sample probability distribution as
$$ D_1(i) = \frac{1}{m}, \qquad i = 1, \ldots, m. $$
    For $t = 1, \ldots, T$, perform the following for each weak classifier $h$ in $\mathcal{H}$:
    partition the sample space $\chi$ into $X_1, X_2, \ldots, X_n$ and, on each cell $X_j$, set
$$ W_{\pm 1}^{j} = \sum_{i:\, x_i \in X_j,\, y_i = \pm 1} D_t(i), \qquad h(x) = \frac{1}{2} \ln \frac{W_{+1}^{j} + \varepsilon}{W_{-1}^{j} + \varepsilon} \quad \text{for } x \in X_j, $$
    where $\varepsilon$ is a small positive constant;
    compute the normalization factor
$$ Z = 2 \sum_{j=1}^{n} \sqrt{W_{+1}^{j} W_{-1}^{j}}; $$
    select the $h_t$ in the weak classifier space that minimizes $Z$:
$$ h_t = \arg\min_{h \in \mathcal{H}} Z; $$
    update the training sample probability distribution
$$ D_{t+1}(i) = \frac{D_t(i) \exp(-y_i h_t(x_i))}{Z_t}, \qquad i = 1, \ldots, m, $$
    where
$$ Z_t = \sum_{i=1}^{m} D_t(i) \exp(-y_i h_t(x_i)) $$
    is a normalization factor making $D_{t+1}$ a probability distribution.
    The final strong classifier H is
$$ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} h_t(x) \right). $$
    A2. Registration is performed using the obtained eye-region positions, and the LBP algorithm is applied to the eye-region data to obtain the LBP histogram feature, whose value is given by
$$ LBP(x_c, y_c) = \sum_{p=0}^{P-1} 2^{p}\, s(i_p - i_c), \qquad s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0, \end{cases} $$
    where $i_c$ is the gray value of the center pixel $(x_c, y_c)$ and $i_p$ are those of its $P$ neighbours. This feature is fed to the grayscale image classifier to obtain the grayscale matching score.
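As an illustration of the training loop in step A1, here is a minimal Python sketch of confidence-rated boosting with domain partitioning as reconstructed above. The choice of weak classifiers (binning a single feature into a fixed number of cells) and all function and variable names are assumptions made for the example; the claim itself does not fix them.

```python
import numpy as np

def train_boosted_detector(X, y, T=50, n_bins=16, eps=1e-6):
    """Domain-partitioning boosting (a sketch of step A1).

    X: (m, d) sample vectors; y: (m,) labels in {-1, +1}.
    Each candidate weak classifier bins one feature into n_bins cells X_j.
    Returns a list of (feature index, bin edges, per-cell confidences h).
    """
    m, d = X.shape
    D = np.full(m, 1.0 / m)                  # initial distribution D_1(i) = 1/m
    strong = []
    for _ in range(T):
        best = None
        for f in range(d):                   # each candidate weak classifier h
            edges = np.histogram_bin_edges(X[:, f], bins=n_bins)
            cells = np.digitize(X[:, f], edges[1:-1])   # partition of the space
            # W_b^j: distribution mass of class b falling into cell X_j
            W_pos = np.bincount(cells[y == 1], weights=D[y == 1], minlength=n_bins)
            W_neg = np.bincount(cells[y == -1], weights=D[y == -1], minlength=n_bins)
            Z = 2.0 * np.sqrt(W_pos * W_neg).sum()      # normalization factor Z
            if best is None or Z < best[0]:
                h = 0.5 * np.log((W_pos + eps) / (W_neg + eps))  # per-cell output
                best = (Z, f, edges, h, cells)
        _, f, edges, h, cells = best          # h_t minimizing Z
        D = D * np.exp(-y * h[cells])         # reweight the training samples
        D /= D.sum()                          # normalize by Z_t
        strong.append((f, edges, h))
    return strong

def predict(strong, X):
    """Strong classifier H(x) = sign(sum_t h_t(x))."""
    score = np.zeros(len(X))
    for f, edges, h in strong:
        score += h[np.digitize(X[:, f], edges[1:-1])]
    return np.sign(score)
```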
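Step A2's LBP histogram can be sketched just as briefly. This assumes the standard 8-neighbour LBP operator on a grayscale eye-region patch, matching the formula above with P = 8; the 256-bin histogram is an illustrative choice.

```python
import numpy as np

def lbp_histogram(patch):
    """LBP(x_c, y_c) = sum_{p=0}^{7} 2^p * s(i_p - i_c), s(x) = 1 iff x >= 0,
    followed by a 256-bin histogram over the patch (a sketch of step A2)."""
    patch = patch.astype(np.int32)
    c = patch[1:-1, 1:-1]                                   # center pixels i_c
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),           # 8 neighbours i_p
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for p, (dy, dx) in enumerate(shifts):
        n = patch[1 + dy:patch.shape[0] - 1 + dy,
                  1 + dx:patch.shape[1] - 1 + dx]
        codes += (n >= c).astype(np.int32) << p             # s(i_p - i_c) * 2^p
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)                        # normalized histogram
```

The normalized histogram is the feature vector handed to the grayscale classifier.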
  6. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 4, wherein step B comprises the following steps:
    B1. feature region localization: determining the position of the nose tip region of the face;
    B2. for three-dimensional data in different poses, after the registration reference region is obtained, registering the data according to the ICP algorithm; after registration is completed, computing the Euclidean distance between the input data and the three-dimensional face model data in the gallery;
    B3. acquiring the depth image according to the depth information, compensating and denoising the noise points in the mapped depth image with a filter, and finally selecting the expression-robust region to obtain the final three-dimensional face depth image (a small sketch of this step follows the claim);
    B4. extracting the visual dictionary histogram feature vector of the three-dimensional depth image: after a test face image is input and Gabor-filtered, each filter response vector is compared with all the primitive words of the visual sub-dictionary corresponding to its location and mapped, by distance matching, to the closest primitive; the visual dictionary histogram feature of the original depth image is thereby extracted and fed to the depth image classifier to obtain the matching score.
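Step B3 combines depth-image mapping, noise compensation and region selection. A minimal Python sketch, assuming median filtering as the compensation filter and a precomputed boolean mask for the expression-robust region (the claim fixes neither):

```python
import numpy as np
from scipy import ndimage

def make_depth_image(depth_map, robust_mask):
    """Step B3 sketch: fill holes and denoise the mapped depth image,
    then keep only the expression-robust region (e.g. nose/upper face)."""
    d = depth_map.astype(np.float32)
    holes = d == 0                                # unmeasured points
    filled = ndimage.median_filter(d, size=5)     # neighbourhood estimate
    d[holes] = filled[holes]                      # compensate missing depth
    d = ndimage.median_filter(d, size=3)          # light final denoising
    out = np.zeros_like(d)
    out[robust_mask] = d[robust_mask]             # expression-robust region only
    return out
```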
  7. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 4, wherein step C specifically comprises:
    The matching scores of the two-dimensional gray information and the three-dimensional depth information are first normalized according to the max-min linear normalization rule:
$$ s' = \frac{s - \min(S)}{\max(S) - \min(S)} $$
    where $S$ is the set of raw matching scores. After score normalization, the matching scores of the different modalities are fused with the comparatively robust weighted-addition rule:
$$ s_{\mathrm{fused}} = w_{2D}\, s'_{2D} + w_{3D}\, s'_{3D} $$
    After the fused multi-modal matching score is obtained, a linear discriminant analysis (LDA) algorithm constructs the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ and maximizes the objective function
$$ J(W) = \frac{\left| W^{T} S_B W \right|}{\left| W^{T} S_W W \right|} $$
    The resulting LDA mapping matrix $W$ gives the weights.
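A minimal sketch of this score-level fusion, assuming two raw score arrays per probe (grayscale and depth) and using scikit-learn's LDA to obtain the projection whose components serve as the modality weights; the equal default weights and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def minmax_normalize(scores):
    """Max-min linear normalization: s' = (s - min) / (max - min)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def fuse_scores(gray_scores, depth_scores, w2d=0.5, w3d=0.5):
    """Weighted-sum fusion of the normalized per-modality matching scores."""
    return w2d * minmax_normalize(gray_scores) + w3d * minmax_normalize(depth_scores)

def lda_fusion_weights(score_pairs, labels):
    """Fit LDA on (gray, depth) score pairs with genuine/impostor labels;
    the projection W maximizes |W^T S_B W| / |W^T S_W W| and its
    components are used as the fusion weights."""
    lda = LinearDiscriminantAnalysis().fit(np.asarray(score_pairs), np.asarray(labels))
    w = lda.coef_[0]
    return w / np.abs(w).sum()
```

For example, `fuse_scores(g, d, *lda_fusion_weights(train_pairs, train_labels))` would fuse new score pairs with the learned weights.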
  8. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 6, wherein step B1 specifically comprises:
    Step 1: determining the threshold: setting the threshold of the regional average negative effective energy density, defined as thr;
    Step 2: selecting the data to be processed using depth information: using the depth information of the data, extracting the face data within a certain depth range as the data to be processed;
    Step 3: computing the normal vectors: computing the normal vector information of the face data selected by the depth information;
    Step 4: computing the regional average negative effective energy density: according to its definition, computing the average negative effective energy density of each connected region in the data to be processed and selecting the connected region with the largest density value;
    Step 5: determining whether the nose tip region has been found: when the score of the current region is greater than the predefined thr, that region is the nose tip region; otherwise, return to Step 1 and restart the loop.
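The claim leaves the exact definition of the average negative effective energy density to the description, so the sketch below only fixes the control flow of Steps 1-5: depth gating, normal estimation, scoring of connected regions, and the threshold test. The scoring function used here (mean alignment of surface normals with the viewing axis) is a placeholder assumption, and the retry loop of Step 5 is reduced to returning None.

```python
import numpy as np
from scipy import ndimage

def find_nose_tip(depth, thr, z_near=0.3, z_far=1.2):
    """Steps 1-5 of claim 8: gate by depth, estimate normals,
    score each connected region, accept the best if it exceeds thr."""
    mask = (depth > z_near) & (depth < z_far)        # step 2: data to process
    gy, gx = np.gradient(depth)                      # step 3: normals from
    norm = np.sqrt(gx ** 2 + gy ** 2 + 1.0)          # depth gradients
    nz = 1.0 / norm                                  # z-component of unit normal

    labeled, n = ndimage.label(mask)                 # connected regions
    best_score, best_region = -np.inf, None
    for r in range(1, n + 1):                        # step 4: score each region
        region = labeled == r
        score = nz[region].mean()                    # placeholder density measure
        if score > best_score:
            best_score, best_region = score, region
    # step 5: the region is the nose tip only if its score clears thr
    return best_region if best_score > thr else None
```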
  9. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 6, wherein the main steps of the ICP algorithm comprise:
    determining the matched data set pair: selecting the reference data point set P from the three-dimensional nose tip data of the reference template, and then using the nearest point-to-point distance to select, from the input three-dimensional face, the data point set Q that matches the reference data;
    computing the rigid motion parameters, namely the rotation matrix R and the translation vector t: when the determinant of X, the orthogonal matrix obtained from the singular value decomposition of the cross-covariance matrix of the matched point sets, equals 1, then $R = X$ and $t = \bar{P} - R\bar{Q}$, where $\bar{P}$ and $\bar{Q}$ denote the centroids of P and Q;
    judging, from the error between the rigidly transformed data set $RQ + t$ and the reference data set $P$, whether the three-dimensional data sets are registered; after registration, the Euclidean distance between the input data and the three-dimensional face model data in the gallery is computed as
$$ d(P, Q) = \frac{1}{N} \sum_{i=1}^{N} \left\| p_i - q_i \right\| $$
    where $P$ and $Q$ are the sets of feature points to be matched, each containing $N$ feature points.
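A compact sketch of this ICP loop with the closed-form (SVD-based) solution for R and t that matches the det(X) = 1 test in the claim; the nearest-neighbour matching, the iteration count and the data layout (N x 3 arrays) are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_register(P, Q, iters=30):
    """Align the input point set Q to the reference set P (both (N, 3)).
    Each iteration: match nearest points, then solve R, t in closed form."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        Qt = Q @ R.T + t
        idx = cKDTree(P).query(Qt)[1]          # matched data set pair
        Pm = P[idx]
        Pc, Qc = Pm - Pm.mean(0), Q - Q.mean(0)
        U, _, Vt = np.linalg.svd(Pc.T @ Qc)    # SVD of the cross-covariance
        X = U @ Vt
        if np.linalg.det(X) < 0:               # enforce det(X) = 1 (no reflection)
            U[:, -1] *= -1
            X = U @ Vt
        R, t = X, Pm.mean(0) - X @ Q.mean(0)   # t = mean(P) - R * mean(Q)
    Qt = Q @ R.T + t                           # registered data set RQ + t
    d = np.linalg.norm(P[cKDTree(P).query(Qt)[1]] - Qt, axis=1).mean()
    return R, t, d                             # d: mean Euclidean distance
```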
  10. The gray level and depth information based multi-layer fusion multi-modal face recognition method according to claim 6, wherein step B4 specifically comprises:
    dividing the three-dimensional face depth image into a number of local texture regions;
    mapping each Gabor filter response vector, according to its location, to a word of the corresponding visual sub-dictionary, and on this basis building the visual dictionary histogram vector as the feature representation of the three-dimensional face;
    using a nearest neighbour classifier to obtain the recognition score of the three-dimensional face recognition, with the L1 distance selected as the distance metric.
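Finally, a sketch of the feature mapping and matching described in claim 10, assuming the per-location visual sub-dictionaries have already been learned (for instance by clustering Gabor response vectors on training data); the array layouts and names are illustrative.

```python
import numpy as np

def dictionary_histogram(responses, dictionaries):
    """Map each Gabor response vector to the closest word of the visual
    sub-dictionary at its location; histogram the chosen word indices.

    responses:    (L, d) array, one response vector per local region
    dictionaries: list of L arrays, each (k, d), the sub-dictionary per region
    """
    k = dictionaries[0].shape[0]
    hist = np.zeros(len(dictionaries) * k)
    for loc, (vec, words) in enumerate(zip(responses, dictionaries)):
        j = np.argmin(np.linalg.norm(words - vec, axis=1))  # closest primitive
        hist[loc * k + j] += 1
    return hist

def nn_match(probe_hist, gallery_hists):
    """Nearest-neighbour matching with the L1 distance as the metric."""
    d = np.abs(np.asarray(gallery_hists) - probe_hist).sum(axis=1)
    i = int(np.argmin(d))
    return i, -d[i]          # negated distance as the matching score
```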
PCT/CN2015/074868 2015-01-07 2015-03-23 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method WO2016110005A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510006214.4A CN104598878A (en) 2015-01-07 2015-01-07 Multi-modal face recognition device and method based on multi-layer fusion of gray level and depth information
CN2015100062144 2015-01-07

Publications (1)

Publication Number Publication Date
WO2016110005A1 (en)

Family

ID=53124652


Country Status (2)

Country Link
CN (1) CN104598878A (en)
WO (1) WO2016110005A1 (en)


Also Published As

Publication number Publication date
CN104598878A (en) 2015-05-06


Legal Events

Code 121: EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 15876513; country of ref document: EP; kind code of ref document: A1.
Code NENP: non-entry into the national phase. Ref country code: DE.
Code 122: EP: PCT application non-entry in European phase. Ref document number: 15876513; country of ref document: EP; kind code of ref document: A1.