CN115188084A - Multi-mode identity recognition system and method for non-contact voiceprint and palm print palm vein - Google Patents
- Publication number
- CN115188084A (application CN202210927661.3A)
- Authority
- CN
- China
- Prior art keywords
- palm
- module
- feature
- characteristic
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/70—Multimodal biometrics, e.g. combining information from different biometric modalities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/12—Fingerprints or palmprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/12—Fingerprints or palmprints
- G06V40/1347—Preprocessing; Feature extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/14—Vascular patterns
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/32—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
- H04L9/3226—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using a predetermined code, e.g. password, passphrase or PIN
- H04L9/3231—Biological data, e.g. fingerprint, voice or retina
Abstract
The invention discloses a multi-modal identity recognition system and method for non-contact voiceprint, palm print and palm vein, comprising: a power supply module, which supplies power to the entire multi-modal identity recognition system; a fixed-wavelength infrared LED light source module, which irradiates the human hand with an infrared LED light source and assists the image acquisition CCD module in acquiring the palm print and palm vein information features; an image acquisition CCD module, which acquires the palm print and palm vein information features of the human body; a voice acquisition module, which extracts voice information using MFCC features; a storage module, which stores the data acquired by the voice acquisition module and the image acquisition CCD module; and a multi-modal identity recognition module, which preprocesses the pictures, extracts picture features, fuses and compares the features, and outputs the result. The advantages of the invention are: authentication security is improved, the complexity of manually extracting features is reduced, resistance to noise interference is enhanced, and the robustness and portability of the system are improved.
Description
Technical Field
The invention relates to the technical field of biometric recognition, in particular to a non-contact multi-modal identity recognition system and method for voiceprint, palm print and palm vein.
Background
With the rapid development of global information industrialization, how to perform fast, accurate and secure identity recognition and verification in a digital environment has become a widely discussed topic in recent years. Traditional identity authentication credentials are easily lost, forgotten or forged, so biometric recognition technology has attracted increasing attention. Biometric recognition is the process of verifying the authenticity of identity information by collecting physiological and behavioral characteristics of the human body and processing them through a system [1]. At present, the more mature or widely applied biometric recognition technologies include face, voice, fingerprint, iris, finger vein, DNA, signature and gait recognition [2,3,4]. However, single-mode biometric recognition may suffer reduced accuracy due to sensor noise or unsuitable feature extraction or matching methods, and may also face security problems because features can be forged, such as fake fingerprints. Consequently, multimodal biometric recognition has come into view. Different descriptions or perspectives of the same object are called modalities, and multi-modal representation uses information from multiple such modalities together to characterize a particular task [3]. In general, a multi-modal biometric system fuses two or more biometric features at different levels, which can be divided into the sensor level, feature level, score level and decision level [5,6,7]. The research difficulty of multi-modal fusion authentication lies in how to effectively acquire, extract and compare the features of multi-source heterogeneous data.
Representation learning is a set of techniques that can effectively identify and exploit the distribution of raw, complex data according to the task at hand; that is, useful information is extracted from the data to learn its features, thereby greatly improving the effectiveness of algorithm models and the accuracy of predictors. Applied in a multi-modal data environment, representation learning can build models that process and associate information from multiple modalities to perform multi-modal information fusion, thereby improving the accuracy and security of identity authentication. The goal of multi-modal representation learning is to extract representations of data objects (users) from data of multiple heterogeneous modalities; a typical approach is to concatenate the individual representations of each modality into a joint representation, on which subsequent task learning is performed [8]. Fusing the data representations unifies the data from multiple sources, overcoming the heterogeneity between them, and complementary information can be extracted from the sources, so that the fused representation carries richer and more effective information than any single modality.
Fingerprint recognition identifies a person by the ridge-and-valley texture on the skin of the front fingertip; fingerprints are unique and stable, and real identity is verified by comparing a fingerprint with those pre-stored in a database. Among the various biometric techniques, fingerprint recognition remains the most mature; it has been officially accepted in many countries, has become an effective means of identification in the judicial field, has been widely used in many other industries, and has become a byword and de facto standard for biometric recognition. Fingerprint recognition mainly involves fingerprint image acquisition, fingerprint image preprocessing, fingerprint feature extraction, fingerprint database construction, and fingerprint feature comparison and matching. After years of research, various fingerprint recognition methods have emerged, among which minutiae-based fingerprint recognition is the most mature and most widely applied. The images used in the laboratory are acquired with the laboratory's existing equipment, and the specific steps include: (1) computing the fingerprint image orientation map; (2) segmenting the fingerprint image; (3) enhancing the fingerprint image; (4) binarizing and post-processing the fingerprint image; (5) thinning the fingerprint image; (6) extracting the features of the fingerprint image; (7) matching the fingerprint images.
Disadvantages of the first prior art
(1) High environmental requirements: recognition is sensitive to the cleanliness and humidity of the finger, and dirt, oil or water may prevent recognition or affect the result;
(2) Low-quality fingerprints, such as those with scars or peeling skin, are difficult to recognize and have a low recognition rate;
(3) The operating procedure during fingerprint capture must be followed strictly;
(4) Fingerprint traces may remain on the device, and these traces can be used to copy the fingerprint.
Prior art 2
Compared with other biometric technologies, fused palm print and palm vein recognition offers higher recognition accuracy, convenience and stability; it helps make daily life more convenient and, to a certain extent, improves the security of personal information.
The textures of the palm print and palm vein do not change with age, and palm print recognition has the advantages of rich texture features, easy user acceptance, and high security and stability.
Disadvantages of the second prior art
(1) Palm vein and palm print image acquisition environment. Palm vein acquisition is mainly either contact or non-contact; whichever mode is used, the acquisition process is affected by factors such as illumination, acquisition background and temperature.
(2) The influence of locating and segmenting the key palm vein region. To obtain a region rich in vein features, a palm region-of-interest (ROI) image must be located and segmented. Researchers generally use the palm vein images of the Hong Kong science university database for vein recognition research; because the palm must be fixed during acquisition in that database, with hardware positioned at the valley between the middle finger and the ring finger, the palm vein ROI image is difficult to locate and segment. The lack of a suitable ROI localization and segmentation method leads to low feature extraction accuracy and a low recognition rate.
(3) Interference of palm prints with palm veins. Palm vein images contain palm prints, and existing algorithms still cannot completely remove their interference; for example, fuzzy threshold judgment and global gray value matching improve the robustness of the algorithm but do not remove the palm print interference well, so the recognition performance for palm veins remains poor.
(4) Non-contact acquisition mainly suffers from position deviation, distance drift, image defocus and brightness fluctuation in the palm print sample images. As for anti-counterfeiting, palm print recognition is mainly attacked with forgeries such as silica gel prostheses and palm print films. These factors are the main reasons why the accuracy of non-contact palm print recognition systems is lower than that of contact systems, and also the main reasons limiting their practical deployment.
References
[1] Liu Qianying, Liu Ji. Biometric identification technology development in the field of authentication [J]. Electronic World, 2020(05): 23-24.
[2] Xie Lu, Yu Fei. Secure authentication technology based on multi-modal biometrics [J]. Secret Science and Technology, 2016(01): 36-40.
[3] Zhou Chenyi. Multimodal biometric identification based on fusion algorithms and deep learning [D]. Southern Medical University, 2020.
[4] Zhang Lou, Wang Huabin, Tao Liang, Zhou Jian. Adaptive multimodal biometric fusion based on classification distance scores [J]. Journal of Computer Research and Development, 2018, 55(1): 151-162.
[5] Ma Ruru. Bimodal identity authentication research based on fingerprints and electrocardiosignals [D]. Tianjin University of Science, 2021.
[6] Zhang Yue. Algorithmic study of multimodal biometric identification technology [D]. University of vinpocetine, 2017.
[7] Ding Xuan. Multimodal biometric identification technology and its standardization dynamics [J]. Computer Knowledge and Technology, 2017, 13(36): 153-154.
[8] Halbernet, Lu Kai. A survey of representation learning for complex heterogeneous data [J]. Computer Science, 2020, 47(02): 1-9.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a non-contact multi-modal identity recognition system and method for voiceprint, palm print and palm vein. With multi-modal biometric recognition for identity authentication based on intelligent data representation theory as its core, and with the related technologies integrated in a network security scenario, the invention offers high security, convenience and reliability.
To achieve this purpose, the technical scheme adopted by the invention is as follows:
a multi-modal identification system for non-contact voiceprint and palmprint metacarpal veins, comprising: the system comprises a power supply module, a fixed wavelength infrared LED light source module, an image acquisition CCD module, a voice acquisition module, a storage module and a multi-mode identity recognition module;
a power supply module: for powering the entire multimodal identity recognition system
Fixed wavelength infrared LED light source module: the human hand is irradiated by an infrared LED light source to assist the image acquisition CCD module in acquiring the information characteristics of the palm print and the palm vein of the human body;
image acquisition CCD module: collecting the information characteristics of the palm print and the palm vein of the human body;
the voice acquisition module: extracting voice information by using MFCC characteristics;
a storage module: the device is used for storing data acquired by the voice acquisition module and the image acquisition CCD module.
The multi-modal identity recognition module: and preprocessing the picture, extracting picture characteristics, fusing and comparing the characteristics and outputting a result.
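As a rough illustration of the MFCC features the voice acquisition module relies on, the sketch below implements the textbook MFCC pipeline (pre-emphasis, framing, power spectrum, mel filterbank, log, DCT-II) in NumPy; every parameter value (16 kHz rate, 26 mel bands, 13 coefficients, etc.) is a common default, not a value specified by the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Minimal MFCC extraction; all parameters are illustrative defaults."""
    # Pre-emphasis to boost high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_mels)))
    return feats @ dct.T

t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)   # 1 s of a 440 Hz tone as a toy signal
print(mfcc(tone).shape)              # (98, 13): 98 frames, 13 coefficients
```

In a real system the MFCC matrix (frames × coefficients) would be what the storage module keeps for later comparison.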
A multi-mode identity recognition method for non-contact voiceprints and palm print palm veins comprises the following steps:
Step 2, feature extraction. Feature extraction is divided into two parts: the first extracts voice features, and the second extracts the two hand features, palm print and palm vein. ResNet is adopted as the backbone, an SE module is introduced to construct an SE-ResNet network, the preprocessed pictures are input into the SE-ResNet network, a global pooling layer is added to generate the feature distribution, and the information encoding is completed. To obtain the correlation between channels, a ReLU activation function is combined with a sigmoid gating mechanism to recalibrate the features.
Step 3, feature fusion. A multi-layer feature fusion mechanism is adopted: the bilinear model is factorized to fuse and capture the interaction between the hand and audio modalities; paired audio and hand features are input into the fusion model, and the final result is output at the fully connected layer through softmax.
Step 4, feature comparison. For the feature points preliminarily extracted with the improved FAST corner detection algorithm, the corner response function of each point is calculated with the Shi-Tomasi algorithm, and the top N points with the largest response values are determined as the feature points. At least 2 strong boundaries in different directions exist around the screened feature points. For matching the binary feature description vectors, the Hamming distance is used as the similarity measure between descriptors.
Step 5, result output. The in-class sample feature points of the three modalities are judged with a joint-decision sparse coding algorithm, so that the intra-class distance is minimized and the inter-class distance is maximized. A suitable threshold is set according to the actual scenario: if the two matched samples belong to the same class and the voiceprint, palm print and palm vein all match successfully, the interface displays that authentication succeeded; otherwise it prompts that authentication failed.
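A minimal sketch of the step-5 decision rule, assuming each modality comparison yields a match score in [0, 1]; the modality names and the 0.8 threshold below are illustrative choices, not values fixed by the patent.

```python
def authenticate(scores, threshold=0.8):
    """Authentication succeeds only if the voiceprint, palm print and palm
    vein comparisons ALL clear the threshold, mirroring the all-three-match
    requirement of step 5. `scores` maps modality name -> match score."""
    required = ("voiceprint", "palmprint", "palmvein")
    if all(scores.get(m, 0.0) >= threshold for m in required):
        return "authentication successful"
    return "authentication failed"

print(authenticate({"voiceprint": 0.93, "palmprint": 0.88, "palmvein": 0.91}))
# -> authentication successful
print(authenticate({"voiceprint": 0.93, "palmprint": 0.42, "palmvein": 0.91}))
# -> authentication failed (one modality below threshold)
```

A missing modality score is treated as a failed match, which is the conservative reading of the patent's requirement that all three modalities match.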
Further, step 2 specifically comprises: for any given input, after entering the network module, the transformation of formula (1) is applied:
U = F_tr(X) (1)
where X is the input picture and U is the extracted feature map.
The SE module compresses the global spatial information into a channel descriptor that contains the global distribution of the feature responses along the channel dimension; a global average pooling layer is used to obtain the channel-wise statistics. The statistic z is obtained by compressing U, of spatial dimensions H × W, through formula (2):
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j) (2)
The transform output U can be interpreted as a collection of local descriptors whose statistics express the whole image.
The aggregated information obtained by the compression operation is used to fully capture the channel-wise dependencies. A simple gating mechanism with a sigmoid activation function is chosen, formula (3):
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z)) (3)
To limit model complexity and aid generalization, the gating mechanism is parameterized by two fully connected (FC) layers around the non-linearity, forming a bottleneck structure. The final output of the block is obtained by rescaling the transform output U with the activations, formula (4):
x̃_c = F_scale(u_c, s_c) = s_c · u_c (4)
where F_scale(u_c, s_c) denotes the channel-wise product of the feature map u_c and the scalar s_c. The role of the activations s is to assign a weight to each channel based on the descriptor z of the input features.
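The squeeze, excitation and rescale steps of formulas (2)-(4) can be sketched in a few lines of NumPy; the weight shapes and the reduction ratio below are illustrative stand-ins, not parameters taken from the patent.

```python
import numpy as np

def se_block(U, W1, W2):
    """Squeeze-and-Excitation on a feature map U of shape (C, H, W).
    W1: (C//r, C) reduction weights, W2: (C, C//r) expansion weights
    (toy random weights here -- the patent specifies the structure only)."""
    # Squeeze: global average pooling over H x W, formula (2)
    z = U.mean(axis=(1, 2))                                   # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid, formula (3)
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))  # shape (C,)
    # Scale: channel-wise reweighting, formula (4)
    return U * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                     # r is the bottleneck reduction ratio
U = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
out = se_block(U, W1, W2)
print(out.shape)  # (8, 4, 4): same shape, each channel rescaled by s_c
```

Because s comes out of a sigmoid, every channel is multiplied by a weight strictly between 0 and 1, which is exactly the recalibration described above.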
Further, step 3 is specifically as follows:
In the factorized bilinear model, each feature pair is considered through a bilinear transformation, formula (5):
z_i = x^T W_i y + b_i (5)
where x ∈ R^n and y ∈ R^m are the input feature vectors from the hand and audio modalities, W_i is a weight matrix, and b_i is the bias.
The weight matrix W_i is decomposed into two low-rank matrices, W_i = U_i V_i^T, where U_i ∈ R^{n×d} and V_i ∈ R^{m×d}, with the constraint d ≤ min(n, m) on the dimension d. Formula (5) can then be rewritten as:
z_i = x^T U_i V_i^T y + b_i (6)
capturing the inherent correlation between two heterogeneous modes, equation (7):
To obtain the output feature vector z, two third-order tensors are required: U = [U_1, …, U_o] ∈ R^{n×d×o} and V = [V_1, …, V_o] ∈ R^{m×d×o}. Using a linear projection P ∈ R^{d×o} instead of the all-one column vector, the vector z is represented as:
z = P^T (U^T x ∘ V^T y) + b (8)
where b ∈ R^o is the bias vector.
A non-linear activation function is added after each linear mapping, and the vector z is further represented as:
z = P^T (σ(U^T x) ∘ σ(V^T y)) + b (9)
where σ denotes any non-linear activation function, and x and y denote the hand attention vector and the audio feature vector respectively; the values of x are all greater than 0, and y lies in the range [−1, 1].
A Relu function is further added to normalize the output of the network, and the final vector z can be expressed as:
and inputting paired audio and hand features into the fusion model, and outputting a final result on the full connection layer through softmax.
Further, step 4 is specifically as follows:
the improved FAST algorithm is adopted, and the specific improvement is as follows: taking 24 pixels around one pixel P as a detection template, setting the gray value of the P as IP, setting a threshold value T, and if the gray value of 14 continuous pixels in the 24 pixels is greater than IP + T or less than IP-T, then P is an angular point.
And (3) optimizing the characteristic points by using a Shi-Tomasi algorithm, wherein the Shi-Tomasi algorithm compares the smaller one of the two characteristic values with a given minimum threshold value, and if the smaller one of the two characteristic values is larger than the given minimum threshold value, a strong corner point is obtained.
The Shi-Tomasi algorithm detects corner points by calculating the gray level after the local small window W (x, y) is moved in each direction. Shifting the window u, v to produce a gray scale change E u, v
Where M is a 2 x 2 autocorrelation matrix, calculated from the derivatives of the image
For λ in two features of matrix M max And λ min The analysis is performed in that the corner response function is defined as λ, since the larger uncertainty of curvature depends on the small corner min . Calculating the corner response function of each point by using Shi-Tomasi algorithm for the characteristic points preliminarily extracted by using the improved FAST corner detection algorithmλ min According to λ min And taking the point with the maximum N response values to determine the characteristic point. At least 2 strong boundaries in different directions exist around the screened feature points, and the feature points are easy to identify and stable.
For matching of binary feature description vectors, hamming distance is used as a similarity measure between descriptors. Let two feature vectors of the descriptor be F1, F2, then the hamming distance of F1, F2 is:
and judging whether the feature vectors are matched or not by determining the threshold value of the Hamming distance.
Further, step 5 is specifically as follows:
The joint discriminative sparse coding algorithm is as follows: given the feature matrices X, Y and Z of the three modalities, three projection matrices P_x, P_y and P_z are learned jointly, mapping the three modality features to sparse matrices V_x ∈ R^{d×N}, V_y ∈ R^{d×N} and V_z ∈ R^{d×N} that accurately approximate the original matrices X, Y, Z, i.e. V_x ≈ P_x X, V_y ≈ P_y Y, V_z ≈ P_z Z.
After the feature representations V_x, V_y and V_z are obtained for the three modalities, they are quantized as
C_x = sgn(V_x);  C_y = sgn(V_y);  C_z = sgn(V_z)    (13)
where sgn(·) is the element-wise sign function yielding sparse binary codes, C_x (likewise C_y and C_z) = [c_1, c_2, …, c_N] ∈ R^{l×N}, c_i ∈ {0,1}^l denotes the learned sparse binary code of the i-th class, and l (l = 1, 2, …, 12) is the length of the binary code.
Sparsity constraints are applied to the projected feature representations V_x and V_y via the two projection matrices to reduce the projection error. Using the Frobenius norm as the cost function, the projection error can be expressed as
where a, b > 0 and a + b ∈ (0, 1) are trade-off parameters balancing the three different modalities.
Two constraints are imposed on the projected sparse features: 1) for the intra-modality samples of each modality, the intra-class distance is minimized and the inter-class distance is maximized; 2) for intra-class samples, the information correlation between feature points is maximized, and thus the distance is minimized. These constraints give the projected sparse features stronger discriminability and compactness.
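The sign quantization of Equation (13) can be sketched in a few lines of Python (a minimal illustration, not the patented implementation; the threshold-at-zero convention is an assumption, chosen so that sgn yields codes in {0, 1} as the text requires):

```python
def sgn_quantize(V):
    """Element-wise sign quantization of a projected feature matrix,
    in the spirit of Eq. (13): positive entries map to 1, all others
    to 0, yielding a sparse binary code."""
    return [[1 if v > 0 else 0 for v in row] for row in V]

# toy projected features for one modality
Vx = [[0.7, -0.2, 0.0],
      [-1.3, 2.1, 0.4]]
Cx = sgn_quantize(Vx)   # [[1, 0, 0], [0, 1, 1]]
```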
Compared with the prior art, the invention has the advantages that:
(1) The voiceprint, palm print and palm vein features are collected in a non-contact manner, which improves the safety of authentication and suits scenarios with strict hygiene requirements, such as during epidemics.
(2) Feature extraction uses deep learning, which reduces the complexity of manual feature engineering, strengthens resistance to noise interference, and improves the robustness and portability of the system.
(3) Voiceprint recognition is combined with palm print and palm vein recognition, and the three modal features are fused for identity authentication, improving the security, accuracy and robustness of authentication.
Drawings
FIG. 1 is a diagram of a multi-modal identification system architecture in accordance with an embodiment of the present invention;
FIG. 2 is a flow diagram of the operation of a multimodal identity recognition system in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a SE-ResNet network architecture in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a multi-layer feature fusion model according to an embodiment of the present invention;
fig. 5 is a feature matching flow chart of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.
The voice and hand multi-modal information acquisition device is an important device for human identity recognition; its acquisition principle is shown in Fig. 1. The voice and hand multi-modal identity recognition designed in this system is realized by placing a human hand under an infrared LED light source, collecting the palm print and palm vein features of the human body with a CCD device, extracting voice information with MFCC features, and comparing the extracted voice and hand multi-modal features against the enrolled verification templates.
The overall architecture design of the system is shown in fig. 1, and mainly comprises a hardware part and a software part. The hardware part is mainly used for collecting multi-mode voice and hand characteristic information, and the software part is mainly used for multi-mode information processing and recognition. The system flow chart is as shown in fig. 2, and the hardware part specifically comprises a power supply module, a fixed wavelength infrared LED light source module, an image acquisition CCD module, a voice acquisition module and a storage module; the software part comprises image preprocessing, a feature extraction algorithm, feature fusion comparison and a user interaction interface.
The feature extraction is divided into two parts, wherein the first part is used for extracting voice features, and the second part is used for extracting two hand features of a palm print and a palm vein. The invention adopts ResNet as a main structure, and introduces an SE module on the basis to construct an SE-ResNet network structure, as shown in figure 3. And generating feature distribution by adding a global pooling layer, and finishing the extraction of information codes according to the feature distribution. In order to obtain the correlation between channels, a ReLU activation function and a sigmoid gate control mechanism are combined to complete the recalibration of the characteristics. In addition, in order to simplify the complexity of the model parameters, 1 × 1 full connection layers are also used at both ends of the ReLU function.
The SE (Squeeze-and-Excitation Networks) module is a computational unit that can be built upon any given transformation; for any given input, after entering the network module it performs the transformation shown in (1):
x is the input picture and U is the extracted feature. To make information from the network's global receptive field available to the lower layers, SE compresses the global spatial information into a channel descriptor, which contains the global distribution of the feature response along the channel dimension; statistics over the channel dimension are obtained with the global average pooling layer. The statistic z_c is derived by compressing U over its spatial dimensions H × W via (2):
The transformation output U can be interpreted as a set of local descriptors whose statistics can represent the entire image. In order to exploit the information aggregated by the compression operation, the next goal is to fully capture the dependencies along the channel dimension. A simple gating mechanism with a sigmoid activation function is chosen, Equation (3):
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (3)
where δ denotes the ReLU activation function and W_1 and W_2 are the weights of the two fully connected layers. To limit the complexity of the model and to aid its generalization, the gating mechanism is parameterized as a bottleneck structure built from two fully connected layers (FC) around the nonlinearity: a dimensionality-reduction layer with parameters W_1 that reduces the parameter count by a factor of r, a ReLU activation function, and a dimensionality-restoring layer with parameters W_2. The final output of the block is obtained by rescaling the transform output U using the activation function, Equation (4):
In the formula, F_scale(u_c, s_c) denotes the channel-wise product of the feature map u_c and the scalar s_c. The role of this activation is to assign a weight to each channel based on the descriptor z of the input feature.
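The squeeze-excitation-scale pipeline of Equations (2)-(4) can be sketched with NumPy as follows (a minimal illustration: the random placeholder weights, toy channel count, and reduction ratio r = 2 are assumptions for the example, not values from the patent):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(U, W1, W2):
    """Recalibrate a feature map U of shape (C, H, W).
    Squeeze (Eq. 2): global average pooling -> channel descriptor z.
    Excitation (Eq. 3): s = sigmoid(W2 . relu(W1 . z)).
    Scale (Eq. 4): multiply each channel of U by its weight s_c."""
    z = U.mean(axis=(1, 2))                    # channel descriptor, shape (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # channel weights in (0, 1)
    return U * s[:, None, None]

# toy example: C = 4 channels, bottleneck reduction r = 2
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 8, 8))
W1 = rng.standard_normal((2, 4))   # dimensionality-reduction FC
W2 = rng.standard_normal((4, 2))   # dimensionality-restoring FC
out = se_block(U, W1, W2)
```

Each output channel is the corresponding input channel scaled by a single weight in (0, 1), which is exactly the channel-wise product F_scale(u_c, s_c).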
Technical route and implementation scheme of multi-layer feature fusion mechanism
In general, concatenation or element-wise summation is the most common scheme for heterogeneous feature fusion. Since the distributions of audio and hand features typically differ widely and their feature dimensions often differ in size, the representational capacity of these simple fusion schemes may be insufficient for reliable speaker-naming performance. Fusion with a factorized bilinear model (FBM) better captures the interaction between the two different modalities and generally outperforms simple fusion methods (e.g., concatenation), as shown in Fig. 4.
The factorized bilinear model considers each feature pair through a linear transformation:
Z_i = x^T W_i y + b_i    (5)
where x ∈ R^n and y ∈ R^m are input feature vectors from two different modalities (e.g., high-level hand and audio features), W_i is a weight matrix, and b_i is a bias. Although a bilinear model can capture the pairwise interrelationship between two modalities, it typically introduces a large number of parameters, which increases computational cost. An effective remedy is to factorize the weight matrix W_i into two low-rank matrices, W_i = U_i V_i^T, where U_i ∈ R^{n×d} and V_i ∈ R^{m×d}, with the constraint d ≤ min(n, m) applied to the dimension d. Equation (5) can therefore be rewritten as:
Z_i = x^T U_i V_i^T y + b_i    (6)
In general, the first term on the right-hand side can be further transformed with a Hadamard (element-wise) product to capture the inherent correlation between the two heterogeneous modalities:
where 1 ∈ R^d is an all-ones column vector and ∘ denotes the Hadamard (element-wise) product. To obtain the output feature vector z, two third-order tensors are required: U = [U_1, …, U_o] ∈ R^{n×d×o} and V = [V_1, …, V_o] ∈ R^{m×d×o}. Using a linear projection P ∈ R^{d×o} in place of the all-ones column vector, the vector z can be expressed as:
where b ∈ R^o is a bias vector. Applying a nonlinear activation function generally helps increase the representational capacity of the bilinear model, so one is added after each linear mapping, and the vector z can be further expressed as:
where σ denotes any nonlinear activation function, such as ReLU, sigmoid or tanh. Assuming x and y represent the hand attention vector and the audio feature vector, respectively, every element of x is greater than 0 and y lies in the range [−1, 1]. To avoid information loss, different nonlinear activation functions may be used to map values to a finite interval. Because element-wise multiplication is introduced to capture the correlation between the two modalities, the magnitudes of the output neurons may vary greatly. To reduce the impact of this variation, a ReLU function is further added to normalize the output of the network, and the final vector z can be expressed as:
During training, the fusion parameters of the FBM can be updated and optimized by back-propagation. Paired audio and hand features are input into the fusion model, and the final result is output through softmax at the fully connected layer.
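The factorized fusion described above can be sketched with NumPy (a minimal illustration with random placeholder tensors; the choice of tanh as the per-mapping nonlinearity σ and the toy dimensions n, m, d, o are assumptions, not values from the patent):

```python
import numpy as np

def fbm_fuse(x, y, U, V, P, b, sigma=np.tanh):
    """Factorized bilinear fusion of two modality vectors.
    x: (n,) hand feature; y: (m,) audio feature;
    U: (n, d, o) and V: (m, d, o) low-rank factor tensors;
    P: (d, o) linear projection replacing the all-ones vector;
    b: (o,) bias; sigma: nonlinearity after each linear mapping."""
    xu = sigma(np.einsum('n,ndo->do', x, U))    # sigma(U_i^T x), all i at once
    yv = sigma(np.einsum('m,mdo->do', y, V))    # sigma(V_i^T y)
    z = np.einsum('do,do->o', P, xu * yv) + b   # project the Hadamard product
    return np.maximum(z, 0.0)                   # final ReLU normalization

rng = np.random.default_rng(1)
n, m, d, o = 6, 5, 3, 4
z = fbm_fuse(rng.standard_normal(n), rng.standard_normal(m),
             rng.standard_normal((n, d, o)), rng.standard_normal((m, d, o)),
             rng.standard_normal((d, o)), rng.standard_normal(o))
```

In a full system, z would feed a fully connected layer with softmax, and U, V, P, b would be trained by back-propagation rather than drawn at random.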
Technical route and implementation of feature matching
The FAST algorithm is currently one of the fastest corner detection algorithms, but it can falsely detect some edge points, producing spurious corners. To eliminate the interference of edge points on the detection result, the invention adopts an improved FAST algorithm with the following specific improvement: the 24 pixels surrounding a pixel P are taken as the detection template; let the gray value of P be I_P and set a threshold T; if 14 consecutive pixels among the 24 have gray values greater than I_P + T or less than I_P − T, then P is a corner. The invention uses the Shi-Tomasi algorithm to refine the feature points: the algorithm compares the smaller of the two eigenvalues with a given minimum threshold, and if the smaller eigenvalue exceeds this threshold, a strong corner is obtained.
The Shi-Tomasi algorithm detects corners by computing the change in gray level when a local window W(x, y) is shifted in every direction. Shifting the window by (u, v) produces a gray-level change E(u, v):
where M is a 2 × 2 autocorrelation matrix that can be computed from the image derivatives:
The two eigenvalues λ_max and λ_min of the matrix M are analyzed; since the larger curvature uncertainty depends on the smaller eigenvalue, the corner response function is defined as λ_min. For the feature points preliminarily extracted by the improved FAST corner detection algorithm, the Shi-Tomasi algorithm computes the corner response λ_min at each point, and the first N points with the largest response values are taken as the feature points. At least 2 strong boundaries in different directions exist around each retained feature point, and the feature points are easy to identify and stable.
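The λ_min response can be sketched directly from this definition (a minimal NumPy illustration; the window size and the synthetic step-corner test image are assumptions, and a production detector would additionally smooth the gradients):

```python
import numpy as np

def shi_tomasi_response(img, x, y, win=3):
    """Shi-Tomasi response at pixel (x, y): the smaller eigenvalue
    lambda_min of the 2x2 autocorrelation matrix M accumulated from
    image gradients inside a (2*win+1)-sized local window."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, cols
    ys = slice(max(y - win, 0), y + win + 1)
    xs = slice(max(x - win, 0), x + win + 1)
    ix, iy = gx[ys, xs].ravel(), gy[ys, xs].ravel()
    M = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])        # autocorrelation matrix
    return float(np.linalg.eigvalsh(M)[0])    # eigenvalues sorted ascending

# a synthetic step corner: the response is large at the corner
# and zero in the flat region
img = np.zeros((20, 20))
img[10:, 10:] = 1.0
```

Keeping the N points with the largest λ_min values then reproduces the selection rule described above.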
For matching binary feature description vectors (as shown in Fig. 5), the Hamming distance is generally used as the similarity measure between descriptors. The Hamming distance is the minimum number of substitutions required to turn one of two equal-length binary strings into the other. Let the two descriptor feature vectors be F1 and F2; the Hamming distance between F1 and F2 is:
Whether the feature vectors match is judged by setting a threshold on the Hamming distance.
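The descriptor-matching rule can be sketched in plain Python (a minimal illustration; the distance threshold of 8 is an assumed example value, to be tuned per application):

```python
def hamming_distance(f1, f2):
    """Hamming distance between two equal-length binary descriptors,
    given as '0'/'1' strings: the number of differing positions."""
    assert len(f1) == len(f2), "descriptors must have equal length"
    return sum(a != b for a, b in zip(f1, f2))

def descriptors_match(f1, f2, threshold=8):
    """Declare a match when the Hamming distance is below the threshold."""
    return hamming_distance(f1, f2) < threshold

d = hamming_distance("10110100", "10011100")   # differs at positions 2 and 4
```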
The above-described method according to the present invention can be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and to be stored in a local recording medium downloaded through a network, so that the method described herein can be stored in such software processing on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by a computer, processor, or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (6)
1. A multi-modal identification system for voiceprints and palm veins, comprising: the device comprises a power supply module, a fixed wavelength infrared LED light source module, an image acquisition CCD module, a voice acquisition module and a storage module;
a power supply module: for powering the entire multi-modal identity recognition system;
Fixed wavelength infrared LED light source module: the human hand is irradiated by an infrared LED light source to assist the image acquisition CCD module in acquiring the information characteristics of the palm print and the palm vein of the human body;
image acquisition CCD module: collecting the information characteristics of the palm print and the palm vein of the human body;
the voice acquisition module: extracting voice information by using MFCC characteristics;
a storage module: the device is used for storing data acquired by the voice acquisition module and the image acquisition CCD module;
the multi-modal identity recognition module: and preprocessing the picture, extracting picture characteristics, fusing and comparing the characteristics and outputting a result.
2. A multi-mode identity recognition method for non-contact voiceprints and palm print palm veins is characterized by comprising the following steps:
step 1, image preprocessing; the preprocessing mainly comprises three steps: first, the infrared palm image is denoised with low-pass filtering; second, the image-enhancement part extracts a binary image of the palm region using the Sauvola algorithm; finally, the ROI-positioning part applies a gray-level transformation to the palm print and palm vein so that the palm edge stands out, detects the palm edge with a Canny operator, and crops the image to obtain the palm region of interest;
step 2, feature extraction; the feature extraction is divided into two parts, wherein the first part is used for extracting voice features, and the second part is used for extracting two hand features of a palm print and a palm vein; adopting ResNet as a main body structure, introducing an SE module, constructing an SE-ResNet network structure, inputting a preprocessed picture into the SE-ResNet network structure, generating feature distribution by adding a global pooling layer, and finishing the extraction of information codes; in order to obtain the correlation among channels, a ReLU activation function and a sigmoid gate control mechanism are combined to complete the recalibration of the characteristics;
step 3, feature fusion; a multi-layer characteristic fusion mechanism is adopted, the bilinear models are decomposed for fusion to obtain interaction between different modes of hands and audio, paired audio and hand characteristics are input into the fusion model, and a final result is output on a full connection layer through softmax;
step 4, feature comparison; for the feature points preliminarily extracted by the improved FAST corner detection algorithm, the Shi-Tomasi algorithm computes the corner response function at each point, and the first N points with the largest response values are taken as the feature points; at least 2 strong boundaries in different directions exist around each retained feature point; for matching binary feature description vectors, the Hamming distance is used as the similarity measure between descriptors;
step 5, interactive output; a joint discriminative sparse coding algorithm judges the intra-modality sample feature points of the three modalities so that the intra-class distance is minimized and the inter-class distance is maximized; a suitable threshold is set according to the actual scene requirements; if the two matched samples belong to the same class and the voiceprint, palm print and palm vein all match successfully, the interface shows that authentication succeeded; otherwise it prompts that authentication failed.
3. The multimodal identification method of claim 2 wherein: the step 2 specifically comprises the following steps: for any given information, after entering the network module, the conversion is performed as shown in formula (1):
x is the input picture and U is the extracted feature;
the SE compresses the global spatial information into a channel descriptor; the channel descriptor contains the global distribution of the feature response along the channel dimension, and statistics over the channel dimension are obtained with the global average pooling layer; the statistic z_c is obtained by compressing U over its spatial dimensions H × W via equation (2):
the transformation output U is interpreted as a set of local descriptors, and the statistics of the channel descriptor can express the whole image;
the dependencies along the channel dimension are fully captured using the aggregated information obtained by the compression operation; a simple gating mechanism with a sigmoid activation function is chosen, equation (3):
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (3)
in order to limit the complexity of the model and to help its generalization, the gating mechanism is parameterized by forming a bottleneck structure from two fully connected layers (FC) around the nonlinearity, and the final output of the block is obtained by rescaling the transform output U using the activation function, equation (4):
in the formula, F_scale(u_c, s_c) denotes the channel-wise product of the feature map u_c and the scalar s_c; the role of this activation is to assign a weight to each channel based on the descriptor z of the input feature.
4. The multimodal identification method of claim 2 wherein: the step 3 is as follows:
the factorized bilinear model considers each feature pair through a linear transformation:
Z_i = x^T W_i y + b_i    (5)
where x ∈ R^n and y ∈ R^m are input feature vectors from the hand and audio modalities, W_i is a weight matrix, and b_i is a bias;
the weight matrix W_i is factorized into two low-rank matrices, W_i = U_i V_i^T, where U_i ∈ R^{n×d} and V_i ∈ R^{m×d}, with the constraint d ≤ min(n, m) applied to the dimension d; equation (5) can be further rewritten as:
Z_i = x^T U_i V_i^T y + b_i    (6)
the inherent correlation between the two heterogeneous modalities is captured by equation (7):
to obtain the output feature vector z, two third-order tensors are required: U = [U_1, …, U_o] ∈ R^{n×d×o} and V = [V_1, …, V_o] ∈ R^{m×d×o}; using a linear projection P ∈ R^{d×o} in place of the all-ones column vector, the vector z is expressed as:
where b ∈ R^o is a bias vector;
a nonlinear activation function is added after each linear mapping, and the vector z is further expressed as:
where σ denotes any nonlinear activation function, x and y denote the hand attention vector and the audio feature vector, respectively, every element of x is greater than 0, and y lies in the range [−1, 1];
a ReLU function is further added to normalize the output of the network, and the final vector z can be expressed as:
paired audio and hand features are input into the fusion model, and the final result is output through softmax at the fully connected layer.
5. The multimodal identification method of claim 2 wherein: the step 4 is as follows:
an improved FAST algorithm is adopted, with the following specific improvement: the 24 pixels surrounding a pixel P are taken as the detection template; let the gray value of P be I_P and set a threshold T; if 14 consecutive pixels among the 24 have gray values greater than I_P + T or less than I_P − T, then P is a corner;
the Shi-Tomasi algorithm is used to refine the feature points; it compares the smaller of the two eigenvalues with a given minimum threshold, and if the smaller eigenvalue exceeds this threshold, a strong corner is obtained;
the Shi-Tomasi algorithm detects corners by computing the change in gray level when a local window W(x, y) is shifted in every direction; shifting the window by (u, v) produces a gray-level change E(u, v):
where M is a 2 × 2 autocorrelation matrix computed from the image derivatives:
the two eigenvalues λ_max and λ_min of the matrix M are analyzed; since the larger curvature uncertainty depends on the smaller eigenvalue, the corner response function is defined as λ_min; for the feature points preliminarily extracted by the improved FAST corner detection algorithm, the Shi-Tomasi algorithm computes the corner response λ_min at each point, and the first N points with the largest response values are taken as the feature points; at least 2 strong boundaries in different directions exist around each retained feature point, and the feature points are easy to identify and stable;
for matching binary feature description vectors, the Hamming distance is used as the similarity measure between descriptors; let the two descriptor feature vectors be F1 and F2; the Hamming distance between F1 and F2 is:
whether the feature vectors match is judged by setting a threshold on the Hamming distance.
6. The multimodal identification method of claim 2 wherein: the step 5 is as follows:
the joint discriminative sparse coding algorithm is as follows: given the feature matrices X, Y and Z of the three modalities, three projection matrices P_x, P_y and P_z are learned jointly, mapping the three modality features to sparse matrices V_x ∈ R^{d×N}, V_y ∈ R^{d×N} and V_z ∈ R^{d×N} that accurately approximate the original matrices X, Y, Z, i.e. V_x ≈ P_x X, V_y ≈ P_y Y, V_z ≈ P_z Z;
after the feature representations V_x, V_y and V_z are obtained for the three modalities, they are quantized as
C_x = sgn(V_x);  C_y = sgn(V_y);  C_z = sgn(V_z);    (13)
where sgn(·) is the element-wise sign function yielding sparse binary codes, C_x (likewise C_y and C_z) = [c_1, c_2, …, c_N] ∈ R^{l×N}, c_i ∈ {0,1}^l denotes the learned sparse binary code of the i-th class, and l (l = 1, 2, …, 12) is the length of the binary code;
sparsity constraints are applied to the projected feature representations V_x and V_y via the two projection matrices to reduce the projection error; using the Frobenius norm as the cost function, the projection error can be expressed as
where a, b > 0 and a + b ∈ (0, 1) are trade-off parameters balancing the three different modalities;
two constraints are imposed on the projected sparse features: 1) for the intra-modality samples of each modality, the intra-class distance is minimized and the inter-class distance is maximized; 2) for intra-class samples, the information correlation between feature points is maximized, and thus the distance is minimized; these constraints give the projected sparse features stronger discriminability and compactness.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210927661.3A CN115188084A (en) | 2022-08-03 | 2022-08-03 | Multi-mode identity recognition system and method for non-contact voiceprint and palm print palm vein |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210927661.3A CN115188084A (en) | 2022-08-03 | 2022-08-03 | Multi-mode identity recognition system and method for non-contact voiceprint and palm print palm vein |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115188084A true CN115188084A (en) | 2022-10-14 |
Family
ID=83521810
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210927661.3A Pending CN115188084A (en) | 2022-08-03 | 2022-08-03 | Multi-mode identity recognition system and method for non-contact voiceprint and palm print palm vein |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115188084A (en) |
-
2022
- 2022-08-03 CN CN202210927661.3A patent/CN115188084A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639560A (en) * | 2020-05-15 | 2020-09-08 | 圣点世纪科技股份有限公司 | Finger vein feature extraction method and device based on dynamic fusion of vein skeleton line and topographic relief characteristic |
CN116504226A (en) * | 2023-02-27 | 2023-07-28 | 佛山科学技术学院 | Lightweight single-channel voiceprint recognition method and system based on deep learning |
CN116504226B (en) * | 2023-02-27 | 2024-01-02 | 佛山科学技术学院 | Lightweight single-channel voiceprint recognition method and system based on deep learning |
CN116580444A (en) * | 2023-07-14 | 2023-08-11 | 广州思林杰科技股份有限公司 | Method and equipment for testing long-distance running timing based on multi-antenna radio frequency identification technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xin et al. | Multimodal feature-level fusion for biometrics identification system on IoMT platform | |
CN115188084A (en) | Multi-mode identity recognition system and method for non-contact voiceprint and palm print palm vein | |
CN108416338B (en) | Non-contact palm print identity authentication method | |
Barpanda et al. | Iris recognition with tunable filter bank based feature | |
KR102483650B1 (en) | User verification device and method | |
Malgheet et al. | Iris recognition development techniques: a comprehensive review | |
Hou et al. | Finger-vein biometric recognition: A review | |
Doublet et al. | Robust grayscale distribution estimation for contactless palmprint recognition | |
Stojanović et al. | Latent overlapped fingerprint separation: a review | |
CN112232163A (en) | Fingerprint acquisition method and device, fingerprint comparison method and device, and equipment | |
Stojanović et al. | A novel neural network based approach to latent overlapped fingerprints separation | |
Yang et al. | A Face Detection Method Based on Skin Color Model and Improved AdaBoost Algorithm. | |
Rajasekar et al. | Efficient multimodal biometric recognition for secure authentication based on deep learning approach | |
CN113657498B (en) | Biological feature extraction method, training method, authentication method, device and equipment | |
Mehmood et al. | Palmprint enhancement network (PEN) for robust identification | |
Prabu et al. | A novel biometric system for person recognition using palm vein images | |
CN112232152B (en) | Non-contact fingerprint identification method and device, terminal and storage medium | |
CN111428670B (en) | Face detection method, face detection device, storage medium and equipment | |
Arora et al. | Sp-net: One shot fingerprint singular-point detector | |
Dahea et al. | An Efficient Feature Selection scheme based on Genetic Algorithm for Finger Vein Recognition | |
AlShemmary et al. | Siamese Network-Based Palm Print Recognition | |
Santosh et al. | Recent Trends in Image Processing and Pattern Recognition: Third International Conference, RTIP2R 2020, Aurangabad, India, January 3–4, 2020, Revised Selected Papers, Part I | |
Hariprasath et al. | Bimodal biometric pattern recognition system based on fusion of iris and palmprint using multi-resolution approach | |
CN117688365B (en) | Multi-mode biological identification access control system | |
Gao et al. | On Designing a SwinIris Transformer Based Iris Recognition System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |