CN110929239A - Terminal unlocking method based on lip language instruction - Google Patents

Terminal unlocking method based on lip language instruction Download PDF

Info

Publication number
CN110929239A
CN110929239A
Authority
CN
China
Prior art keywords
lip
image
representing
frame
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911045860.6A
Other languages
Chinese (zh)
Other versions
CN110929239B (en)
Inventor
兰星
胡庆浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Artificial Intelligence Chip Innovation Institute Institute Of Automation Chinese Academy Of Sciences
Institute of Automation of Chinese Academy of Science
Original Assignee
Nanjing Artificial Intelligence Chip Innovation Institute Institute Of Automation Chinese Academy Of Sciences
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Artificial Intelligence Chip Innovation Institute Institute Of Automation Chinese Academy Of Sciences, Institute of Automation of Chinese Academy of Science filed Critical Nanjing Artificial Intelligence Chip Innovation Institute Institute Of Automation Chinese Academy Of Sciences
Priority to CN201911045860.6A priority Critical patent/CN110929239B/en
Publication of CN110929239A publication Critical patent/CN110929239A/en
Application granted granted Critical
Publication of CN110929239B publication Critical patent/CN110929239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a terminal unlocking method based on a lip language instruction. During verification, the key feature points required for face recognition are extracted in the same way as during collection, the Euclidean distance between face features is calculated with the FaceNet network, and the result is compared against a threshold. The user can design the instruction action freely during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and authentication security is improved. Meanwhile, the lip language instruction unlocking method does not require large-scale computation on the terminal, which greatly reduces the hardware performance requirements and increases the recognition speed. The invention avoids the problem of an excessively large gradient caused by accumulation in a certain quadrant of the space, improves network learning and training efficiency, achieves the effect of actively learning the training model, and solves the problem that traditional fixed instruction actions are easily exposed.

Description

Terminal unlocking method based on lip language instruction
Technical Field
The invention relates to a terminal unlocking method based on a lip language instruction, and belongs to the technical field of image information processing.
Background
At present, terminals are mainly unlocked by face, fingerprint or iris recognition. However, such information is easy to forge and these static identification methods are easy to crack, so security is poor and private information is easily leaked. The invention adopts a lip language instruction unlocking method to realize dynamic unlocking and improve authentication security.
Existing lip language unlocking technology depends heavily on deep learning: a specific single-instruction model must be trained on a PC (personal computer) and then deployed on the terminal, and the user must reproduce a fixed instruction action. This approach performs poorly, does not adapt to the user's own data, supports only fixed command actions, and the commands are easily exposed.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the existing unlocking technology, a terminal unlocking method based on a lip language instruction is provided.
The technical scheme is as follows: a terminal unlocking method based on a lip language instruction comprises the following steps:
step 1, the terminal camera collects video frames of the user's unlocking lip language instruction, the terminal performs face detection and extracts face features, and lip region video frames are extracted at the same time;
step 2, extracting characteristic points of the lip video frame data set, matching the characteristic points of adjacent frames and marking position coordinates;
step 3, extracting the change characteristics of the positions of the characteristic points by using a frame difference method, namely the algebraic characteristics of the lip movement;
step 4, matching the human face in a database;
step 5, if matching is successful, the person to be identified makes the same lip language instruction action towards the terminal camera; the terminal likewise extracts lip feature points, calculates the algebraic features of lip movement, and judges whether they match the unlocking instruction;
and step 6, when face matching or instruction matching is unsuccessful, a matching failure is prompted and the process jumps to step 4.
In a further embodiment, the step 1 is further:
step 1-1, calculating a color histogram of an RGB space of each frame of a video segment, dividing each channel into 32 intervals according to pixel values, and carrying out normalization processing to obtain 96-dimensional features; forming a matrix by the characteristic vectors of each frame, performing dimensionality reduction on the matrix, and calculating an initialization clustering center:
[formula image BDA0002254119410000011 in the original, not reproduced]
In the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_{n+1} represents the feature vector of the (n+1)-th frame.
The similarity of each new frame to the current cluster center is calculated and a threshold σ is defined. When the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n; f_n is then added to C_n and the updated cluster center C_n′ is obtained:
[formula image BDA0002254119410000021 in the original, not reproduced]
In the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center.
When the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
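As a rough illustration of this keyframe-clustering step, the following Python sketch (OpenCV and NumPy) builds the 96-dimensional histogram feature and groups frames around a running cluster center. The cosine-similarity measure and the running-mean update are assumptions of this sketch; the patent's own initialization and update formulas are given only as images above.

import cv2
import numpy as np

def frame_histogram(frame_bgr):
    """96-dim feature: 32-bin histogram per B, G, R channel, normalized."""
    chans = cv2.split(frame_bgr)
    hist = [cv2.calcHist([c], [0], None, [32], [0, 256]).ravel() for c in chans]
    feat = np.concatenate(hist)
    return feat / (feat.sum() + 1e-9)

def cluster_keyframes(frames, sigma=0.9):
    """Group consecutive frames whose histograms stay close to the running
    cluster center; start a new cluster (new segment) otherwise."""
    centers, members = [], []
    for f in frames:
        feat = frame_histogram(f)
        if centers:
            c = centers[-1]
            sim = float(np.dot(feat, c) /
                        (np.linalg.norm(feat) * np.linalg.norm(c) + 1e-9))
            if sim > sigma:                                  # f_n belongs to the current center C_n
                members[-1].append(feat)
                centers[-1] = np.mean(members[-1], axis=0)   # updated center C_n'
                continue
        centers.append(feat)                                 # C_n' = f_n: start a new cluster
        members.append([feat])
    return centers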
Step 1-2, firstly, recognizing the contour of a human face, removing a background, carrying out lip cutting on the human face in a video frame, positioning the position of facial feature contour points in the human face, including the coordinates of a nose tip, the leftmost coordinates of the lips, the rightmost coordinates of the lips and the coordinates of a central point of a mouth, cutting an image containing lip details according to the coordinates, and calculating the cutting size according to a formula:
[formula image BDA0002254119410000022 in the original, not reproduced]
In the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
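The lip crop can be sketched from the four landmark coordinates named above. Since the patent's crop-size formula is shown only as an image, the proportions used below (width_margin, height_ratio) are illustrative assumptions, not the patented formula.

import numpy as np

def crop_lip_region(image, nose_tip, mouth_left, mouth_right, mouth_center,
                    width_margin=1.4, height_ratio=0.8):
    """Cut a patch containing the lip details from facial landmark coordinates.
    The proportions are illustrative choices, not the patent's formula."""
    l_mn = np.linalg.norm(np.asarray(nose_tip, float) - np.asarray(mouth_center, float))
    half_w = 0.5 * width_margin * abs(mouth_right[0] - mouth_left[0])
    half_h = 0.5 * height_ratio * l_mn
    cx, cy = mouth_center
    h, w = image.shape[:2]
    x0, x1 = int(max(0, cx - half_w)), int(min(w, cx + half_w))
    y0, y1 = int(max(0, cy - half_h)), int(min(h, cy + half_h))
    return image[y0:y1, x0:x1]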
step 1-3, performing deviation correction on the cut lip image, training the lip image based on a binary model of a convolutional neural network, and judging whether the extracted lip image is an effective image:
[formula image BDA0002254119410000023 in the original, not reproduced]
where l denotes the number of convolution layers, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the local perception field of the input, β denotes the output parameter, and down(·) denotes the pooling function.
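A small binary classifier of this kind can be sketched in PyTorch. The layer sizes and the 64x64 input below are assumptions; the patent only specifies a generic convolution plus pooling ("down()") structure for judging whether a cropped lip image is valid.

import torch
import torch.nn as nn

class LipValidityNet(nn.Module):
    """Tiny CNN that classifies a cropped lip image as valid (1) or invalid (0)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2)
        )

    def forward(self, x):
        # x: (N, 3, H, W) batch of cropped lip images
        return self.classifier(self.features(x))

# usage sketch: logits = LipValidityNet()(torch.randn(1, 3, 64, 64))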
In a further embodiment, the step 2 is further:
step 2-1, aiming at the cropped images extracted in the step 1, a D3D model is constructed to accelerate network convergence, and a loss function correction model is introduced:
[formula image BDA0002254119410000031 in the original, not reproduced]
In the formula, the first term (shown as image BDA0002254119410000032 in the original) is the cross-entropy loss, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor;
where P({Z|X}) = Σ_k P(π|X) is the sum of the probabilities formed by all paths after merging;
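The corrected loss itself appears only as an image; the description names a cross-entropy term, an indicator function and a sum of path probabilities P({Z|X}) = Σ P(π|X), which resembles a CTC-style term. Purely as an assumption, one way such terms could be combined in PyTorch is sketched below; the weighting sigma and the blank index are placeholders.

import torch
import torch.nn as nn

class CombinedLipLoss(nn.Module):
    """Illustrative mix of a cross-entropy term and a CTC-style
    path-probability term; not the patent's exact corrected loss."""
    def __init__(self, sigma=0.5, blank=0):
        super().__init__()
        self.sigma = sigma
        self.ce = nn.CrossEntropyLoss()
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)

    def forward(self, class_logits, class_targets,
                seq_log_probs, seq_targets, input_lengths, target_lengths):
        # class_logits: (N, C); seq_log_probs: (T, N, C) log-softmax outputs
        ce_term = self.ce(class_logits, class_targets)
        ctc_term = self.ctc(seq_log_probs, seq_targets, input_lengths, target_lengths)
        return self.sigma * ce_term + (1.0 - self.sigma) * ctc_term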
step 2-2, respectively extracting feature points from the images of two adjacent frames and obtaining two sets of feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
Taking the two adjacent sets of feature points as centers, the pixel values within a neighborhood window W around each feature point are used as that point's descriptor, and the pixel interpolation over the two sets of feature-point neighborhoods is calculated:
[formula image BDA0002254119410000033 in the original, not reproduced]
In the formula, S represents the pixel interpolation over the neighborhoods of the two sets of feature points, x represents the abscissa of a pixel, y represents the ordinate of the pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
step 2-3, according to the pixel interpolation obtained in the step 2-2, finding a matching point according to a matching coefficient between the feature point and a neighborhood window:
[formula image BDA0002254119410000034 in the original, not reproduced]
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
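A compact sketch of steps 2-2 and 2-3: corner points are detected in two adjacent gray frames, a raw pixel window W around each point serves as its descriptor, and candidate points are paired by a normalized cross-correlation score. That score stands in for the patent's matching coefficient C, whose exact formula is shown only as an image.

import cv2
import numpy as np

def window_descriptor(gray, pt, half=5):
    """Raw pixel window W around feature point pt = (x, y), used as its descriptor."""
    x, y = int(pt[0]), int(pt[1])
    patch = gray[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
    return patch.astype(np.float32).ravel()

def match_points(prev_gray, next_gray, max_points=30, half=5):
    """Detect corners in two adjacent frames and match them by normalized
    cross-correlation of their neighborhood windows (assumed form of C)."""
    p = cv2.goodFeaturesToTrack(prev_gray, max_points, 0.01, 5)
    q = cv2.goodFeaturesToTrack(next_gray, max_points, 0.01, 5)
    if p is None or q is None:
        return []
    matches = []
    for pi in p.reshape(-1, 2):
        d1 = window_descriptor(prev_gray, pi, half)
        best, best_c = None, -1.0
        for qi in q.reshape(-1, 2):
            d2 = window_descriptor(next_gray, qi, half)
            if d1.size != d2.size:
                continue
            c = float(np.dot(d1, d2) /
                      (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-9))
            if c > best_c:
                best, best_c = qi, c
        if best is not None:
            matches.append((tuple(pi), tuple(best), best_c))
    return matches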
In a further embodiment, the step 3 is further:
step 3-1, three adjacent independent frames are recorded as f(n+1), f(n) and f(n-1), and the gray values of the three frames are recorded as G(n+1)_{x,y}, G(n)_{x,y} and G(n-1)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}| ∩ |G(n)_{x,y} - G(n-1)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image BDA0002254119410000035 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, and T is the threshold.
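The three-frame difference of step 3 maps directly onto OpenCV primitives, as in the sketch below. The fixed binarization threshold stands in for the patent's comparison condition (involving N, τ and A), which is given only as an image.

import cv2

def three_frame_difference(f_prev, f_curr, f_next, T=25):
    """Frame-difference motion mask P' = |G(n+1)-G(n)| ∩ |G(n)-G(n-1)|,
    binarized with a preset threshold T."""
    g_prev = cv2.cvtColor(f_prev, cv2.COLOR_BGR2GRAY)
    g_curr = cv2.cvtColor(f_curr, cv2.COLOR_BGR2GRAY)
    g_next = cv2.cvtColor(f_next, cv2.COLOR_BGR2GRAY)
    d1 = cv2.absdiff(g_next, g_curr)
    d2 = cv2.absdiff(g_curr, g_prev)
    motion = cv2.bitwise_and(d1, d2)                  # intersection of the two differences
    _, mask = cv2.threshold(motion, T, 255, cv2.THRESH_BINARY)
    return mask                                       # non-zero pixels mark the moving lip target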
In a further embodiment, the step 4 is further:
step 4-1, on a multi-user terminal, such as a safe or a door lock, face recognition is required to match whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or a tablet, face recognition is not needed and face verification suffices: the FaceNet network is used to calculate the Euclidean distance between face features, which is compared against a threshold:
||f(x_i^a) - f(x_i^p)||_2^2 + α < ||f(x_i^a) - f(x_i^n)||_2^2, for all (x_i^a, x_i^p, x_i^n) ∈ Φ
In the formula, (x_i^a, x_i^p) represents a positive sample pair, (x_i^a, x_i^n) represents a negative sample pair, x_i^a represents the anchor sample, α represents the margin constraint between positive and negative sample pairs, and Φ represents the set of triples;
introducing a neuron model:
h_{W,b}(x) = f(W^T x)
where W represents the weight vector of the neuron, W^T x denotes the transformation of the input vector x, and f(W^T x) represents the activation-function transformation;
Writing the input vector x in terms of its components x_i, W^T x expands to:
h_{W,b}(x) = f( Σ_{i=1}^{n} W_i x_i + b )
In the formula, n represents the number of stages of the neural network, and b represents the bias.
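Step 4's verification reduces to an embedding distance check plus the basic neuron model. The sketch below assumes some external model (for example a FaceNet implementation) has already produced the embeddings; the acceptance threshold and the sigmoid activation are illustrative choices, not values from the patent.

import numpy as np

def euclidean_distance(emb_a, emb_b):
    """Euclidean distance between two face embeddings (e.g. FaceNet outputs)."""
    return float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)))

def verify_face(embedding_live, embedding_enrolled, threshold=1.1):
    """Accept the face when the embedding distance is below the comparison threshold."""
    return euclidean_distance(embedding_live, embedding_enrolled) < threshold

def neuron(x, w, b):
    """Single neuron h_{W,b}(x) = f(W·x + b) with a sigmoid activation."""
    z = float(np.dot(w, x) + b)
    return 1.0 / (1.0 + np.exp(-z))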
In a further embodiment, the step 5 is further: during collection, a coordinate system is established with the lip center as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two half-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the change of the corresponding feature-point positions, i.e. the algebraic feature of inter-frame lip motion, is then extracted with the frame difference method:
Two adjacent independent frames are recorded as f(n+1) and f(n), and the gray values of the two frames are recorded as G(n+1)_{x,y} and G(n)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image BDA0002254119410000046 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, and T is the threshold.
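For step 5, the inner lip is fitted as two half-ellipses and the resulting motion feature is compared with the enrolled instruction. The sketch below approximates each half with cv2.fitEllipse over the corresponding contour points and uses a plain Euclidean distance on frame-to-frame parameter changes; both the feature encoding and the acceptance threshold are assumptions, not the patent's exact algebraic feature.

import cv2
import numpy as np

def half_ellipse_params(points):
    """Fit one half of the inner lip (at least 5 contour points, lip-centered
    coordinates) with an ellipse and return (major, minor, angle)."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 1, 2)
    (_, _), (major, minor), angle = cv2.fitEllipse(pts)
    return np.array([major, minor, angle], dtype=np.float32)

def lip_motion_feature(upper_pts_seq, lower_pts_seq):
    """Per-frame ellipse parameters of upper/lower inner lip, differenced
    between adjacent frames (a stand-in for the algebraic lip-motion feature)."""
    per_frame = [np.concatenate([half_ellipse_params(u), half_ellipse_params(l)])
                 for u, l in zip(upper_pts_seq, lower_pts_seq)]
    return np.diff(np.stack(per_frame), axis=0).ravel()

def matches_unlock_instruction(live_feature, enrolled_feature, threshold=50.0):
    """Accept when the live lip-motion feature is close enough to the enrolled template."""
    n = min(live_feature.size, enrolled_feature.size)
    return float(np.linalg.norm(live_feature[:n] - enrolled_feature[:n])) < threshold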
Beneficial effects: the invention provides a terminal unlocking method based on a lip language instruction in which the user can design the instruction action freely during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and authentication security is improved. Meanwhile, the lip language instruction unlocking method does not require large-scale computation on the terminal, which greatly reduces hardware performance requirements and increases recognition speed. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers and using the FaceNet network to calculate the Euclidean distance between face features, the invention avoids the problem of an excessively large gradient caused by accumulation in a certain quadrant of the space, improves network learning and training efficiency, achieves the effect of actively learning the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of establishing a coordinate system for lips according to the present invention.
FIG. 3 is a diagram illustrating an image containing details of a lip cut out from a lip unlock command according to the present invention.
FIG. 4 is a schematic diagram of the introduction of a neuron model according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
The applicant believes that in the field of lip language unlocking, the prior art extremely depends on deep learning, a specific single instruction model needs to be trained at a PC (personal computer) end and then deployed at a terminal for use, and a user needs to match a fixed instruction action. The method has poor effect, does not adapt to the data of the user, can only adapt to fixed command actions, and the commands are easy to expose, so that how to construct the lip language model and continuously improve the active learning of the machine are very important.
In order to solve the problems in the prior art, the invention provides a terminal unlocking method based on a lip language instruction, a user can design instruction actions by himself during collection and only needs to make the same actions during identification, so that the action instructions are not easy to steal by others, and the authentication safety is improved.
The technical scheme of the invention is further explained by the embodiment and the corresponding attached drawings.
Firstly, a terminal camera collects a lip language instruction video frame unlocked by a user, the terminal carries out face detection and extracts face features, and meanwhile, a lip region video frame is extracted; calculating a color histogram of an RGB space of each frame of a video clip, dividing each channel into 32 intervals according to pixel values, and carrying out normalization processing to obtain 96-dimensional features; forming a matrix by the characteristic vectors of each frame, performing dimensionality reduction on the matrix, and calculating an initialization clustering center:
[formula image BDA0002254119410000061 in the original, not reproduced]
In the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_{n+1} represents the feature vector of the (n+1)-th frame.
The similarity of each new frame to the current cluster center is calculated and a threshold σ is defined. When the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n; f_n is then added to C_n and the updated cluster center C_n′ is obtained:
[formula image BDA0002254119410000062 in the original, not reproduced]
In the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center.
When the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
Recognizing the outline of the face, removing the background, cutting the lips of the face in a video frame, positioning the positions of facial feature outline points in the face, including the coordinates of the nose tip, the leftmost coordinates of the lips, the rightmost coordinates of the lips and the coordinates of the center point of the mouth, cutting an image containing the details of the lips according to the coordinates, and calculating the cutting size according to a formula:
[formula image BDA0002254119410000063 in the original, not reproduced]
In the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
carrying out deviation correction on the cut lip images, training the lip images based on a binary model of a convolutional neural network, and judging whether the extracted lip images are effective images:
[formula image BDA0002254119410000064 in the original, not reproduced]
where l denotes the number of convolution layers, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the local perception field of the input, β denotes the output parameter, and down(·) denotes the pooling function.
Then, extracting characteristic points of the lip video frame data set, matching the characteristic points of adjacent frames, and marking position coordinates;
for the extracted cropped images, a D3D model is constructed to accelerate network convergence, and a loss function correction model is introduced:
[formula image BDA0002254119410000071 in the original, not reproduced]
In the formula, the first term (shown as image BDA0002254119410000076 in the original) is the cross-entropy loss, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor;
where P({Z|X}) = Σ_k P(π|X) is the sum of the probabilities formed by all paths after merging;
respectively extracting feature points from the images of two adjacent frames and obtaining two groups of feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
Taking the two adjacent sets of feature points as centers, the pixel values within a neighborhood window W around each feature point are used as that point's descriptor, and the pixel interpolation over the two sets of feature-point neighborhoods is calculated:
[formula image BDA0002254119410000073 in the original, not reproduced]
In the formula, S represents the pixel interpolation over the neighborhoods of the two sets of feature points, x represents the abscissa of a pixel, y represents the ordinate of the pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
according to the pixel interpolation obtained above, finding a matching point according to the matching coefficient between the feature point and the neighborhood window:
[formula image BDA0002254119410000074 in the original, not reproduced]
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
Then, the change of the feature-point positions, i.e. the algebraic feature of lip motion, is extracted with the frame difference method: three adjacent independent frames are recorded as f(n+1), f(n) and f(n-1), and the gray values of the three frames are recorded as G(n+1)_{x,y}, G(n)_{x,y} and G(n-1)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}| ∩ |G(n)_{x,y} - G(n-1)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image BDA0002254119410000075 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, and T is the threshold.
Step 4, the face is matched in a database: on a multi-user terminal, such as a safe or a door lock, face recognition is required to match whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or a tablet, face recognition is not needed and face verification suffices: the FaceNet network is used to calculate the Euclidean distance between face features, which is compared against a threshold:
||f(x_i^a) - f(x_i^p)||_2^2 + α < ||f(x_i^a) - f(x_i^n)||_2^2, for all (x_i^a, x_i^p, x_i^n) ∈ Φ
In the formula, (x_i^a, x_i^p) represents a positive sample pair, (x_i^a, x_i^n) represents a negative sample pair, x_i^a represents the anchor sample, α represents the margin constraint between positive and negative sample pairs, and Φ represents the set of triples;
introducing a neuron model:
h_{W,b}(x) = f(W^T x)
where W represents the weight vector of the neuron, W^T x denotes the transformation of the input vector x, and f(W^T x) represents the activation-function transformation;
Writing the input vector x in terms of its components x_i, W^T x expands to:
h_{W,b}(x) = f( Σ_{i=1}^{n} W_i x_i + b )
In the formula, n represents the number of stages of the neural network, and b represents the bias.
Step 5, if matching is successful, people need to be identified to make the same lip language instruction action towards the terminal camera, the terminal extracts lip feature points similarly, and calculates algebraic features of lip movement, and whether matching is an unlocking instruction or not; the method comprises the following steps of establishing a coordinate axis by taking the center of a lip as a coordinate origin in an acquisition process, fitting an inner lip region in a lip gray image into two semi-ellipse combinations, enabling an upper inner lip to correspond to an upper ellipse, enabling a lower inner lip to correspond to a lower ellipse, and extracting change characteristics of corresponding characteristic point positions by using a frame difference method, namely algebraic characteristics of interframe lip motion:
Two adjacent independent frames are recorded as f(n+1) and f(n), and the gray values of the two frames are recorded as G(n+1)_{x,y} and G(n)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image BDA0002254119410000086 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, and T is the threshold.
When face matching or instruction matching is unsuccessful, a matching failure is prompted, face matching in the database continues and the above steps are repeated; when matching fails more than three times, the terminal equipment is temporarily locked.
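The retry behaviour just described can be pictured as a small control loop; the callables, prompt messages and lock duration below are placeholders for whatever the terminal actually implements, not values taken from the patent.

import time

MAX_FAILURES = 3
LOCK_SECONDS = 60          # length of the temporary lock; an illustrative value

def unlock_terminal(match_face, match_lip_instruction, prompt):
    """Schematic retry loop: face match, then lip-instruction match; prompt on
    failure and temporarily lock after more than MAX_FAILURES failed attempts."""
    failures = 0
    while True:
        if match_face() and match_lip_instruction():
            return True                      # unlocked
        failures += 1
        prompt("Matching failed, please try again.")
        if failures > MAX_FAILURES:
            prompt("Too many failures, terminal temporarily locked.")
            time.sleep(LOCK_SECONDS)         # temporary lock of the terminal device
            failures = 0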
In summary, aiming at the defects of the prior art, the invention provides a terminal unlocking method based on a lip language instruction in which several frames of images are taken during collection to acquire the face and extract some key feature points. During verification, the key feature points required for face recognition are extracted in the same way, the Euclidean distance between face features is calculated with the FaceNet network, and the result is compared against a threshold. During collection, a coordinate system is established with the lip center as the origin, the inner-lip region in the lip gray image is fitted as a combination of two half-ellipses (the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse), the change of the corresponding feature-point positions, i.e. the algebraic feature of inter-frame lip motion, is extracted with the frame difference method, and a judgment threshold is calculated. During verification, the lip motion features are extracted in the same way and compared against this threshold. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers and using the FaceNet network to calculate the Euclidean distance between face features, the invention avoids the problem of an excessively large gradient caused by accumulation in a certain quadrant of the space, improves network learning and training efficiency, achieves the effect of actively learning the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A terminal unlocking method based on a lip language instruction is characterized by comprising the following steps:
step 1, the terminal camera collects video frames of the user's unlocking lip language instruction, the terminal performs face detection and extracts face features, and lip region video frames are extracted at the same time;
step 2, extracting characteristic points of the lip video frame data set, matching the characteristic points of adjacent frames and marking position coordinates;
step 3, extracting the change characteristics of the positions of the characteristic points by using a frame difference method, namely the algebraic characteristics of the lip movement;
step 4, matching the human face in a database;
step 5, if matching is successful, the person to be identified makes the same lip language instruction action towards the terminal camera; the terminal likewise extracts lip feature points, calculates the algebraic features of lip movement, and judges whether they match the unlocking instruction;
and step 6, when face matching or instruction matching is unsuccessful, a matching failure is prompted and the process jumps to step 4.
2. The method for unlocking a terminal based on a lip language command according to claim 1, wherein the step 1 further comprises:
step 1-1, calculating a color histogram of an RGB space of each frame of a video segment, dividing each channel into 32 intervals according to pixel values, and carrying out normalization processing to obtain 96-dimensional features; forming a matrix by the characteristic vectors of each frame, performing dimensionality reduction on the matrix, and calculating an initialization clustering center:
[formula image FDA0002254119400000011 in the original, not reproduced]
In the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_{n+1} represents the feature vector of the (n+1)-th frame;
the similarity of each new frame to the current cluster center is calculated and a threshold σ is defined; when the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n, f_n is then added to C_n, and the updated cluster center C_n′ is obtained:
[formula image FDA0002254119400000012 in the original, not reproduced]
In the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center;
when the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
Step 1-2, firstly, recognizing the contour of a human face, removing a background, carrying out lip cutting on the human face in a video frame, positioning the position of facial feature contour points in the human face, including the coordinates of a nose tip, the leftmost coordinates of the lips, the rightmost coordinates of the lips and the coordinates of a central point of a mouth, cutting an image containing lip details according to the coordinates, and calculating the cutting size according to a formula:
[formula image FDA0002254119400000021 in the original, not reproduced]
In the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
step 1-3, performing deviation correction on the cut lip image, training the lip image based on a binary model of a convolutional neural network, and judging whether the extracted lip image is an effective image:
[formula image FDA0002254119400000022 in the original, not reproduced]
where l denotes the number of convolution layers, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the local perception field of the input, β denotes the output parameter, and down(·) denotes the pooling function.
3. The method for unlocking a terminal based on a lip language command according to claim 1, wherein the step 2 further comprises:
step 2-1, aiming at the cropped images extracted in the step 1, a D3D model is constructed to accelerate network convergence, and a loss function correction model is introduced:
[formula image FDA0002254119400000023 in the original, not reproduced]
In the formula, the first term (shown as image FDA0002254119400000024 in the original) is the cross-entropy loss, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor;
where P({Z|X}) = Σ_k P(π|X) is the sum of the probabilities formed by all paths after merging;
step 2-2, respectively extracting feature points from the images of two adjacent frames and obtaining two sets of feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
Taking the two adjacent sets of feature points as centers, the pixel values within a neighborhood window W around each feature point are used as that point's descriptor, and the pixel interpolation over the two sets of feature-point neighborhoods is calculated:
[formula image FDA0002254119400000031 in the original, not reproduced]
In the formula, S represents the pixel interpolation over the neighborhoods of the two sets of feature points, x represents the abscissa of a pixel, y represents the ordinate of the pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
step 2-3, according to the pixel interpolation obtained in the step 2-2, finding a matching point according to a matching coefficient between the feature point and a neighborhood window:
[formula image FDA0002254119400000032 in the original, not reproduced]
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
4. The method for unlocking a terminal based on a lip language command according to claim 1, wherein the step 3 further comprises:
step 3-1, three adjacent independent frames are recorded as f(n+1), f(n) and f(n-1), and the gray values of the three frames are recorded as G(n+1)_{x,y}, G(n)_{x,y} and G(n-1)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}| ∩ |G(n)_{x,y} - G(n-1)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image FDA0002254119400000033 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, and T is the threshold.
5. The method for unlocking a terminal according to claim 1, wherein the step 4 further comprises:
step 4-1, on a multi-user terminal, such as a safe or a door lock, face recognition is required to match whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or a tablet, face recognition is not needed and face verification suffices: the FaceNet network is used to calculate the Euclidean distance between face features, which is compared against a threshold:
||f(x_i^a) - f(x_i^p)||_2^2 + α < ||f(x_i^a) - f(x_i^n)||_2^2, for all (x_i^a, x_i^p, x_i^n) ∈ Φ
In the formula, (x_i^a, x_i^p) represents a positive sample pair, (x_i^a, x_i^n) represents a negative sample pair, x_i^a represents the anchor sample, α represents the margin constraint between positive and negative sample pairs, and Φ represents the set of triples;
introducing a neuron model:
h_{W,b}(x) = f(W^T x)
where W represents the weight vector of the neuron, W^T x denotes the transformation of the input vector x, and f(W^T x) represents the activation-function transformation;
Writing the input vector x in terms of its components x_i, W^T x expands to:
h_{W,b}(x) = f( Σ_{i=1}^{n} W_i x_i + b )
In the formula, n represents the number of stages of the neural network, and b represents the bias.
6. The method for unlocking a terminal according to claim 1, wherein the step 5 further comprises: during collection, a coordinate system is established with the lip center as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two half-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the change of the corresponding feature-point positions, i.e. the algebraic feature of inter-frame lip motion, is then extracted with the frame difference method:
Two adjacent independent frames are recorded as f(n+1) and f(n), and the gray values of the two frames are recorded as G(n+1)_{x,y} and G(n)_{x,y}; the image P′ is obtained by the frame difference method:
P′ = |G(n+1)_{x,y} - G(n)_{x,y}|
The image P′ is compared with a preset threshold T to analyze the motion and extract the moving target, with the comparison condition:
[formula image FDA0002254119400000042 in the original, not reproduced]
In the formula, N represents the total number of pixels in the region to be detected, τ represents the illumination suppression coefficient, A represents the image of the entire frame, T is the threshold, G(n)_{x,y} represents the gray value of the n-th frame image, and G(n+1)_{x,y} represents the gray value of the (n+1)-th frame image.
CN201911045860.6A 2019-10-30 2019-10-30 Terminal unlocking method based on lip language instruction Active CN110929239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911045860.6A CN110929239B (en) 2019-10-30 2019-10-30 Terminal unlocking method based on lip language instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911045860.6A CN110929239B (en) 2019-10-30 2019-10-30 Terminal unlocking method based on lip language instruction

Publications (2)

Publication Number Publication Date
CN110929239A true CN110929239A (en) 2020-03-27
CN110929239B CN110929239B (en) 2021-11-19

Family

ID=69849882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911045860.6A Active CN110929239B (en) 2019-10-30 2019-10-30 Terminal unlocking method based on lip language instruction

Country Status (1)

Country Link
CN (1) CN110929239B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733807A (en) * 2021-02-22 2021-04-30 佳都新太科技股份有限公司 Face comparison graph convolution neural network training method and device
CN114220142A (en) * 2021-11-24 2022-03-22 慧之安信息技术股份有限公司 Face feature recognition method of deep learning algorithm
CN114220177A (en) * 2021-12-24 2022-03-22 湖南大学 Lip syllable recognition method, device, equipment and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226587A1 (en) * 2012-02-27 2013-08-29 Hong Kong Baptist University Lip-password Based Speaker Verification System
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
WO2016061780A1 (en) * 2014-10-23 2016-04-28 Intel Corporation Method and system of facial expression recognition using linear relationships within landmark subsets
WO2016201679A1 (en) * 2015-06-18 2016-12-22 华为技术有限公司 Feature extraction method, lip-reading classification method, device and apparatus
CN106570461A (en) * 2016-10-21 2017-04-19 哈尔滨工业大学深圳研究生院 Video frame image extraction method and system based on lip movement identification
KR101767234B1 (en) * 2016-03-21 2017-08-10 양장은 System based on pattern recognition of blood vessel in lips
CN107358085A (en) * 2017-07-28 2017-11-17 惠州Tcl移动通信有限公司 A kind of unlocking terminal equipment method, storage medium and terminal device
CN108960103A (en) * 2018-06-25 2018-12-07 西安交通大学 The identity identifying method and system that a kind of face and lip reading blend
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN110276230A (en) * 2018-03-14 2019-09-24 阿里巴巴集团控股有限公司 The method, apparatus and electronic equipment that user authentication, lip reading identify

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226587A1 (en) * 2012-02-27 2013-08-29 Hong Kong Baptist University Lip-password Based Speaker Verification System
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
WO2016061780A1 (en) * 2014-10-23 2016-04-28 Intel Corporation Method and system of facial expression recognition using linear relationships within landmark subsets
WO2016201679A1 (en) * 2015-06-18 2016-12-22 华为技术有限公司 Feature extraction method, lip-reading classification method, device and apparatus
KR101767234B1 (en) * 2016-03-21 2017-08-10 양장은 System based on pattern recognition of blood vessel in lips
CN106570461A (en) * 2016-10-21 2017-04-19 哈尔滨工业大学深圳研究生院 Video frame image extraction method and system based on lip movement identification
CN107358085A (en) * 2017-07-28 2017-11-17 惠州Tcl移动通信有限公司 A kind of unlocking terminal equipment method, storage medium and terminal device
CN110276230A (en) * 2018-03-14 2019-09-24 阿里巴巴集团控股有限公司 The method, apparatus and electronic equipment that user authentication, lip reading identify
CN108960103A (en) * 2018-06-25 2018-12-07 西安交通大学 The identity identifying method and system that a kind of face and lip reading blend
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
FLORIAN SCHROFF: "FaceNet: A Unified Embedding for Face Recognition and Clustering", 《2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
KEVIN LAI: "A Large-Scale Hierarchical Multi-View RGB-D Object Dataset", 《PROCEEDINGS - IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION·MAY 2011》 *
WEIXIN_30596343: "声纹识别(说话人识别)", 《HTTPS://BLOG.CSDN.NET/WEIXIN_30596343/DETAILS/99112089》 *
邱道尹: "帧差法在运动目标实时跟踪中的应用", 《华北水利水电学院学报》 *
郝会芬: "视频镜头分割和关键帧提取关键技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
陈春雨: "基于帧差法和边缘检测法的视频分割算法", 《济南大学学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733807A (en) * 2021-02-22 2021-04-30 佳都新太科技股份有限公司 Face comparison graph convolution neural network training method and device
CN114220142A (en) * 2021-11-24 2022-03-22 慧之安信息技术股份有限公司 Face feature recognition method of deep learning algorithm
CN114220142B (en) * 2021-11-24 2022-08-23 慧之安信息技术股份有限公司 Face feature recognition method of deep learning algorithm
CN114220177A (en) * 2021-12-24 2022-03-22 湖南大学 Lip syllable recognition method, device, equipment and medium
CN114220177B (en) * 2021-12-24 2024-06-25 湖南大学 Lip syllable recognition method, device, equipment and medium

Also Published As

Publication number Publication date
CN110929239B (en) 2021-11-19

Similar Documents

Publication Publication Date Title
CN111401257B (en) Face recognition method based on cosine loss under non-constraint condition
CN110929239B (en) Terminal unlocking method based on lip language instruction
CN111368683B (en) Face image feature extraction method and face recognition method based on modular constraint CenterFace
CN109949278B (en) Hyperspectral anomaly detection method based on antagonistic self-coding network
Shen et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network
CN108764041B (en) Face recognition method for lower shielding face image
CN111639558B (en) Finger vein authentication method based on ArcFace Loss and improved residual error network
Tian et al. Ear recognition based on deep convolutional network
CN108268859A (en) A kind of facial expression recognizing method based on deep learning
CN105069434B (en) A kind of human action Activity recognition method in video
CN112036383B (en) Hand vein-based identity recognition method and device
CN109325440B (en) Human body action recognition method and system
CN107784263B (en) Planar rotation face detection method based on improved accelerated robust features
CN109325472B (en) Face living body detection method based on depth information
CN113591747A (en) Multi-scene iris recognition method based on deep learning
Huang et al. Human emotion recognition based on face and facial expression detection using deep belief network under complicated backgrounds
Chen et al. Robust gender recognition for uncontrolled environment of real-life images
CN111950454B (en) Finger vein recognition method based on bidirectional feature extraction
Tao et al. Design of face recognition system based on convolutional neural network
Ren et al. Alignment free and distortion robust iris recognition
CN112069898A (en) Method and device for recognizing human face group attribute based on transfer learning
CN111428643A (en) Finger vein image recognition method and device, computer equipment and storage medium
CN111160121A (en) Portrait recognition system, method and device based on deep learning
CN112800959B (en) Difficult sample mining method for data fitting estimation in face recognition
CN114998966A (en) Facial expression recognition method based on feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 211000 floor 3, building 3, Qilin artificial intelligence Industrial Park, 266 Chuangyan Road, Nanjing, Jiangsu

Applicant after: Zhongke Nanjing artificial intelligence Innovation Research Institute

Applicant after: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

Address before: 211000 3rd floor, building 3, 266 Chuangyan Road, Jiangning District, Nanjing City, Jiangsu Province

Applicant before: NANJING ARTIFICIAL INTELLIGENCE CHIP INNOVATION INSTITUTE, INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

Applicant before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

GR01 Patent grant
GR01 Patent grant