Terminal unlocking method based on lip language instruction
Technical Field
The invention relates to a terminal unlocking method based on a lip language instruction, and belongs to the technical field of image information processing.
Background
At present, terminal unlocking mainly relies on the face, the fingerprint or the iris. However, such information is easy to forge, and these static identification methods are easy to crack, so the security is poor and private information is easily leaked. The invention adopts a lip-language-instruction unlocking method to realize dynamic unlocking and improve the security of authentication.
The existing lip-language unlocking technology depends heavily on deep learning: a specific single-instruction model must be trained on a PC (personal computer) and then deployed to the terminal, and the user must reproduce a fixed instruction action. This approach performs poorly, does not adapt to the user's own data, supports only fixed instruction actions, and the instructions are easily exposed.
Disclosure of Invention
Purpose of the invention: aiming at the defects of the existing unlocking technology, a terminal unlocking method based on a lip language instruction is provided.
The technical scheme is as follows: a terminal unlocking method based on a lip language instruction comprises the following steps:
step 1, the terminal camera collects video frames of the user's unlocking lip-language instruction; the terminal performs face detection and extracts face features, and at the same time extracts lip-region video frames;
step 2, extracting feature points from the lip video frames, matching the feature points of adjacent frames and marking their position coordinates;
step 3, extracting the changes of the feature point positions, i.e. the algebraic features of lip motion, by a frame difference method;
step 4, matching the face against a database;
step 5, if the matching succeeds, the person to be identified makes the same lip-language instruction action towards the terminal camera; the terminal likewise extracts the lip feature points, calculates the algebraic features of lip motion, and judges whether they match the unlocking instruction;
and step 6, when face matching or instruction matching is unsuccessful, prompting that matching has failed and jumping back to step 4.
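The flow of steps 1 to 6 can be summarized, purely as an illustrative sketch, in the Python code below; the callables passed in (capture, match_face, match_instruction, lock_device) are hypothetical placeholders for the modules described in the steps above, and the three-failure lockout mirrors the behaviour described in the detailed description.

```python
from typing import Callable, Sequence

MAX_FAILURES = 3  # assumption: the description says the terminal is temporarily
                  # locked after matching fails more than three times

def try_unlock(capture: Callable[[], Sequence],
               match_face: Callable[[Sequence], bool],
               match_instruction: Callable[[Sequence], bool],
               lock_device: Callable[[], None]) -> bool:
    """Top-level unlock loop corresponding to steps 1-6 (all callables are
    hypothetical placeholders for the modules described in the text)."""
    failures = 0
    while failures < MAX_FAILURES:
        frames = capture()                    # step 1: collect lip-instruction video frames
        if match_face(frames):                # step 4: match the face against the database
            if match_instruction(frames):     # step 5: verify the lip-language instruction
                return True                   # unlock
        failures += 1                         # step 6: prompt failure and retry
    lock_device()                             # temporary lock after repeated failures
    return False
```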
In a further embodiment, the step 1 is further:
step 1-1, calculating the color histogram in RGB space for each frame of a video segment: each channel is divided into 32 intervals by pixel value and normalized, giving a 96-dimensional feature; the feature vectors of the frames are assembled into a matrix, the matrix is reduced in dimensionality, and the initial cluster center is calculated:
in the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_(n+1) represents the feature vector of the (n+1)-th frame;
calculating the similarity of each new frame to the current cluster center and defining a threshold σ; when the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n, f_n is added to C_n, and the cluster center is updated to obtain a new center C_n′:
in the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center;
when the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
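A minimal Python sketch of step 1-1 follows, assuming OpenCV and NumPy are available; the cosine similarity measure and the running-mean center update are assumptions, since the patent's exact similarity and update formulas are not reproduced here.

```python
import cv2          # OpenCV, assumed available for histogram computation
import numpy as np

def frame_feature(frame_bgr: np.ndarray) -> np.ndarray:
    """96-dimensional feature: a 32-bin normalized histogram per color channel."""
    hists = [cv2.calcHist([frame_bgr], [c], None, [32], [0, 256]).ravel()
             for c in range(3)]
    feat = np.concatenate(hists)
    return feat / (feat.sum() + 1e-8)

def cluster_frames(frames, sigma: float = 0.9):
    """Online clustering of frames: a frame f_n joins the current cluster center C_n
    when its similarity exceeds the threshold sigma, otherwise it initializes a new
    center C_n' = f_n.  Cosine similarity and a running-mean update are assumptions."""
    centers, members = [], []
    for frame in frames:
        f = frame_feature(frame)
        if centers:
            c = centers[-1]
            sim = float(f @ c) / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-8)
            if sim > sigma:                   # f_n belongs to C_n: update the center
                members[-1].append(f)
                centers[-1] = np.mean(members[-1], axis=0)
                continue
        centers.append(f)                     # new cluster center C_n' = f_n
        members.append([f])
    return centers
```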
Step 1-2, firstly recognizing the contour of the face and removing the background; the lips of the face in the video frame are cropped by locating the facial feature contour points, including the nose-tip coordinates, the leftmost lip coordinates, the rightmost lip coordinates, and the mouth-center coordinates; an image containing the lip details is cropped according to these coordinates, and the crop size is calculated according to the formula:
in the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
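The following sketch illustrates the cropping of step 1-2 from the named landmark coordinates; the exact size formula of the patent is not reproduced, so the proportions (the margin factor and the scaling by L_MN) are illustrative assumptions.

```python
import numpy as np

def crop_lip_region(image: np.ndarray,
                    nose_tip, mouth_center, lip_left, lip_right,
                    margin: float = 0.2) -> np.ndarray:
    """Crop a lip-detail image from the landmark coordinates named in step 1-2.
    The crop proportions below (lip-corner span plus a margin, height scaled by
    the nose-tip-to-mouth-center distance L_MN) are illustrative assumptions."""
    l_mn = np.hypot(mouth_center[0] - nose_tip[0], mouth_center[1] - nose_tip[1])
    width = (lip_right[0] - lip_left[0]) * (1 + 2 * margin)
    height = l_mn * (1 + 2 * margin)
    cx, cy = mouth_center
    x0, x1 = int(cx - width / 2), int(cx + width / 2)
    y0, y1 = int(cy - height / 2), int(cy + height / 2)
    h, w = image.shape[:2]
    return image[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]
```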
step 1-3, correcting the deviation of the cropped lip image, training a binary-classification model based on a convolutional neural network on the lip images, and judging whether an extracted lip image is a valid image:
where l denotes the index of the convolution layer, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the set of input feature maps (the local receptive field), β is the output parameter, and down(·) is the pooling function.
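As an illustration of the convolution-pooling binary model of step 1-3, a minimal PyTorch sketch follows; the layer sizes, input resolution and activation are assumptions and do not reproduce the patent's architecture.

```python
import torch
import torch.nn as nn

class LipValidityNet(nn.Module):
    """Two-class CNN sketch for judging whether a cropped 64x64 grayscale image is
    a valid lip image; each stage is a convolution (kernel k, bias b) followed by
    a pooling function down(), as in the layer formula above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution with kernel k and bias b
            nn.ReLU(),
            nn.MaxPool2d(2),                             # down(): pooling
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, 2)     # valid / invalid lip image

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 1, 64, 64)
        return self.classifier(self.features(x).flatten(1))
```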
In a further embodiment, the step 2 is further:
step 2-1, for the cropped images extracted in step 1, a D3D model is constructed to accelerate network convergence, and a loss function is introduced to correct the model:
in the formula, the cross-entropy loss is denoted, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor; wherein P({Z|X}) = Σ_(k=1) P(π|X), which is the sum of the probabilities formed by all paths after merging;
step 2-2, extracting feature points from the images of two adjacent frames respectively, obtaining two feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
taking each feature point of the two adjacent sets as a center, the pixel values in its neighborhood window W are used as the descriptor of that feature point, and the pixel interpolation values of the two sets of feature-point neighborhoods are calculated respectively:
in the formula, S represents the pixel interpolation of the two feature-point neighborhoods, x represents the abscissa of a pixel, y represents the ordinate of a pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
step 2-3, according to the pixel interpolation obtained in the step 2-2, finding a matching point according to a matching coefficient between the feature point and a neighborhood window:
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
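A sketch of the feature-point matching of steps 2-2 and 2-3 follows; the neighborhood-window descriptor comes directly from the description above, while the matching coefficient C is approximated by zero-mean normalized cross-correlation, which is an assumption since the patent's exact formula is not reproduced.

```python
import numpy as np

def window(gray: np.ndarray, pt, half: int = 4):
    """Pixel values of the neighborhood window W around a feature point,
    or None when the window would fall outside the image."""
    x, y = int(pt[0]), int(pt[1])
    if x < half or y < half or x + half >= gray.shape[1] or y + half >= gray.shape[0]:
        return None
    return gray[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)

def match_points(gray_prev, gray_next, pts_prev, pts_next, half: int = 4):
    """For each feature point p of the previous frame, choose the point p' of the
    next frame whose neighborhood window maximizes the matching coefficient C
    (approximated here by zero-mean normalized cross-correlation)."""
    matches = []
    for p in pts_prev:
        wp = window(gray_prev, p, half)
        if wp is None:
            continue
        best, best_c = None, -np.inf
        for q in pts_next:
            wq = window(gray_next, q, half)
            if wq is None:
                continue
            a, b = wp - wp.mean(), wq - wq.mean()
            c = float((a * b).sum()) / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8)
            if c > best_c:
                best, best_c = q, c
        matches.append((p, best, best_c))
    return matches
```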
In a further embodiment, the step 3 is further:
step 3-1, recording three adjacent independent frames, denoted f(n+1), f(n) and f(n-1), and denoting the gray values corresponding to the three frames as G(n+1)_(x,y), G(n)_(x,y) and G(n-1)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)| ∩ |G(n)_(x,y) - G(n-1)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
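The three-frame difference of step 3-1 can be sketched directly from the formula for P′; the illumination-suppression coefficient τ of the comparison condition is omitted in this simplified version.

```python
import numpy as np

def three_frame_difference(g_prev: np.ndarray, g_curr: np.ndarray,
                           g_next: np.ndarray, T: float = 25.0) -> np.ndarray:
    """P' = |G(n+1)-G(n)| intersected with |G(n)-G(n-1)|, then thresholded with T.
    Taking the element-wise minimum and thresholding is equivalent to requiring
    both difference images to exceed T; the illumination-suppression coefficient
    tau of the comparison condition is omitted in this sketch."""
    d1 = np.abs(g_next.astype(np.int16) - g_curr.astype(np.int16))
    d2 = np.abs(g_curr.astype(np.int16) - g_prev.astype(np.int16))
    p_prime = np.minimum(d1, d2)
    return (p_prime > T).astype(np.uint8)     # 1 where the moving target is detected
```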
In a further embodiment, the step 4 is further:
step 4-1, on a multi-user terminal, such as a safe or a door lock, face recognition is required to check whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or tablet, face recognition is not needed and face verification is performed instead: the FaceNet network is used to calculate the Euclidean distance between face features, which is then compared against a threshold:
in the formula, the respective terms denote a positive sample pair, a negative sample pair and an anchor sample, α denotes the constraint margin between the positive and negative sample pairs, and Φ denotes the set of triplets;
introducing a neuron model:
h_(W,b)(x) = f(W^T x)
wherein W represents the weight vector of the neuron, W^T x represents the linear transformation of the input vector x, and f(W^T x) represents the activation function applied to this transformation;
substituting the input vector x = x_i into W^T x:
In the formula, n represents the number of stages of the neural network, and b represents an offset.
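The FaceNet-style verification and the neuron model of step 4-1 can be sketched as follows; the distance threshold, the sigmoid activation and the bias term are assumptions, and the embedding extractor itself is not shown.

```python
import numpy as np

def verify_face(emb_probe: np.ndarray, emb_enrolled: np.ndarray,
                threshold: float = 1.1) -> bool:
    """Face verification by Euclidean distance between two FaceNet-style
    embeddings; the threshold value is an assumption."""
    return float(np.linalg.norm(emb_probe - emb_enrolled)) < threshold

def neuron(x: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Single neuron h_{W,b}(x) = f(W^T x + b); the sigmoid activation is an
    assumption, since the text does not specify f."""
    z = float(w @ x) + b
    return 1.0 / (1.0 + np.exp(-z))
```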
In a further embodiment, the step 5 is further: during acquisition, a coordinate axis is established with the center of the lips as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method:
recording two adjacent independent frames, denoted f(n+1) and f(n), and denoting the gray values corresponding to the two frames as G(n+1)_(x,y) and G(n)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
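A sketch of the instruction comparison in step 5 follows; using the fraction of moving pixels per frame pair as the algebraic motion feature and a mean-absolute-difference criterion are assumptions, standing in for the semi-ellipse feature points described above.

```python
import numpy as np

def lip_motion_features(gray_frames, T: float = 25.0) -> np.ndarray:
    """Per-frame lip-motion feature: fraction of pixels whose two-frame
    difference |G(n+1)-G(n)| exceeds the threshold T (an assumed stand-in
    for the semi-ellipse feature-point features described above)."""
    feats = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        d = np.abs(nxt.astype(np.int16) - prev.astype(np.int16))
        feats.append(float((d > T).mean()))
    return np.asarray(feats)

def match_instruction(feat_probe: np.ndarray, feat_enrolled: np.ndarray,
                      tol: float = 0.1) -> bool:
    """Compare the motion-feature sequences of the probe and enrolled
    instructions; the mean-absolute-difference criterion is an assumption."""
    n = min(len(feat_probe), len(feat_enrolled))
    return float(np.abs(feat_probe[:n] - feat_enrolled[:n]).mean()) < tol
```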
Beneficial effects: the invention provides a terminal unlocking method based on a lip-language instruction. The user can design the instruction action by himself during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and the security of authentication is improved. Meanwhile, the lip-language-instruction unlocking method does not require large-scale computation on the terminal, which greatly reduces the hardware performance requirements and increases the recognition speed. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers, and using the FaceNet network to calculate the Euclidean distance of face features, the invention avoids the problem of excessively large gradients caused by accumulation in one quadrant of the space, improves the efficiency of network learning and training, achieves active learning of the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of establishing a coordinate system for lips according to the present invention.
FIG. 3 is a diagram illustrating an image containing details of a lip cut out from a lip unlock command according to the present invention.
FIG. 4 is a schematic diagram of the introduction of a neuron model according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
The applicant believes that, in the field of lip-language unlocking, the prior art depends heavily on deep learning: a specific single-instruction model must be trained on a PC (personal computer) and then deployed to the terminal, and the user must reproduce a fixed instruction action. This approach performs poorly, does not adapt to the user's own data, supports only fixed instruction actions, and the instructions are easily exposed; therefore, how to construct the lip-language model and continuously improve the machine's active learning is very important.
In order to solve the problems in the prior art, the invention provides a terminal unlocking method based on a lip-language instruction: the user can design the instruction action by himself during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and the security of authentication is improved.
The technical scheme of the invention is further explained by the embodiment and the corresponding attached drawings.
Firstly, the terminal camera collects video frames of the user's unlocking lip-language instruction; the terminal performs face detection, extracts face features, and at the same time extracts lip-region video frames. The color histogram in RGB space is calculated for each frame of a video segment: each channel is divided into 32 intervals by pixel value and normalized, giving a 96-dimensional feature; the feature vectors of the frames are assembled into a matrix, the matrix is reduced in dimensionality, and the initial cluster center is calculated:
in the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_(n+1) represents the feature vector of the (n+1)-th frame;
The similarity of each new frame to the current cluster center is calculated and a threshold σ is defined; when the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n, f_n is added to C_n, and the cluster center is updated to obtain a new center C_n′:
in the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center;
when the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
The contour of the face is recognized and the background is removed; the lips of the face in the video frame are cropped by locating the facial feature contour points, including the nose-tip coordinates, the leftmost lip coordinates, the rightmost lip coordinates, and the mouth-center coordinates; an image containing the lip details is cropped according to these coordinates, and the crop size is calculated according to the formula:
in the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
The cropped lip image is corrected for deviation, a binary-classification model based on a convolutional neural network is trained on the lip images, and whether an extracted lip image is a valid image is judged:
where l denotes the index of the convolution layer, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the set of input feature maps (the local receptive field), β is the output parameter, and down(·) is the pooling function.
Then, extracting characteristic points of the lip video frame data set, matching the characteristic points of adjacent frames, and marking position coordinates;
For the extracted cropped images, a D3D model is constructed to accelerate network convergence, and a loss function is introduced to correct the model:
in the formula, the cross-entropy loss is denoted, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor; wherein P({Z|X}) = Σ_(k=1) P(π|X), which is the sum of the probabilities formed by all paths after merging;
Feature points are extracted from the images of two adjacent frames respectively, giving two feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
taking each feature point of the two adjacent sets as a center, the pixel values in its neighborhood window W are used as the descriptor of that feature point, and the pixel interpolation values of the two sets of feature-point neighborhoods are calculated respectively:
in the formula, S represents the pixel interpolation of the two feature-point neighborhoods, x represents the abscissa of a pixel, y represents the ordinate of a pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
according to the pixel interpolation obtained above, finding a matching point according to the matching coefficient between the feature point and the neighborhood window:
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
Then, the changes of the feature point positions, i.e. the algebraic features of lip motion, are extracted by the frame difference method: three adjacent independent frames are recorded, denoted f(n+1), f(n) and f(n-1), and the gray values corresponding to the three frames are denoted G(n+1)_(x,y), G(n)_(x,y) and G(n-1)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)| ∩ |G(n)_(x,y) - G(n-1)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
Step 4, matching the face against a database: on a multi-user terminal, such as a safe or a door lock, face recognition is required to check whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or tablet, face recognition is not needed and face verification is performed instead: the FaceNet network is used to calculate the Euclidean distance between face features, which is then compared against a threshold:
in the formula, the respective terms denote a positive sample pair, a negative sample pair and an anchor sample, α denotes the constraint margin between the positive and negative sample pairs, and Φ denotes the set of triplets;
introducing a neuron model:
h_(W,b)(x) = f(W^T x)
wherein W represents the weight vector of the neuron, W^T x represents the linear transformation of the input vector x, and f(W^T x) represents the activation function applied to this transformation;
substituting the input vector x = x_i into W^T x:
In the formula, n represents the number of stages of the neural network, and b represents an offset.
Step 5, if the matching succeeds, the person to be identified makes the same lip-language instruction action towards the terminal camera; the terminal likewise extracts the lip feature points, calculates the algebraic features of lip motion, and judges whether they match the unlocking instruction. During acquisition, a coordinate axis is established with the center of the lips as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method:
recording two adjacent independent frames, denoted f(n+1) and f(n), and denoting the gray values corresponding to the two frames as G(n+1)_(x,y) and G(n)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
When face matching or instruction matching is unsuccessful, a matching-failure prompt is given, the face continues to be matched against the database, and the above steps are repeated; when matching fails more than three times, the terminal device is temporarily locked.
In summary, aiming at the defects of the prior art, the invention provides a terminal unlocking method based on a lip-language instruction. During acquisition, several frames of images are taken to collect the face and part of the key feature points are extracted. During verification, the key feature points required for face recognition are extracted in the same way, the FaceNet network is used to calculate the Euclidean distance between face features, and the distance is compared against a threshold. During acquisition, a coordinate axis is established with the center of the lips as the origin, the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses (the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse), the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method, and a judgment threshold is calculated. During verification, the lip motion features are extracted in the same way and compared. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers, and using the FaceNet network to calculate the Euclidean distance of face features, the invention avoids the problem of excessively large gradients caused by accumulation in one quadrant of the space, improves the efficiency of network learning and training, achieves active learning of the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.