Terminal unlocking method based on lip language instruction
Technical Field
The invention relates to a terminal unlocking method based on a lip language instruction, and belongs to the technical field of image information processing.
Background
At present, terminal unlocking mainly relies on the face, the fingerprint or the iris. However, such information is easy to forge, and these static identification methods are easy to crack, so the security is poor and private information is easily leaked. The invention adopts a lip-language-instruction unlocking method to realize dynamic unlocking and improve the security of authentication.
The existing lip-language unlocking technology depends heavily on deep learning: a specific single-instruction model must be trained on a PC (personal computer) and then deployed to the terminal, and the user must reproduce a fixed instruction action. This approach performs poorly, does not adapt to the user's own data, supports only fixed instruction actions, and the instructions are easily exposed.
Disclosure of Invention
Purpose of the invention: aiming at the defects of the existing unlocking technology, a terminal unlocking method based on a lip language instruction is provided.
The technical scheme is as follows: a terminal unlocking method based on a lip language instruction comprises the following steps:
step 1, the terminal camera collects video frames of the user's unlocking lip-language instruction; the terminal performs face detection and extracts face features, and at the same time extracts lip-region video frames;
step 2, extracting feature points from the lip video frames, matching the feature points of adjacent frames and marking their position coordinates;
step 3, extracting the changes of the feature point positions, i.e. the algebraic features of lip motion, by a frame difference method;
step 4, matching the face against a database;
step 5, if the matching succeeds, the person to be identified makes the same lip-language instruction action towards the terminal camera; the terminal likewise extracts the lip feature points, calculates the algebraic features of lip motion, and judges whether they match the unlocking instruction;
and step 6, when face matching or instruction matching is unsuccessful, prompting that matching has failed and jumping back to step 4.
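The flow of steps 1 to 6 can be summarized, purely as an illustrative sketch, in the Python code below; the callables passed in (capture, match_face, match_instruction, lock_device) are hypothetical placeholders for the modules described in the steps above, and the three-failure lockout mirrors the behaviour described in the detailed description.

```python
from typing import Callable, Sequence

MAX_FAILURES = 3  # assumption: the description says the terminal is temporarily
                  # locked after matching fails more than three times

def try_unlock(capture: Callable[[], Sequence],
               match_face: Callable[[Sequence], bool],
               match_instruction: Callable[[Sequence], bool],
               lock_device: Callable[[], None]) -> bool:
    """Top-level unlock loop corresponding to steps 1-6 (all callables are
    hypothetical placeholders for the modules described in the text)."""
    failures = 0
    while failures < MAX_FAILURES:
        frames = capture()                    # step 1: collect lip-instruction video frames
        if match_face(frames):                # step 4: match the face against the database
            if match_instruction(frames):     # step 5: verify the lip-language instruction
                return True                   # unlock
        failures += 1                         # step 6: prompt failure and retry
    lock_device()                             # temporary lock after repeated failures
    return False
```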
In a further embodiment, the step 1 is further:
step 1-1, calculating the color histogram in RGB space for each frame of a video segment: each channel is divided into 32 intervals by pixel value and normalized, giving a 96-dimensional feature; the feature vectors of the frames are assembled into a matrix, the matrix is reduced in dimensionality, and the initial cluster center is calculated:
in the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_(n+1) represents the feature vector of the (n+1)-th frame;
calculating the similarity of each new frame to the current cluster center and defining a threshold σ; when the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n, f_n is added to C_n, and the cluster center is updated to obtain a new center C_n′:
in the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center;
when the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
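A minimal Python sketch of step 1-1 follows, assuming OpenCV and NumPy are available; the cosine similarity measure and the running-mean center update are assumptions, since the patent's exact similarity and update formulas are not reproduced here.

```python
import cv2          # OpenCV, assumed available for histogram computation
import numpy as np

def frame_feature(frame_bgr: np.ndarray) -> np.ndarray:
    """96-dimensional feature: a 32-bin normalized histogram per color channel."""
    hists = [cv2.calcHist([frame_bgr], [c], None, [32], [0, 256]).ravel()
             for c in range(3)]
    feat = np.concatenate(hists)
    return feat / (feat.sum() + 1e-8)

def cluster_frames(frames, sigma: float = 0.9):
    """Online clustering of frames: a frame f_n joins the current cluster center C_n
    when its similarity exceeds the threshold sigma, otherwise it initializes a new
    center C_n' = f_n.  Cosine similarity and a running-mean update are assumptions."""
    centers, members = [], []
    for frame in frames:
        f = frame_feature(frame)
        if centers:
            c = centers[-1]
            sim = float(f @ c) / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-8)
            if sim > sigma:                   # f_n belongs to C_n: update the center
                members[-1].append(f)
                centers[-1] = np.mean(members[-1], axis=0)
                continue
        centers.append(f)                     # new cluster center C_n' = f_n
        members.append([f])
    return centers
```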
Step 1-2, firstly recognizing the contour of the face and removing the background; the lips of the face in the video frame are cropped by locating the facial feature contour points, including the nose-tip coordinates, the leftmost lip coordinates, the rightmost lip coordinates, and the mouth-center coordinates; an image containing the lip details is cropped according to these coordinates, and the crop size is calculated according to the formula:
in the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
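The following sketch illustrates the cropping of step 1-2 from the named landmark coordinates; the exact size formula of the patent is not reproduced, so the proportions (the margin factor and the scaling by L_MN) are illustrative assumptions.

```python
import numpy as np

def crop_lip_region(image: np.ndarray,
                    nose_tip, mouth_center, lip_left, lip_right,
                    margin: float = 0.2) -> np.ndarray:
    """Crop a lip-detail image from the landmark coordinates named in step 1-2.
    The crop proportions below (lip-corner span plus a margin, height scaled by
    the nose-tip-to-mouth-center distance L_MN) are illustrative assumptions."""
    l_mn = np.hypot(mouth_center[0] - nose_tip[0], mouth_center[1] - nose_tip[1])
    width = (lip_right[0] - lip_left[0]) * (1 + 2 * margin)
    height = l_mn * (1 + 2 * margin)
    cx, cy = mouth_center
    x0, x1 = int(cx - width / 2), int(cx + width / 2)
    y0, y1 = int(cy - height / 2), int(cy + height / 2)
    h, w = image.shape[:2]
    return image[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]
```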
step 1-3, correcting the deviation of the cropped lip image, training a binary-classification model based on a convolutional neural network on the lip images, and judging whether an extracted lip image is a valid image:
where l denotes the index of the convolution layer, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the set of input feature maps (the local receptive field), β is the output parameter, and down(·) is the pooling function.
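As an illustration of the convolution-pooling binary model of step 1-3, a minimal PyTorch sketch follows; the layer sizes, input resolution and activation are assumptions and do not reproduce the patent's architecture.

```python
import torch
import torch.nn as nn

class LipValidityNet(nn.Module):
    """Two-class CNN sketch for judging whether a cropped 64x64 grayscale image is
    a valid lip image; each stage is a convolution (kernel k, bias b) followed by
    a pooling function down(), as in the layer formula above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution with kernel k and bias b
            nn.ReLU(),
            nn.MaxPool2d(2),                             # down(): pooling
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, 2)     # valid / invalid lip image

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 1, 64, 64)
        return self.classifier(self.features(x).flatten(1))
```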
In a further embodiment, the step 2 is further:
step 2-1, for the cropped images extracted in step 1, a D3D model is constructed to accelerate network convergence, and a loss function is introduced to correct the model:
in the formula, the cross-entropy loss is denoted, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor; wherein P({Z|X}) = Σ_(k=1) P(π|X), which is the sum of the probabilities formed by all paths after merging;
step 2-2, extracting feature points from the images of two adjacent frames respectively, obtaining two feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
taking each feature point of the two adjacent sets as a center, the pixel values in its neighborhood window W are used as the descriptor of that feature point, and the pixel interpolation values of the two sets of feature-point neighborhoods are calculated respectively:
in the formula, S represents the pixel interpolation of the two feature-point neighborhoods, x represents the abscissa of a pixel, y represents the ordinate of a pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
step 2-3, according to the pixel interpolation obtained in the step 2-2, finding a matching point according to a matching coefficient between the feature point and a neighborhood window:
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
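A sketch of the feature-point matching of steps 2-2 and 2-3 follows; the neighborhood-window descriptor comes directly from the description above, while the matching coefficient C is approximated by zero-mean normalized cross-correlation, which is an assumption since the patent's exact formula is not reproduced.

```python
import numpy as np

def window(gray: np.ndarray, pt, half: int = 4):
    """Pixel values of the neighborhood window W around a feature point,
    or None when the window would fall outside the image."""
    x, y = int(pt[0]), int(pt[1])
    if x < half or y < half or x + half >= gray.shape[1] or y + half >= gray.shape[0]:
        return None
    return gray[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)

def match_points(gray_prev, gray_next, pts_prev, pts_next, half: int = 4):
    """For each feature point p of the previous frame, choose the point p' of the
    next frame whose neighborhood window maximizes the matching coefficient C
    (approximated here by zero-mean normalized cross-correlation)."""
    matches = []
    for p in pts_prev:
        wp = window(gray_prev, p, half)
        if wp is None:
            continue
        best, best_c = None, -np.inf
        for q in pts_next:
            wq = window(gray_next, q, half)
            if wq is None:
                continue
            a, b = wp - wp.mean(), wq - wq.mean()
            c = float((a * b).sum()) / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8)
            if c > best_c:
                best, best_c = q, c
        matches.append((p, best, best_c))
    return matches
```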
In a further embodiment, the step 3 is further:
step 3-1, recording three adjacent independent frames, denoted f(n+1), f(n) and f(n-1), and denoting the gray values corresponding to the three frames as G(n+1)_(x,y), G(n)_(x,y) and G(n-1)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)| ∩ |G(n)_(x,y) - G(n-1)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
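The three-frame difference of step 3-1 can be sketched directly from the formula for P′; the illumination-suppression coefficient τ of the comparison condition is omitted in this simplified version.

```python
import numpy as np

def three_frame_difference(g_prev: np.ndarray, g_curr: np.ndarray,
                           g_next: np.ndarray, T: float = 25.0) -> np.ndarray:
    """P' = |G(n+1)-G(n)| intersected with |G(n)-G(n-1)|, then thresholded with T.
    Taking the element-wise minimum and thresholding is equivalent to requiring
    both difference images to exceed T; the illumination-suppression coefficient
    tau of the comparison condition is omitted in this sketch."""
    d1 = np.abs(g_next.astype(np.int16) - g_curr.astype(np.int16))
    d2 = np.abs(g_curr.astype(np.int16) - g_prev.astype(np.int16))
    p_prime = np.minimum(d1, d2)
    return (p_prime > T).astype(np.uint8)     # 1 where the moving target is detected
```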
In a further embodiment, the step 4 is further:
step 4-1, on a multi-user terminal, such as a safe or a door lock, face recognition is required to check whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or tablet, face recognition is not needed and face verification is performed instead: the FaceNet network is used to calculate the Euclidean distance between face features, which is then compared against a threshold:
in the formula, the respective terms denote a positive sample pair, a negative sample pair and an anchor sample, α denotes the constraint margin between the positive and negative sample pairs, and Φ denotes the set of triplets;
introducing a neuron model:
h_(W,b)(x) = f(W^T x)
wherein W represents the weight vector of the neuron, W^T x represents the linear transformation of the input vector x, and f(W^T x) represents the activation function applied to this transformation;
substituting the input vector x = x_i into W^T x:
In the formula, n represents the number of stages of the neural network, and b represents an offset.
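The FaceNet-style verification and the neuron model of step 4-1 can be sketched as follows; the distance threshold, the sigmoid activation and the bias term are assumptions, and the embedding extractor itself is not shown.

```python
import numpy as np

def verify_face(emb_probe: np.ndarray, emb_enrolled: np.ndarray,
                threshold: float = 1.1) -> bool:
    """Face verification by Euclidean distance between two FaceNet-style
    embeddings; the threshold value is an assumption."""
    return float(np.linalg.norm(emb_probe - emb_enrolled)) < threshold

def neuron(x: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Single neuron h_{W,b}(x) = f(W^T x + b); the sigmoid activation is an
    assumption, since the text does not specify f."""
    z = float(w @ x) + b
    return 1.0 / (1.0 + np.exp(-z))
```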
In a further embodiment, the step 5 is further: during acquisition, a coordinate axis is established with the center of the lips as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method:
recording two adjacent independent frames, denoted f(n+1) and f(n), and denoting the gray values corresponding to the two frames as G(n+1)_(x,y) and G(n)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
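A sketch of the instruction comparison in step 5 follows; using the fraction of moving pixels per frame pair as the algebraic motion feature and a mean-absolute-difference criterion are assumptions, standing in for the semi-ellipse feature points described above.

```python
import numpy as np

def lip_motion_features(gray_frames, T: float = 25.0) -> np.ndarray:
    """Per-frame lip-motion feature: fraction of pixels whose two-frame
    difference |G(n+1)-G(n)| exceeds the threshold T (an assumed stand-in
    for the semi-ellipse feature-point features described above)."""
    feats = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        d = np.abs(nxt.astype(np.int16) - prev.astype(np.int16))
        feats.append(float((d > T).mean()))
    return np.asarray(feats)

def match_instruction(feat_probe: np.ndarray, feat_enrolled: np.ndarray,
                      tol: float = 0.1) -> bool:
    """Compare the motion-feature sequences of the probe and enrolled
    instructions; the mean-absolute-difference criterion is an assumption."""
    n = min(len(feat_probe), len(feat_enrolled))
    return float(np.abs(feat_probe[:n] - feat_enrolled[:n]).mean()) < tol
```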
Beneficial effects: the invention provides a terminal unlocking method based on a lip-language instruction. The user can design the instruction action by himself during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and the security of authentication is improved. Meanwhile, the lip-language-instruction unlocking method does not require large-scale computation on the terminal, which greatly reduces the hardware performance requirements and increases the recognition speed. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers, and using the FaceNet network to calculate the Euclidean distance of face features, the invention avoids the problem of excessively large gradients caused by accumulation in one quadrant of the space, improves the efficiency of network learning and training, achieves active learning of the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of establishing a coordinate system for lips according to the present invention.
FIG. 3 is a diagram illustrating an image containing details of a lip cut out from a lip unlock command according to the present invention.
FIG. 4 is a schematic diagram of the introduction of a neuron model according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
The applicant believes that, in the field of lip-language unlocking, the prior art depends heavily on deep learning: a specific single-instruction model must be trained on a PC (personal computer) and then deployed to the terminal, and the user must reproduce a fixed instruction action. This approach performs poorly, does not adapt to the user's own data, supports only fixed instruction actions, and the instructions are easily exposed; therefore, how to construct the lip-language model and continuously improve the machine's active learning is very important.
In order to solve the problems in the prior art, the invention provides a terminal unlocking method based on a lip-language instruction: the user can design the instruction action by himself during collection and only needs to repeat the same action during identification, so the action instruction is not easily stolen by others and the security of authentication is improved.
The technical scheme of the invention is further explained by the embodiment and the corresponding attached drawings.
Firstly, the terminal camera collects video frames of the user's unlocking lip-language instruction; the terminal performs face detection, extracts face features, and at the same time extracts lip-region video frames. The color histogram in RGB space is calculated for each frame of a video segment: each channel is divided into 32 intervals by pixel value and normalized, giving a 96-dimensional feature; the feature vectors of the frames are assembled into a matrix, the matrix is reduced in dimensionality, and the initial cluster center is calculated:
in the formula, C_n represents the cluster center of the n-th segment, f_n represents the feature vector of the n-th frame, and f_(n+1) represents the feature vector of the (n+1)-th frame;
The similarity of each new frame to the current cluster center is calculated and a threshold σ is defined; when the similarity is greater than the threshold, f_n is judged to belong to the cluster center C_n, f_n is added to C_n, and the cluster center is updated to obtain a new center C_n′:
in the formula, f_n represents the feature vector of the n-th frame, C_n represents the cluster center of the n-th segment, and C_n′ represents the updated cluster center;
when the similarity is smaller than the threshold, f_n is judged to belong to a new cluster center, and f_n is used to initialize the new cluster center C_n′:
C_n′ = f_n
The contour of the face is recognized and the background is removed; the lips of the face in the video frame are cropped by locating the facial feature contour points, including the nose-tip coordinates, the leftmost lip coordinates, the rightmost lip coordinates, and the mouth-center coordinates; an image containing the lip details is cropped according to these coordinates, and the crop size is calculated according to the formula:
in the formula, L_MN represents the distance between the nose-tip coordinates and the mouth-center coordinates, x_right and y_right represent the abscissa and ordinate of the rightmost lip feature point, and x_left and y_left represent the abscissa and ordinate of the leftmost lip feature point;
The cropped lip image is corrected for deviation, a binary-classification model based on a convolutional neural network is trained on the lip images, and whether an extracted lip image is a valid image is judged:
where l denotes the index of the convolution layer, k denotes the convolution kernel, b denotes the convolution bias, M_j represents the set of input feature maps (the local receptive field), β is the output parameter, and down(·) is the pooling function.
Then, extracting characteristic points of the lip video frame data set, matching the characteristic points of adjacent frames, and marking position coordinates;
For the extracted cropped images, a D3D model is constructed to accelerate network convergence, and a loss function is introduced to correct the model:
in the formula, the cross-entropy loss is denoted, {y_i = k} is an indicator function, local(pre) denotes the network output probability, and σ is a scaling factor; wherein P({Z|X}) = Σ_(k=1) P(π|X), which is the sum of the probabilities formed by all paths after merging;
Feature points are extracted from the images of two adjacent frames respectively, giving two feature point sets:
p = {p_1, p_2, p_3, …, p_n}
p′ = {p_1′, p_2′, p_3′, …, p_n′}
taking each feature point of the two adjacent sets as a center, the pixel values in its neighborhood window W are used as the descriptor of that feature point, and the pixel interpolation values of the two sets of feature-point neighborhoods are calculated respectively:
in the formula, S represents the pixel interpolation of the two feature-point neighborhoods, x represents the abscissa of a pixel, y represents the ordinate of a pixel, W represents the neighborhood window used as the descriptor, p represents the previous frame image, and p′ represents the next frame image;
according to the pixel interpolation obtained above, finding a matching point according to the matching coefficient between the feature point and the neighborhood window:
in the formula, G represents the gray value of the previous frame image, G' represents the gray value of the next frame image, C represents the matching coefficient, and the other symbols have the same meanings as above.
Then, the changes of the feature point positions, i.e. the algebraic features of lip motion, are extracted by the frame difference method: three adjacent independent frames are recorded, denoted f(n+1), f(n) and f(n-1), and the gray values corresponding to the three frames are denoted G(n+1)_(x,y), G(n)_(x,y) and G(n-1)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)| ∩ |G(n)_(x,y) - G(n-1)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
Step 4, matching the face against a database: on a multi-user terminal, such as a safe or a door lock, face recognition is required to check whether the user's face exists in the database; on a single-user private terminal, such as a mobile phone or tablet, face recognition is not needed and face verification is performed instead: the FaceNet network is used to calculate the Euclidean distance between face features, which is then compared against a threshold:
in the formula, the respective terms denote a positive sample pair, a negative sample pair and an anchor sample, α denotes the constraint margin between the positive and negative sample pairs, and Φ denotes the set of triplets;
introducing a neuron model:
h_(W,b)(x) = f(W^T x)
wherein W represents the weight vector of the neuron, W^T x represents the linear transformation of the input vector x, and f(W^T x) represents the activation function applied to this transformation;
substituting the input vector x = x_i into W^T x:
In the formula, n represents the number of stages of the neural network, and b represents an offset.
Step 5, if the matching succeeds, the person to be identified makes the same lip-language instruction action towards the terminal camera; the terminal likewise extracts the lip feature points, calculates the algebraic features of lip motion, and judges whether they match the unlocking instruction. During acquisition, a coordinate axis is established with the center of the lips as the origin, and the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses, the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse; the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method:
recording two adjacent independent frames, denoted f(n+1) and f(n), and denoting the gray values corresponding to the two frames as G(n+1)_(x,y) and G(n)_(x,y); an image P′ is obtained by the frame difference method:
P′ = |G(n+1)_(x,y) - G(n)_(x,y)|
comparing the image P′ with a preset threshold T to analyze the motion and extract the moving target, with the comparison conditions as follows:
in the formula, N represents the total number of pixels in the region to be detected, τ represents the suppression coefficient of illumination, a represents the image of the entire frame, and T is a threshold.
When face matching or instruction matching is unsuccessful, a matching-failure prompt is given, the face continues to be matched against the database, and the above steps are repeated; when matching fails more than three times, the terminal device is temporarily locked.
In summary, aiming at the defects of the prior art, the invention provides a terminal unlocking method based on a lip-language instruction. During acquisition, several frames of images are taken to collect the face and part of the key feature points are extracted. During verification, the key feature points required for face recognition are extracted in the same way, the FaceNet network is used to calculate the Euclidean distance between face features, and the distance is compared against a threshold. During acquisition, a coordinate axis is established with the center of the lips as the origin, the inner-lip region in the lip gray image is fitted as a combination of two semi-ellipses (the upper inner lip corresponding to the upper ellipse and the lower inner lip to the lower ellipse), the changes of the corresponding feature point positions, i.e. the algebraic features of inter-frame lip motion, are extracted by the frame difference method, and a judgment threshold is calculated. During verification, the lip motion features are extracted in the same way and compared. By performing matrix dimensionality reduction, extracting feature points, initializing cluster centers, and using the FaceNet network to calculate the Euclidean distance of face features, the invention avoids the problem of excessively large gradients caused by accumulation in one quadrant of the space, improves the efficiency of network learning and training, achieves active learning of the training model, and solves the problem that traditional fixed instruction actions are easily exposed.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.