CN112784696B - Lip language identification method, device, equipment and storage medium based on image identification - Google Patents

Lip language identification method, device, equipment and storage medium based on image identification Download PDF

Info

Publication number
CN112784696B
CN112784696B CN202011635782.8A CN202011635782A CN112784696B CN 112784696 B CN112784696 B CN 112784696B CN 202011635782 A CN202011635782 A CN 202011635782A CN 112784696 B CN112784696 B CN 112784696B
Authority
CN
China
Prior art keywords
lip
pronunciation
sequence
model
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011635782.8A
Other languages
Chinese (zh)
Other versions
CN112784696A (en
Inventor
周亚云
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011635782.8A priority Critical patent/CN112784696B/en
Publication of CN112784696A publication Critical patent/CN112784696A/en
Application granted granted Critical
Publication of CN112784696B publication Critical patent/CN112784696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a lip language identification method, device and equipment based on image identification and a storage medium. The method comprises the following steps: acquiring multi-frame face images of a lip language user in real time, and performing key point detection and lip region positioning to obtain lip region images corresponding to the face images; sequentially carrying out feature extraction on lip region images corresponding to the face images to obtain a lip feature sequence of a lip language user; inputting the lip feature sequence into a preset lip recognition model, and outputting a pronunciation phoneme sequence corresponding to a lip user; converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence; and performing audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting. According to the invention, the lip expression sentences can be identified from the lip image data collected in real time, and broadcast is carried out, so that silent lip can be sounded in real time.

Description

Lip language identification method, device, equipment and storage medium based on image identification
Technical Field
The invention relates to the field of artificial intelligence, in particular to a lip language identification method, device and equipment based on image identification and a storage medium.
Background
At present, a large number of people lost sound caused by diseases, accidents or congenital defects exist in the world, and the people cannot smoothly communicate with the outside through sending sound. Generally speaking, the speaker can hear the voice information of the normal person, but how to express his own ideas makes the other person understand it difficult, for example, the communication is difficult to be performed normally without the non-speaker having been trained by the recognition of the professional sign language or the lip language, or without the speaker writing the text, etc. Even though the aphonia crowd can communicate through characters, the communication efficiency is low. The lip language is used for reading the speaking of the other person by looking at the lip movements of the other person when the other person speaks, and the requirements on the educational background are not needed, but the lip language recognition capability is required in a professional way.
In order to solve the problem that the human interpretation of the lip language needs professional training, the machine learning and deep learning technology can be utilized to train the machine to recognize the lip language expression of the human, and the lip language expression is broadcasted through a player so as to realize real-time barrier-free communication between the aphonia crowd and the normal people. Most of the existing lip language recognition schemes comprise mouth detection, mouth segmentation, mouth normalization, feature extraction and construction of a lip language classifier, the accuracy is about 20% -60%, the lip language recognition method belongs to lower lip language recognition accuracy, and the function of directly broadcasting voice is not achieved, so that lip language recognition still stays on a character layer with larger error.
Disclosure of Invention
The invention mainly aims to solve the technical problem that the existing modeling mode of lip language identification is simple and the identification accuracy is affected.
The first aspect of the invention provides a lip language identification method based on image identification, which comprises the following steps:
collecting multi-frame face images of a lip language user in real time;
Sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
Sequentially extracting features of lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
Inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip user;
converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
and converting the audio frequency of the target natural sentence to obtain lip language pronunciation and broadcasting.
Optionally, in a first implementation manner of the first aspect of the present invention, the sequentially performing key point detection and lip area positioning on each face image, and obtaining a lip area image corresponding to each face image includes:
Inputting the face image data into a face recognition model in sequence to detect key points, so as to obtain face key points in the face images;
Determining the mouth angle key points in the face images according to the marking information corresponding to the face key points;
and determining lip areas corresponding to the face images according to the key points of the mouth angles of the face images, and performing screenshot to obtain lip area images corresponding to the face images.
Optionally, in a second implementation manner of the first aspect of the present invention, the sequentially extracting features of the lip area images corresponding to the face images, to obtain a lip feature sequence of the lip language user includes:
Aligning the lip region image corresponding to each face image with a preset standard mouth image;
Calculating offset and rotation factors of the lip region images relative to the standard mouth image to obtain lip feature vectors corresponding to the lip region images;
and sequentially splicing lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, in a third implementation manner of the first aspect of the present invention, before the acquiring, in real time, a multi-frame face image of the lip language user, the method further includes:
acquiring a plurality of lip region image samples with pronunciation phoneme labels;
Extracting lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples;
Initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks;
Inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
Inputting the first sound vectors into each second RNN network of the decoder to perform sound mapping, so as to obtain sound phoneme prediction results corresponding to each first sound vector;
Calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
Judging whether the end-to-end neural network model is converged or not according to the model loss value;
and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuously reversely inputting the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model.
Optionally, in a fourth implementation manner of the first aspect of the present invention, inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip user includes:
inputting the lip feature sequence into an encoder of the lip language identification model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
And inputting the second pronunciation vector into a decoder of the lip language identification model to carry out pronunciation mapping, so as to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the statistical language model includes: the step of converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence comprises the following steps of:
Inputting the respective natural sentences into the forward LSTM network according to the word sequence positive sequence to perform network calculation, so as to obtain a first prediction result of the respective natural sentences;
Inputting the respective natural sentences into the reverse LSTM network in reverse order according to word sequence to perform network calculation, so as to obtain a second prediction result of the respective natural sentences;
And calculating the average value of the first prediction result and the second prediction result to obtain the corresponding score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format includes:
Each pronunciation phoneme in the pronunciation phoneme sequence is used as a query keyword, and a preset phoneme pronunciation mapping table is queried to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the IDs of the phonemes to obtain a plurality of words corresponding to the phonemes;
And combining words corresponding to each phoneme according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a text format.
The second aspect of the present invention provides a lip language recognition device based on image recognition, comprising:
the image acquisition module is used for acquiring multi-frame face images of the lip language user in real time;
The lip positioning module is used for sequentially carrying out key point detection and lip area positioning on each face image to obtain a lip area image corresponding to each face image;
the feature extraction module is used for sequentially carrying out feature extraction on the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
the sequence recognition module is used for inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition and outputting a pronunciation phoneme sequence corresponding to the lip user;
The sentence scoring module is used for converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
And the lip broadcasting module is used for converting the audio frequency of the target natural sentence to obtain lip pronunciation and broadcasting.
Optionally, in a first implementation manner of the second aspect of the present invention, the lip positioning module is specifically configured to:
Inputting the face image data into a face recognition model in sequence to detect key points, so as to obtain face key points in the face images;
Determining the mouth angle key points in the face images according to the marking information corresponding to the face key points;
and determining lip areas corresponding to the face images according to the key points of the mouth angles of the face images, and performing screenshot to obtain lip area images corresponding to the face images.
Optionally, in a second implementation manner of the second aspect of the present invention, the feature extraction module is specifically configured to:
Aligning the lip region image corresponding to each face image with a preset standard mouth image;
Calculating offset and rotation factors of the lip region images relative to the standard mouth image to obtain lip feature vectors corresponding to the lip region images;
and sequentially splicing lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, in a third implementation manner of the second aspect of the present invention, the image recognition-based lip language recognition device further includes:
The sample acquisition module is used for acquiring a plurality of lip region image samples with pronunciation phoneme labels; extracting lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples;
A model prediction module for initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks; inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples; inputting the first sound vectors into each second RNN network of the decoder to perform sound mapping, so as to obtain sound phoneme prediction results corresponding to each first sound vector;
the loss calculation module is used for calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
The model generation module is used for judging whether the end-to-end neural network model is converged according to the model loss value; and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuously reversely inputting the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the sequence identifying module is specifically configured to:
inputting the lip feature sequence into an encoder of the lip language identification model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
And inputting the second pronunciation vector into a decoder of the lip language identification model to carry out pronunciation mapping, so as to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the statistical language model includes: the statement scoring module is specifically configured to:
Inputting the respective natural sentences into the forward LSTM network according to the word sequence positive sequence to perform network calculation, so as to obtain a first prediction result of the respective natural sentences;
Inputting the respective natural sentences into the reverse LSTM network in reverse order according to word sequence to perform network calculation, so as to obtain a second prediction result of the respective natural sentences;
And calculating the average value of the first prediction result and the second prediction result to obtain the corresponding score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the sentence scoring module is further configured to:
Each pronunciation phoneme in the pronunciation phoneme sequence is used as a query keyword, and a preset phoneme pronunciation mapping table is queried to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the IDs of the phonemes to obtain a plurality of words corresponding to the phonemes;
And combining words corresponding to each phoneme according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a text format.
A third aspect of the present invention provides a lip-language recognition apparatus based on image recognition, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the image recognition-based lip recognition device to perform the image recognition-based lip recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the above-described image recognition-based lip language recognition method.
In the technical scheme provided by the invention, in order to enable a lip language user incapable of communicating own ideas through sound to break a communication barrier and make a sound in real time, the face animation of the lip language user is required to be acquired in real time, a plurality of continuous multi-frame user face images are extracted through the face animation, a plurality of key points of a human face are identified through a face recognition model, the key points are positioned to lips of the lip language user, then features in the lip image are extracted, the features are expressed through lip feature sequences, and then the sequences are input into the lip language identification model to identify pronunciation phonemes corresponding to the sequences. For Chinese lip language users, the pronunciation phonemes are the pinyin composed of initials and finals. Then, the spelling sequence is converted into natural language sentences with different combination modes, then the natural language sentences are input into a statistical language model for grading, the rationality and the smoothness of the natural language sentences are evaluated, a target natural language sentence is selected, and finally the target natural language sentence is played through a player. According to the embodiment of the invention, the lip expression sentences can be identified from the lip image data collected in real time, and are broadcasted, so that silent lip can be sounded.
Drawings
Fig. 1 is a schematic diagram of a first embodiment of a lip language recognition method based on image recognition in an embodiment of the present invention;
fig. 2 is a schematic diagram of a second embodiment of a lip language recognition method based on image recognition in an embodiment of the present invention;
Fig. 3 is a schematic diagram of a first embodiment of a lip language recognition device based on image recognition in an embodiment of the present invention;
fig. 4 is a schematic diagram of a second embodiment of a lip language recognition device based on image recognition in an embodiment of the present invention;
Fig. 5 is a schematic diagram of an embodiment of a lip language recognition apparatus based on image recognition in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a lip language identification method, device and equipment based on image identification and a storage medium. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, the following describes a specific flow of an embodiment of the present invention, referring to fig. 1, and a first embodiment of a method for identifying a lip language based on image identification in the embodiment of the present invention includes:
101. Collecting multi-frame face images of a lip language user in real time;
It can be understood that the execution subject of the present invention may be a lip language recognition device based on image recognition, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as an execution main body as an example.
In this embodiment, the user of the lip language refers to a user who needs to be identified with lip language expression information, and may be, for example, a speaker or a deaf-mute. Or a normal person who is not willing to disclose the voice information, for example, in some occasions needing confidential communication, the lip language user transmits the voice information which is wanted to be expressed to the receiver in a lip animation mode, and then the lip language information is decrypted at the receiver.
In this embodiment, the image capturing manner of the lip language user may be video recording, and a 2D camera may be installed for the lip language user in advance for capturing face video of the speaker. After the face video is obtained, the video can be output as a single frame image by using a video processing tool, such as AE, so that we can obtain continuous frame images of the lip language user for further identifying the lip language expression.
102. Sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
In this embodiment, the acquired face image may have different complex backgrounds, and in order to identify the lips of the face in the image, the face needs to be detected first. The key point positioning of the face can be performed through the face recognition model, a more complete key point detection model is designed by referring to the professional Dlib face recognition model based on computer vision, 98 key points of the face can be positioned by the model, the key points of 30 lip areas are increased compared with 68 key points identified by an original Dlib model, the key points of lips are more dense, and the accuracy of lip feature expression is improved. In addition, the key point detection model uses special labeling information for the lip key points, so that the key points of the lips can be rapidly positioned and extracted, and the efficiency of positioning the lips of the face is improved.
Optionally, step 102 specifically includes:
Inputting the face image data into a face recognition model in sequence to detect key points, so as to obtain face key points in the face images;
Determining the mouth angle key points in the face images according to the marking information corresponding to the face key points;
and determining lip areas corresponding to the face images according to the key points of the mouth angles of the face images, and performing screenshot to obtain lip area images corresponding to the face images.
In the alternative embodiment, the face recognition model is also called a key point detection model, is a face key point detection model trained through a pre-labeled face image, and has the core principle that the face is represented by using image Hog characteristics, and compared with other characteristic extraction operators, the face recognition model can keep good non-deformation on geometric and optical deformation of the image. The characteristic and the LBP characteristic are used together as three classical image characteristics, and the characteristic extraction operator is usually matched with a Support Vector Machine (SVM) algorithm to be used in an object detection scene. The face detection method realized by the face recognition model is based on the Hog characteristics of the image, and combines the face detection function realized by the support vector machine algorithm, and the approximate thought of the algorithm is as follows:
Extracting Hog characteristics from a positive sample (namely an image containing a human face) data set to obtain Hog characteristic descriptors; and extracting Hog characteristics from the negative sample (i.e. the image without the face) data set to obtain a Hog descriptor. The data volume in the negative sample data set is far greater than the sample number in the positive sample data set, and the negative sample image can be obtained by random cutting by using a picture without a human face; training positive and negative samples by using a support vector machine algorithm, wherein the positive and negative samples are obviously a classification problem, and a trained model can be obtained; negative sample refractory detection, namely refractory sample mining (hard-NEGTIVE MINING), is performed by using the model so as to improve the classification capability of the final model. The specific idea is as follows: continuously scaling the negative sample in the training set until the negative sample is matched with the template, searching and matching through a template sliding serial port (the process is a multi-scale detection process), and intercepting part of the image to be added into the negative sample if the classifier detects the non-face region by mistake; and (5) retraining the model by collecting the difficult sample, and repeatedly obtaining a final classification model.
And detecting the face picture by using the finally trained classifier, carrying out sliding scanning on different sizes of the picture, extracting Hog characteristics, and classifying by using the classifier. If the face is detected and judged, the face is calibrated, the face is necessarily calibrated for a plurality of times after one round of sliding scanning, and then the NMS is used for finishing the ending work.
103. Sequentially extracting features of lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
in this embodiment, after the lip region image is extracted, feature extraction is performed on the lip region image of each frame, and the lip region image is converted into a lip feature vector, and the lip feature sequence can be obtained by splicing the vectors according to the time sequence of the video. The lip feature sequence may be any vector set capable of expressing lip features, for example, in this embodiment, it is preferable to use the offset and rotation factor of the extracted lip image corresponding to the standard mouth as feature vectors, and in addition, the lip image when the lip user is silent may be extracted as a reference object, and the feature vectors of the lips are calculated.
Optionally, step 103 specifically includes:
Aligning the lip region image corresponding to each face image with a preset standard mouth image;
Calculating offset and rotation factors of the lip region images relative to the standard mouth image to obtain lip feature vectors corresponding to the lip region images;
and sequentially splicing lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
In this alternative embodiment, the standard mouth refers to an average mouth disclosed by different countries according to different standard specifications, and can represent mouth standards of different ethnic groups in each country. And calculating the deviation and rotation of the lips in the lip image, which are equivalent to the standard mouth, by taking the standard mouth as a reference object of the lip image, and taking the calculated result as the lip characteristic expression of the lip language user. In this alternative embodiment, the accuracy of the lip recognition by the lip reference does not greatly affect the accuracy of the lip recognition, and the training and application results can be made to have a consistent effect by only using the same reference when training the lip recognition model.
In this alternative embodiment, calculating the offset of the standard mouth is essentially a translation. The twiddle factor refers to a complex constant multiplied in the butterfly operation of the Cooley-Tukey fast fourier transform algorithm, so that the constant is located above a unit circle on a complex plane, and has a twiddle effect on a multiplicand on the complex plane, so that the twiddle factor is named.
104. Inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip user;
In this embodiment, a lip recognition model is built based on an end-to-end neural network, and the lip feature sequence is recognized as a pronunciation phoneme. The network uses the lip feature sequence as network input and the pronunciation phonemes as training targets. The lip language identification model is a seq2seq model comprising an encoder and a decoder, a model prediction result is obtained through a network mapping mode, and a loss function of the prediction result and a target result is calculated to judge whether the model is trained.
Optionally, step 104 specifically includes:
inputting the lip feature sequence into an encoder of the lip language identification model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
And inputting the second pronunciation vector into a decoder of the lip language identification model to carry out pronunciation mapping, so as to obtain a pronunciation phoneme sequence corresponding to the lip language user.
In this alternative embodiment, the lip recognition model is a seq2seq model comprising an encoder and a decoder, and the seq2seq model simply generates one input sequence x from another output sequence y. The seq2seq has many applications such as translation, document extraction, question-answering systems, etc. In translation, the input sequence is the text to be translated and the output sequence is the translated text; in a question-answering system, the input sequence is the question posed and the output sequence is the answer. The encoder converts the input sequence into a vector with fixed length; the decoder converts the fixed vector generated before into an output sequence.
105. Converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
In this embodiment, the statistical language model is a BiLSTM model formed by combining a forward LSTM (Long Short-Term Memory) and a backward LSTM, so that the reasonable degree of a natural language sentence can be calculated for a context word of a predicted sentence, for example, a commonly used expression of chinese is "you eat" and a natural language sentence obtained by converting a phoneme sequence is "rice eat", and then the score of the sentence in the statistical language model is lower than the score of the commonly used expression sentence, because in sample data of the statistical language model, the probability that a "meal" word is connected with a "eat" word is far lower than the probability that a "meal" word is connected with a "eat" word, and therefore, a natural language sentence which is most in line with a normal expression word sequence and semantics is predicted by a model scoring mode and is broadcasted as a target natural language sentence.
Optionally, step 105 specifically includes:
Inputting the respective natural sentences into the forward LSTM network according to the word sequence positive sequence to perform network calculation, so as to obtain a first prediction result of the respective natural sentences;
Inputting the respective natural sentences into the reverse LSTM network in reverse order according to word sequence to perform network calculation, so as to obtain a second prediction result of the respective natural sentences;
And calculating the average value of the first prediction result and the second prediction result to obtain the corresponding score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
In this alternative embodiment, the statistical language model may be regarded as a two-layer neural network, where the first layer is input from the left as a series of initial inputs, and the second layer is input from the right as a series of initial inputs, and the last word of the sentence is input, and the same processing as the first layer is performed in the opposite direction, and finally the two obtained results are averaged to obtain the score of the natural language sentence.
Optionally, step 105 further includes:
Each pronunciation phoneme in the pronunciation phoneme sequence is used as a query keyword, and a preset phoneme pronunciation mapping table is queried to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the IDs of the phonemes to obtain a plurality of words corresponding to the phonemes;
And combining words corresponding to each phoneme according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a text format.
In this alternative embodiment, the pronunciation phonemes are pronunciation sheets specified according to the pronunciation, for example, for chinese, phonemes refer to initials, finals, etc. The predicted phoneme sequence is converted into phonemes according to the phoneme pronunciation mapping table and the word mapping table implementation factor sequence ID, and is converted into corresponding words according to the phonemes. The process can be understood as a dictionary searching process, the ID is searched through the recognized Chinese pinyin, the ID can be understood as the page number of the dictionary, and the corresponding words are searched through the page number.
106. And converting the audio frequency of the target natural sentence to obtain lip language pronunciation and broadcasting.
In this embodiment, the device includes two modules, a speech synthesis module and a speech broadcasting module, where the speech synthesis module converts text into audio through a deep neural network technology, so as to implement a function of speaking text. At present, the voice synthesis technology at home and abroad provides a corresponding interface. The method can meet the use requirements of different scenes, such as languages of Chinese mandarin, english, japanese, korean and the like. The voice broadcasting module is used for broadcasting the synthesized audio stream. Mainly solves the problem that communication of users who cannot learn words is obstructed. There are many mature voice players on the market at present, and only the synthesized audio needs to be linked with the broadcaster. Or a voice broadcasting chip which can realize the voice broadcasting function based on the card reader is adopted. The chip can be customized according to the content in the card reader by a plurality of manufacturers.
In this embodiment, before broadcasting, the target natural language sentence may be translated, so as to meet the use requirements of users in various languages. A machine translation module may also be included that addresses communication barriers for users in different languages or different dialect regions of the same language. At present, enterprises with better development of machine translation technology at home and abroad have Chinese interfaces for external use. Foreign languages such as google have translation models that can also provide very small languages. The translation interface provided by the enterprise can basically cover the languages mainly used. The translation module can also be customized for language translation under special conditions.
In the embodiment of the invention, in order to enable a lip language user incapable of communicating own ideas through sound to break a communication barrier and make a sound in real time, the face animation of the lip language user needs to be acquired in real time, a plurality of continuous multi-frame user face images are extracted through the face animation, a plurality of key points of a human face are identified through a face recognition model, the key points are positioned to lips of the lip language user, then features in the lip image are extracted, the features are expressed through lip feature sequences, and then the sequences are input into a lip language identification model to identify pronunciation phonemes corresponding to the sequences. For Chinese lip language users, the pronunciation phonemes are the pinyin composed of initials and finals. Then, the spelling sequence is converted into natural language sentences with different combination modes, then the natural language sentences are input into a statistical language model for grading, the rationality and the smoothness of the natural language sentences are evaluated, a target natural language sentence is selected, and finally the target natural language sentence is played through a player. According to the embodiment of the invention, the lip expression sentences can be identified from the lip image data collected in real time, and are broadcasted, so that silent lip can be sounded.
Referring to fig. 2, a second embodiment of a lip language recognition method based on image recognition in an embodiment of the present invention includes:
201. Acquiring a plurality of lip region image samples with pronunciation phoneme labels;
202. Extracting lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples;
203. Initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks;
204. Inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
205. inputting the first sound vectors into each second RNN network of the decoder to perform sound mapping, so as to obtain sound phoneme prediction results corresponding to each first sound vector;
206. calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
207. judging whether the end-to-end neural network model is converged or not according to the model loss value;
208. And if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuously reversely inputting the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model.
In this embodiment, the training process of the lip language recognition model is a seq2seq model including an encoder and a decoder, where the encoder and the decoder both include several layers of RNN (Recurrent Neural Network, cyclic neural network) networks, which are recursive neural networks that use sequence (sequence) data as input, recursion in the evolution direction of the sequence, and all nodes (cyclic units) are chained.
In this embodiment, the encoder of the lip language recognition model is responsible for compressing the input sequence into a vector of a specified length, and this vector can be regarded as the semantic of the sequence, and the manner of obtaining the semantic vector is to directly use the hidden state of the last input as the semantic vector C. The last hidden state can be transformed to obtain a semantic vector, and all the hidden states of the input sequence can be transformed to obtain a semantic variable. The decoder is responsible for generating a specified sequence according to the semantic vector by inputting the semantic variable obtained by the encoder as an initial state into the RNN of the decoder to obtain an output sequence. It can be seen that the output at the previous time will be the input at the current time and that the semantic vector C only participates in operations as an initial state, the latter operations being independent of the semantic vector C.
In this embodiment, the RNN learns probability distribution of the semantic meaning of the lip-language image, and then predicts, so as to obtain probability distribution, a softmax activation function is generally used at the output layer of the RNN, so that probability of each category can be obtained. Softmax has very wide application in machine learning and deep learning, especially in dealing with multi-classification (C > 2) problems, where the final output unit of the classifier requires a Softmax function for numerical processing. The definition of the Softmax function is as follows:
Wherein v i is the output of the classifier front-stage output unit, i represents the class index, the total class number is C, and the ratio of the index of the current element to the sum of indexes of all elements is represented. Softmax converts multi-class output values into relative probabilities that are easier to understand and compare.
209. Collecting multi-frame face images of a lip language user in real time;
210. sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
211. sequentially extracting features of lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
212. Inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip user;
213. Converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
214. And converting the audio frequency of the target natural sentence to obtain lip language pronunciation and broadcasting.
In the embodiment of the invention, in order to improve the accuracy of lip language identification, a lip language identification neural network model is established and trained, and a training sample can be any video with complete face speaking as sample data, and the sample data is marked in a manual marking or direct white input mode. The training process is similar to the training of the end-to-end neural network model, the end-to-end neural network model can input original data at the input end, obtain marking data at the output end, and perform reverse training through calculation errors, so that a trained lip language recognition model is obtained. The embodiment of the invention can complete the training of the lip language identification model, so that the accuracy of lip language identification is improved.
The above describes the method for identifying the lip language based on the image identification in the embodiment of the present invention, and the following describes the device for identifying the lip language based on the image identification in the embodiment of the present invention, please refer to fig. 3, and the first embodiment of the device for identifying the lip language based on the image identification in the embodiment of the present invention includes:
the image acquisition module 301 is configured to acquire multiple frames of face images of a lip language user in real time;
The lip positioning module 302 is configured to sequentially perform key point detection and lip area positioning on each face image to obtain a lip area image corresponding to each face image;
The feature extraction module 303 is configured to sequentially perform feature extraction on the lip area images corresponding to the face images, so as to obtain a lip feature sequence of the lip language user;
the sequence recognition module 304 is configured to input the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and output a pronunciation phoneme sequence corresponding to the lip user;
The sentence scoring module 305 is configured to convert the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and score the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
and the lip broadcasting module 306 is configured to perform audio conversion on the target natural sentence, obtain lip pronunciation, and broadcast the lip pronunciation.
Optionally, the lip positioning module 302 is specifically configured to:
Inputting the face image data into a face recognition model in sequence to detect key points, so as to obtain face key points in the face images;
Determining the mouth angle key points in the face images according to the marking information corresponding to the face key points;
and determining lip areas corresponding to the face images according to the key points of the mouth angles of the face images, and performing screenshot to obtain lip area images corresponding to the face images.
Optionally, the feature extraction module 303 is specifically configured to:
Aligning the lip region image corresponding to each face image with a preset standard mouth image;
Calculating offset and rotation factors of the lip region images relative to the standard mouth image to obtain lip feature vectors corresponding to the lip region images;
and sequentially splicing lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, the sequence identifying module 304 is specifically configured to:
inputting the lip feature sequence into an encoder of the lip language identification model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
And inputting the second pronunciation vector into a decoder of the lip language identification model to carry out pronunciation mapping, so as to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, the statistical language model includes: the statement scoring module 305 is specifically configured to:
Inputting the respective natural sentences into the forward LSTM network according to the word sequence positive sequence to perform network calculation, so as to obtain a first prediction result of the respective natural sentences;
Inputting the respective natural sentences into the reverse LSTM network in reverse order according to word sequence to perform network calculation, so as to obtain a second prediction result of the respective natural sentences;
And calculating the average value of the first prediction result and the second prediction result to obtain the corresponding score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, the sentence scoring module 305 is further configured to:
Each pronunciation phoneme in the pronunciation phoneme sequence is used as a query keyword, and a preset phoneme pronunciation mapping table is queried to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the IDs of the phonemes to obtain a plurality of words corresponding to the phonemes;
And combining words corresponding to each phoneme according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a text format.
In the embodiment of the invention, in order to enable a lip language user incapable of communicating own ideas through sound to break a communication barrier and make a sound in real time, the face animation of the lip language user needs to be acquired in real time, a plurality of continuous multi-frame user face images are extracted through the face animation, a plurality of key points of a human face are identified through a face recognition model, the key points are positioned to lips of the lip language user, then features in the lip image are extracted, the features are expressed through lip feature sequences, and then the sequences are input into a lip language identification model to identify pronunciation phonemes corresponding to the sequences. For Chinese lip language users, the pronunciation phonemes are the pinyin composed of initials and finals. Then, the spelling sequence is converted into natural language sentences with different combination modes, then the natural language sentences are input into a statistical language model for grading, the rationality and the smoothness of the natural language sentences are evaluated, a target natural language sentence is selected, and finally the target natural language sentence is played through a player. According to the embodiment of the invention, the lip expression sentences can be identified from the lip image data collected in real time, and are broadcasted, so that silent lip can be sounded.
Referring to fig. 4, a second embodiment of a lip recognition apparatus based on image recognition according to an embodiment of the present invention includes:
the image acquisition module 301 is configured to acquire multiple frames of face images of a lip language user in real time;
The lip positioning module 302 is configured to sequentially perform key point detection and lip area positioning on each face image to obtain a lip area image corresponding to each face image;
The feature extraction module 303 is configured to sequentially perform feature extraction on the lip area images corresponding to the face images, so as to obtain a lip feature sequence of the lip language user;
the sequence recognition module 304 is configured to input the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and output a pronunciation phoneme sequence corresponding to the lip language user;
The sentence scoring module 305 is configured to convert the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and score the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
and the lip broadcasting module 306 is configured to perform audio conversion on the target natural sentence, obtain lip pronunciation, and broadcast the lip pronunciation.
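The final audio conversion and broadcasting step could be realised with any text-to-speech engine; the patent does not name one. Purely as an example, the sketch below uses the off-the-shelf pyttsx3 library, with the speaking rate chosen arbitrarily.

```python
import pyttsx3

def broadcast_sentence(target_sentence: str) -> None:
    """Convert the target natural sentence to audio and play it immediately."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)   # speaking rate (words per minute), arbitrary
    engine.say(target_sentence)
    engine.runAndWait()               # block until playback finishes

broadcast_sentence("你好")
```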
Optionally, the lip language recognition device based on image recognition further includes:
A sample acquiring module 307, configured to acquire a plurality of lip region image samples with pronunciation phoneme labels, and extract the lip feature sequences corresponding to the lip region image samples as training samples;
A model prediction module 308 for initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks; inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples; inputting the first pronunciation vectors into each second RNN network of the decoder to perform pronunciation mapping, so as to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
The loss calculation module 309 is configured to calculate a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample, so as to obtain a model loss value;
A model generating module 310, configured to determine whether the end-to-end neural network model converges according to the model loss value; and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuously reversely inputting the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model.
In the embodiment of the invention, in order to improve the accuracy of lip language identification, a lip language identification neural network model is established and trained. Any video containing a complete speaking face can be used as sample data, and the sample data is labeled either manually or by directly entering the spoken content. The training process is that of an end-to-end neural network model: raw data are fed in at the input end, the prediction at the output end is compared with the labeled data, and reverse training is performed according to the calculated error, so as to obtain a trained lip language recognition model. The embodiment of the invention thus completes the training of the lip language identification model and improves the accuracy of lip language identification.
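A possible training loop matching this description is sketched below, reusing the LipSeq2Seq class from the inference sketch together with PyTorch's built-in CTC loss. The batch layout, learning rate, and convergence test (average loss below a fixed threshold) are simplifying assumptions; the patent does not fix these details.

```python
import torch
import torch.nn as nn

def train_lip_model(model, train_batches, blank=0, lr=1e-3,
                    loss_threshold=0.1, max_epochs=50):
    """Toy CTC training loop for the encoder-decoder sketch above.

    train_batches: list of (features, feat_lens, targets, target_lens), where
    features has shape (batch, time, feat_dim) and targets is a 1-D tensor of
    concatenated phoneme labels for the whole batch.
    """
    ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        total = 0.0
        for features, feat_lens, targets, target_lens in train_batches:
            log_probs = model(features).log_softmax(-1).transpose(0, 1)  # (T, B, C)
            loss = ctc(log_probs, targets, feat_lens, target_lens)
            optim.zero_grad()
            loss.backward()           # propagate the prediction error backwards
            optim.step()
            total += loss.item()
        if total / max(len(train_batches), 1) < loss_threshold:
            break                     # treat sufficiently low loss as convergence
    return model                      # trained lip language recognition model
```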
The image recognition-based lip language recognition apparatus in the embodiment of the present invention is described above in detail from the perspective of modular functional entities with reference to Figs. 3 and 4; the image recognition-based lip language recognition device in the embodiment of the present invention is described below in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of an image recognition-based lip language recognition device according to an embodiment of the present invention. The image recognition-based lip language recognition device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the image recognition-based lip language recognition device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the image recognition-based lip language recognition device 500.
The image recognition-based lip language recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the device structure illustrated in Fig. 5 does not limit the image recognition-based lip language recognition device, which may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.
The invention also provides a lip language identification device based on image identification, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the lip language identification method based on image identification in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium or a volatile computer readable storage medium, in which instructions are stored; when the instructions are executed on a computer, they cause the computer to perform the steps of the lip language identification method based on image identification.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. The lip language identification method based on the image identification is characterized by comprising the following steps of:
collecting multi-frame face images of a lip language user in real time;
sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
Sequentially extracting features of lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
Inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user;
converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
performing audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting;
The step of sequentially carrying out key point detection and lip region positioning on the face images to obtain lip region images corresponding to the face images comprises the following steps:
Inputting the face image data into a face recognition model in sequence to detect key points, so as to obtain face key points in the face images;
Determining the mouth angle key points in the face images according to the marking information corresponding to the face key points;
According to the key points of the mouth angles of the face images, determining lip areas corresponding to the face images, and performing screenshot to obtain lip area images corresponding to the face images;
before the real-time acquisition of the multi-frame face images of the lip language user, the method further comprises the following steps:
acquiring a plurality of lip region image samples with pronunciation phoneme labels;
Extracting lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples;
Initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks;
Inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
inputting the first pronunciation vectors into each second RNN network of the decoder to perform pronunciation mapping, so as to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
Calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
Judging whether the end-to-end neural network model is converged or not according to the model loss value;
If the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuing to reversely input the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model;
Inputting the lip feature sequence into a preset lip recognition model for lip pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user comprises the following steps:
inputting the lip feature sequence into an encoder of the lip language identification model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
inputting the second pronunciation vector into a decoder of the lip language identification model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user;
the statistical language model includes a forward LSTM network and a reverse LSTM network; the step of converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence comprises the following steps:
inputting each natural sentence into the forward LSTM network in forward word order for network calculation, so as to obtain a first prediction result of each natural sentence;
Inputting each natural sentence into the reverse LSTM network in reverse word order for network calculation, so as to obtain a second prediction result of each natural sentence;
And calculating the average of the first prediction result and the second prediction result to obtain the score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
2. The method for recognizing lip language based on image recognition according to claim 1, wherein the sequentially extracting features of the lip area images corresponding to the face images to obtain a lip feature sequence of the lip language user comprises:
Aligning the lip region image corresponding to each face image with a preset standard mouth image;
Calculating offset and rotation factors of each lip region image relative to a standard mouth image to obtain lip feature vectors corresponding to each lip region image;
and sequentially splicing lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
3. The image recognition-based lip language recognition method of claim 1, wherein the converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format comprises:
Taking each pronunciation phoneme in the pronunciation phoneme sequence as a query keyword, and querying a preset phoneme pronunciation mapping table to obtain the phoneme ID corresponding to each pronunciation phoneme;
querying a preset word mapping table according to each phoneme ID to obtain a plurality of candidate words corresponding to each phoneme;
And combining the words corresponding to each phoneme according to the order of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in text format.
4. The lip language identification device based on image identification is characterized by comprising:
the image acquisition module is used for acquiring multi-frame face images of the lip language user in real time;
The lip positioning module is used for sequentially carrying out key point detection and lip area positioning on each face image to obtain a lip area image corresponding to each face image;
the feature extraction module is used for sequentially carrying out feature extraction on the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
the sequence recognition module is used for inputting the lip feature sequence into a preset lip recognition model to perform lip pronunciation recognition and outputting a pronunciation phoneme sequence corresponding to the lip language user;
The sentence scoring module is used for converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the plurality of natural sentences through a preset statistical language model to obtain a target natural sentence;
The lip broadcasting module is used for converting the audio frequency of the target natural sentence to obtain lip pronunciation and broadcasting the lip pronunciation;
The lip positioning module is also used for inputting the face image data into a face recognition model in sequence to detect key points so as to obtain face key points in the face images; determining the mouth angle key points in the face images according to the marking information corresponding to the face key points; according to the key points of the mouth angles of the face images, determining lip areas corresponding to the face images, and performing screenshot to obtain lip area images corresponding to the face images;
The sample training module is used for acquiring a plurality of lip region image samples with pronunciation phoneme labels; extracting lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples; initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder comprising a plurality of layers of first RNN networks and a decoder comprising a plurality of layers of second RNN networks; inputting the training samples into each first RNN network of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples; inputting the first pronunciation vectors into each second RNN network of the decoder to perform pronunciation mapping, so as to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector; calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value; judging whether the end-to-end neural network model is converged or not according to the model loss value; if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language identification model, otherwise, continuing to reversely input the pronunciation phoneme prediction result into the end-to-end neural network model, and updating network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language identification model;
The sample training module is also used for inputting the lip feature sequence into an encoder of the lip recognition model to perform pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence; inputting the second pronunciation vector into a decoder of the lip language identification model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user;
The statistical language model includes a forward LSTM network and a reverse LSTM network; the sentence scoring module is further used for inputting each natural sentence into the forward LSTM network in forward word order for network calculation to obtain a first prediction result of each natural sentence; inputting each natural sentence into the reverse LSTM network in reverse word order for network calculation to obtain a second prediction result of each natural sentence; and calculating the average of the first prediction result and the second prediction result to obtain the score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
5. A lip-language recognition apparatus based on image recognition, characterized in that the lip-language recognition apparatus based on image recognition comprises: a memory and at least one processor, the memory having instructions stored therein;
The at least one processor invokes the instructions in the memory to cause the image recognition-based lip language identification apparatus to perform the image recognition-based lip language identification method of any one of claims 1-3.
6. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the image recognition-based lip language recognition method of any one of claims 1-3.
CN202011635782.8A 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification Active CN112784696B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635782.8A CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635782.8A CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Publications (2)

Publication Number Publication Date
CN112784696A CN112784696A (en) 2021-05-11
CN112784696B true CN112784696B (en) 2024-05-10

Family

ID=75754891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635782.8A Active CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Country Status (1)

Country Link
CN (1) CN112784696B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435421B (en) * 2021-08-26 2021-11-05 湖南大学 Cross-modal attention enhancement-based lip language identification method and system
CN114596517A (en) * 2022-01-12 2022-06-07 北京云辰信通科技有限公司 Visual language identification method and related equipment
CN114842846A (en) * 2022-04-21 2022-08-02 歌尔股份有限公司 Method and device for controlling head-mounted equipment and computer readable storage medium
CN114821794A (en) * 2022-04-29 2022-07-29 上海商汤临港智能科技有限公司 Image processing method, model generation method, image processing apparatus, vehicle, and storage medium
CN116189271B (en) * 2023-04-20 2023-07-14 深圳曼瑞德科技有限公司 Data processing method and system based on intelligent watch identification lip language
CN116524418A (en) * 2023-07-03 2023-08-01 平安银行股份有限公司 Face and mouth recognition method, device and system and storage medium
CN117292437B (en) * 2023-10-13 2024-03-01 山东睿芯半导体科技有限公司 Lip language identification method, device, chip and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102375115B1 (en) * 2019-05-03 2022-03-17 구글 엘엘씨 Phoneme-Based Contextualization for Cross-Language Speech Recognition in End-to-End Models

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN112041924A (en) * 2018-05-18 2020-12-04 渊慧科技有限公司 Visual speech recognition by phoneme prediction
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Also Published As

Publication number Publication date
CN112784696A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
US10176804B2 (en) Analyzing textual data
US10515292B2 (en) Joint acoustic and visual processing
CN111046133A (en) Question-answering method, question-answering equipment, storage medium and device based on atlas knowledge base
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
CN108090099B (en) Text processing method and device
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN113268576B (en) Deep learning-based department semantic information extraction method and device
De Coster et al. Machine translation from signed to spoken languages: State of the art and challenges
CN111881297A (en) Method and device for correcting voice recognition text
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN113392265A (en) Multimedia processing method, device and equipment
CN115408488A (en) Segmentation method and system for novel scene text
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN113761377B (en) False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium
KR20190059185A (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN112084788B (en) Automatic labeling method and system for implicit emotion tendencies of image captions
Reddy et al. Indian sign language generation from live audio or text for tamil
CN110858268B (en) Method and system for detecting unsmooth phenomenon in voice translation system
CN112287690A (en) Sign language translation method based on conditional sentence generation and cross-modal rearrangement
Kane et al. Towards establishing a mute communication: An Indian sign language perspective
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant