CN112784696A - Lip language identification method, device, equipment and storage medium based on image identification - Google Patents


Info

Publication number: CN112784696A
Application number: CN202011635782.8A
Authority: CN (China)
Prior art keywords: lip, pronunciation, image, language, recognition
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN112784696B
Inventors: 周亚云, 马骏, 王少军
Current Assignee: Ping An Technology Shenzhen Co Ltd
Original Assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd; priority to CN202011635782.8A; publication of CN112784696A; application granted; publication of CN112784696B

Classifications

    • G06V 40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 40/216 — Handling natural language data; parsing using statistical methods
    • G06N 3/044 — Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 40/171 — Human faces; local features and components; facial parts, e.g. occluding parts such as glasses; geometrical relationships
    • G10L 13/02 — Speech synthesis; methods for producing synthetic speech; speech synthesisers
    • G10L 15/16 — Speech recognition; speech classification or search using artificial neural networks
    • G10L 15/26 — Speech recognition; speech-to-text systems


Abstract

The invention relates to the field of artificial intelligence and discloses a lip language identification method, device, equipment and storage medium based on image identification. The method comprises the following steps: collecting multiple frames of face images of a lip language user in real time and performing key point detection and lip region positioning to obtain the lip region image corresponding to each face image; sequentially extracting features from the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user; inputting the lip feature sequence into a preset lip language recognition model and outputting the pronunciation phoneme sequence corresponding to the lip language user; converting the pronunciation phoneme sequence into a plurality of natural sentences in text format and scoring these natural sentences with a preset statistical language model to obtain a target natural sentence; and performing audio conversion on the target natural sentence to obtain the lip language pronunciation and broadcasting it. The method can recognize lip language expression sentences from lip language image data collected in real time and broadcast them, so that silent lip language can be voiced in real time.

Description

Lip language identification method, device, equipment and storage medium based on image identification
Technical Field
The invention relates to the field of artificial intelligence, in particular to a lip language identification method, a device, equipment and a storage medium based on image identification.
Background
At present, a large number of people worldwide have lost their voice because of disease, accident or congenital defect and cannot communicate smoothly with others through speech. Generally, a person who has lost the ability to speak can hear and understand the speech of a normal person, but finds it difficult to express his or her own ideas and be understood; for example, normal communication is hardly possible when the other party has not received professional sign language or lip reading training, or when the voice-impaired person cannot write. Even when voice-impaired people can communicate through writing, the communication efficiency is low. Lip reading means interpreting words by watching the movement of the speaker's lips; it places no requirement on educational background, but does require professional lip reading ability.
To remove the need for people to be professionally trained in lip reading, machine learning and deep learning technologies can be used to train a machine to recognize human lip language expressions and broadcast them through a player, thereby achieving real-time, barrier-free communication between voice-impaired people and normal people. Most existing lip language recognition schemes comprise mouth detection, mouth segmentation, mouth normalization, feature extraction and lip language classifier construction. Their accuracy is only about 20%-60%, which is low; they do not provide a direct voice broadcasting function, and lip language recognition therefore remains at the level of error-prone text output.
Disclosure of Invention
The invention mainly aims to solve the technical problem that the existing lip language recognition modeling approach is simplistic, which limits recognition accuracy.
The invention provides a lip language identification method based on image identification, which comprises the following steps:
collecting multiple frames of face images of a lip language user in real time;
sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
sequentially extracting the features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
inputting the lip feature sequence into a preset lip language recognition model for lip language pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user;
converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
and carrying out audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting.
Optionally, in a first implementation manner of the first aspect of the present invention, the sequentially performing key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image includes:
sequentially inputting the face image data into a face recognition model for key point detection to obtain face key points in each face image;
determining the key points of the mouth corners in the face images according to the marking information corresponding to the key points of the faces;
and determining the lip region corresponding to each face image according to the mouth corner key points of the face image, and taking a screenshot to obtain the lip region image corresponding to each face image.
Optionally, in a second implementation manner of the first aspect of the present invention, the sequentially performing feature extraction on the lip region images corresponding to the face images to obtain the lip feature sequence of the lip language user includes:
aligning the lip region image corresponding to each face image with a preset standard mouth image;
calculating the offset and the rotation factor of each lip region image relative to a standard mouth image to obtain a lip feature vector corresponding to each lip region image;
and sequentially splicing the lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, in a third implementation manner of the first aspect of the present invention, before the acquiring, in real time, multiple frames of face images of a lip language user, the method further includes:
obtaining a plurality of lip region image samples with pronunciation phoneme labels;
extracting lip characteristic sequences corresponding to the lip region image samples and taking the lip characteristic sequences as training samples;
initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder and a decoder, the encoder comprising a number of layers of a first RNN network, the decoder comprising a number of layers of a second RNN network;
inputting the training samples into each first RNN of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
inputting the first pronunciation vector into each second RNN of the decoder to perform pronunciation mapping to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
judging whether the end-to-end neural network model converges according to the model loss value;
and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language recognition model, otherwise, continuously inputting the pronunciation phoneme prediction result to the end-to-end neural network model in a reverse direction, and updating the network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language recognition model.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the inputting the lip feature sequence into a preset lip speech recognition model for lip speech recognition, and outputting a pronunciation phoneme sequence corresponding to the lip user includes:
inputting the lip feature sequence into an encoder of the lip recognition model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
and inputting the second pronunciation vector into a decoder of the lip language recognition model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the statistical language model comprises a forward LSTM network and a reverse LSTM network, and the step of converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format and scoring the natural sentences through a preset statistical language model to obtain a target natural sentence comprises the following steps:
inputting the natural sentences into the forward LSTM network in a word sequence in a forward order for network calculation to obtain first prediction results of the natural sentences;
inputting the natural sentences into the reverse LSTM network in the reverse order of word order for network calculation to obtain second prediction results of the natural sentences;
and calculating the mean value of the first prediction result and the second prediction result to obtain the corresponding scores of the natural sentences, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format includes:
inquiring a preset phoneme pronunciation mapping table by taking each pronunciation phoneme in the pronunciation phoneme sequence as an inquiry keyword to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the phoneme IDs to obtain a plurality of words corresponding to the phonemes;
and combining the words corresponding to the phonemes according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a character format.
The second aspect of the present invention provides a lip language recognition apparatus based on image recognition, including:
the image acquisition module is used for acquiring multi-frame face images of the lip language users in real time;
the lip positioning module is used for sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
the feature extraction module is used for sequentially extracting features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
the sequence recognition module is used for inputting the lip feature sequence into a preset lip language recognition model to perform lip language pronunciation recognition and outputting a pronunciation phoneme sequence corresponding to the lip language user;
the sentence scoring module is used for converting the pronunciation phoneme sequence into a plurality of natural sentences in a character format and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
and the lip language broadcasting module is used for carrying out audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting.
Optionally, in a first implementation manner of the second aspect of the present invention, the lip positioning module is specifically configured to:
sequentially inputting the face image data into a face recognition model for key point detection to obtain face key points in each face image;
determining the key points of the mouth corners in the face images according to the marking information corresponding to the key points of the faces;
and determining the lip region corresponding to each face image according to the mouth corner key points of the face image, and taking a screenshot to obtain the lip region image corresponding to each face image.
Optionally, in a second implementation manner of the second aspect of the present invention, the feature extraction module is specifically configured to:
aligning the lip region image corresponding to each face image with a preset standard mouth image;
calculating the offset and the rotation factor of each lip region image relative to a standard mouth image to obtain a lip feature vector corresponding to each lip region image;
and sequentially splicing the lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, in a third implementation manner of the second aspect of the present invention, the lip language recognition apparatus based on image recognition further includes:
the system comprises a sample acquisition module, a phoneme labeling module and a phoneme labeling module, wherein the sample acquisition module is used for acquiring a plurality of lip region image samples with phoneme labeling; extracting lip characteristic sequences corresponding to the lip region image samples and taking the lip characteristic sequences as training samples;
a model prediction module for initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder and a decoder, the encoder comprising a number of layers of a first RNN network, the decoder comprising a number of layers of a second RNN network; inputting the training samples into each first RNN of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples; inputting the first pronunciation vector into each second RNN of the decoder to perform pronunciation mapping to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
the loss calculation module is used for calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
the model generation module is used for judging whether the end-to-end neural network model converges according to the model loss value; and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language recognition model, otherwise, continuously inputting the pronunciation phoneme prediction result to the end-to-end neural network model in a reverse direction, and updating the network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language recognition model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the sequence identification module is specifically configured to:
inputting the lip feature sequence into an encoder of the lip recognition model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
and inputting the second pronunciation vector into a decoder of the lip language recognition model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the statistical language model comprises a forward LSTM network and a reverse LSTM network, and the statement scoring module is specifically used for:
inputting the natural sentences into the forward LSTM network in a word sequence in a forward order for network calculation to obtain first prediction results of the natural sentences;
inputting the natural sentences into the reverse LSTM network in the reverse order of word order for network calculation to obtain second prediction results of the natural sentences;
and calculating the mean value of the first prediction result and the second prediction result to obtain the corresponding scores of the natural sentences, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the statement scoring module is further configured to:
inquiring a preset phoneme pronunciation mapping table by taking each pronunciation phoneme in the pronunciation phoneme sequence as an inquiry keyword to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the phoneme IDs to obtain a plurality of words corresponding to the phonemes;
and combining the words corresponding to the phonemes according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a character format.
A third aspect of the present invention provides a lip language identification device based on image identification, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor calls the instructions in the memory to cause the image recognition based lip language recognition device to execute the image recognition based lip language recognition method.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described lip language identification method based on image identification.
In the technical solution provided by the invention, in order to let a lip language user who cannot convey ideas through sound break the communication barrier and be voiced in real time, the facial animation of the lip language user is collected in real time, continuous multi-frame face images of the user are extracted from the animation, a number of face key points are identified by a face recognition model, the lips of the lip language user are located through these key points, features in the lip images are then extracted and expressed as a lip feature sequence, the sequence is input into the lip language recognition model, and the pronunciation phonemes corresponding to the sequence are identified. For a Chinese lip language user, the pronunciation phonemes are pinyin consisting of initials and finals. The pinyin sequence is then converted into natural language sentences of different combinations, which are scored by a statistical language model to evaluate their reasonableness and fluency; the target natural language sentence is selected and finally played through a player. The embodiment of the invention can recognize lip language expression sentences from the lip language image data collected in real time and broadcast them, so that silent lip language can be voiced.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a lip language identification method based on image identification according to an embodiment of the present invention;
FIG. 2 is a diagram of a lip language recognition method based on image recognition according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a first embodiment of a lip language recognition device based on image recognition according to an embodiment of the present invention;
FIG. 4 is a diagram of a lip language recognition device based on image recognition according to a second embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a lip language recognition device based on image recognition according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a lip language identification method, a device, equipment and a storage medium based on image identification. The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of a lip language identification method based on image identification in an embodiment of the present invention includes:
101. collecting multiple frames of face images of a lip language user in real time;
it is to be understood that the execution subject of the present invention may be a lip language recognition device based on image recognition, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
In this embodiment, the lip language user is a user whose lip language expression information needs to be recognized, for example a deaf or deaf-mute person. The lip language user may also be a normal person who does not want to reveal voice information: in situations that require confidential communication, for example, the lip language user transmits the voice information to be expressed to the receiving party in the form of lip animation, and the lip language information is then decoded at the receiving party.
In this embodiment, the images of the lip language user may be acquired by video recording; a 2D camera can be installed in advance to capture the face video of the speaker. After the face video is acquired, a video processing tool such as AE can output the video as single-frame images, so that continuous frame images of the lip language user are obtained for further lip language recognition.
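As an illustration only (OpenCV, the camera index and the frame count below are assumptions, not requirements of this embodiment), the frame acquisition described above could be sketched as follows:

import cv2

def capture_face_frames(source=0, max_frames=75):
    """Grab consecutive face frames from a camera or a recorded video (illustrative sketch)."""
    cap = cv2.VideoCapture(source)      # 0 = default 2D camera, or a path to the recorded face video
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()          # read one frame
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames                       # continuous single-frame images of the lip language user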
102. Sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
In this embodiment, the acquired face images may contain various complex backgrounds, so the face must be detected first before the lips can be identified in the image. A more complete key point detection model is designed in this embodiment with reference to a professional computer-vision face recognition model; it can locate 98 key points of a human face. Compared with the 68 key points identified by the original face model, 30 key points are added in the lip region, so the lip key points are denser and the precision of lip feature expression is improved. In addition, the key point detection model uses dedicated labeling information for the lip key points, so the lip key points can be quickly located and extracted, improving the efficiency of lip positioning.
Optionally, step 102 specifically includes:
sequentially inputting the face image data into a face recognition model for key point detection to obtain face key points in each face image;
determining the key points of the mouth corners in the face images according to the marking information corresponding to the key points of the faces;
and determining the lip region corresponding to each face image according to the mouth corner key points of the face image, and taking a screenshot to obtain the lip region image corresponding to each face image.
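A minimal sketch of the three sub-steps above is given below; the key point format and the lip/mouth-corner indices are assumptions that depend on the concrete key point detection model used:

import numpy as np

def crop_lip_region(image, keypoints, lip_indices, margin=10):
    """Cut the lip region out of a face image from detected key points (illustrative sketch).

    keypoints is an (N, 2) array of face key points; lip_indices lists which of them
    belong to the mouth corners / lip contour (model-specific, assumed here).
    """
    lip_pts = np.asarray(keypoints)[lip_indices]
    x_min, y_min = (lip_pts.min(axis=0) - margin).astype(int)
    x_max, y_max = (lip_pts.max(axis=0) + margin).astype(int)
    h, w = image.shape[:2]
    x_min, y_min = max(x_min, 0), max(y_min, 0)
    x_max, y_max = min(x_max, w), min(y_max, h)
    return image[y_min:y_max, x_min:x_max]      # the "screenshot" of the lip region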
In this optional embodiment, the face recognition model, also called the key point detection model, is a face key point detection model trained on pre-labeled face images. Its core principle is to represent the face with HOG (Histogram of Oriented Gradients) features of the image; compared with other feature extraction operators, HOG features maintain good invariance to geometric and photometric deformations of the image. HOG, LBP and Haar features are regarded as the three classical image features, and such feature extraction operators are usually combined with the Support Vector Machine (SVM) algorithm in object detection scenarios. The face detection method realized by the face recognition model is based on the HOG features of the image combined with a support vector machine classifier, and the general idea of the algorithm is as follows:
Extract HOG features from the positive sample data set (i.e., images containing a human face) to obtain HOG feature descriptors, and extract HOG features from the negative sample data set (i.e., images without a human face) to obtain HOG descriptors. The number of samples in the negative sample data set is far larger than that in the positive sample data set, and negative sample images can be obtained by randomly cropping pictures that contain no face. Train a support vector machine on the positive and negative samples (clearly a binary classification problem) to obtain a trained model. Then perform hard negative mining with this model to improve the classification ability of the final model. The specific idea is as follows: continuously scale the negative samples in the training set until they match the template size, and search and match with a sliding window (a multi-scale detection process); whenever the classifier falsely detects a non-face region, crop that sub-image and add it to the negative samples; collect these hard samples, retrain the model, and repeat the above steps to obtain the final classification model.
Finally, detect faces with the trained classifier: slide-scan the picture at different scales, extract HOG features, and classify with the classifier. When a region is judged to be a face, it is marked; since the same face is inevitably marked multiple times during one sliding scan, non-maximum suppression (NMS) is used to clean up the final result.
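The following sketch illustrates this HOG + SVM idea with scikit-image and scikit-learn; the patch size, HOG parameters and sliding step are assumptions, and multi-scale scanning and NMS are omitted for brevity:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(gray_patch):
    """HOG feature descriptor of a fixed-size grayscale patch (parameters are assumptions)."""
    return hog(gray_patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_face_classifier(pos_patches, neg_patches):
    """Train the HOG + linear SVM face/non-face classifier (binary classification)."""
    X = np.array([hog_descriptor(p) for p in pos_patches + neg_patches])
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    clf = LinearSVC()
    clf.fit(X, y)
    return clf

def mine_hard_negatives(clf, face_free_images, window=(64, 64), step=16):
    """Hard negative mining: collect windows that the classifier wrongly labels as 'face'."""
    hard = []
    wh, ww = window
    for img in face_free_images:                               # images known to contain no face
        for y in range(0, img.shape[0] - wh, step):
            for x in range(0, img.shape[1] - ww, step):
                patch = img[y:y + wh, x:x + ww]
                if clf.predict([hog_descriptor(patch)])[0] == 1:
                    hard.append(patch)                          # false positive -> new negative sample
    return hard

The hard negatives returned by mine_hard_negatives are added to the negative sample set and the classifier is retrained, exactly as in the procedure above.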
103. Sequentially extracting the features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
In this embodiment, after the lip region images are extracted, feature extraction is performed on each frame of lip region image to convert it into a lip feature vector, and the lip feature sequence is obtained by concatenating these vectors in the temporal order of the video. The lip feature sequence may be any vector set that can express lip features. In this embodiment it is preferred to use the offset and the rotation factor of the extracted lip image relative to a standard mouth as the feature vector; alternatively, the lip image captured while the lip language user is silent can be used as the reference object for computing the lip feature vector.
Optionally, step 103 specifically includes:
aligning the lip region image corresponding to each face image with a preset standard mouth image;
calculating the offset and the rotation factor of each lip region image relative to a standard mouth image to obtain a lip feature vector corresponding to each lip region image;
and sequentially splicing the lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
In this alternative embodiment, the standard mouth refers to an average mouth published by different countries according to their respective standard specifications, which can represent the mouth standard of different populations in each country. Taking the standard mouth as the reference object of the lip image, the offset and rotation of the lips in the lip image relative to the standard mouth are calculated, and the result is used to express the lip features of the lip language user. In this alternative embodiment, the choice of lip reference does not have much influence on the accuracy of lip language recognition; as long as the same reference is used when training the lip language recognition model, the training and application results remain consistent.
In this alternative embodiment, the offset relative to the standard mouth is essentially the amount of translation. The rotation factor (twiddle factor) refers to the complex constant multiplied in the butterfly operation of the Cooley-Tukey fast Fourier transform algorithm; this constant lies on the unit circle in the complex plane and has a rotating effect on the multiplicand in the complex plane, hence its name.
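One concrete way to realize the offset and rotation factor is a centroid offset plus a least-squares rotation (Kabsch/Procrustes alignment) between the detected lip key points and the standard mouth key points; the formula below is an illustrative assumption, not the patent's prescribed computation:

import numpy as np

def lip_feature_vector(lip_points, standard_points):
    """Offset and rotation of the detected lip relative to a standard mouth (sketch).

    Both inputs are (K, 2) arrays of corresponding lip key points.
    """
    lip = np.asarray(lip_points, dtype=float)
    std = np.asarray(standard_points, dtype=float)
    offset = lip.mean(axis=0) - std.mean(axis=0)         # translation component
    a = lip - lip.mean(axis=0)
    b = std - std.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    if np.linalg.det(vt.T @ u.T) < 0:                    # guard against a reflection
        vt[-1] *= -1
    r = vt.T @ u.T                                       # rotation aligning the lip onto the standard mouth
    angle = np.arctan2(r[1, 0], r[0, 0])                 # rotation expressed as an angle
    return np.concatenate([offset, [angle]])             # one lip feature vector

def lip_feature_sequence(per_frame_lip_points, standard_points):
    """Concatenate per-frame feature vectors in acquisition order to form the lip feature sequence."""
    return np.stack([lip_feature_vector(p, standard_points) for p in per_frame_lip_points])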
104. Inputting the lip feature sequence into a preset lip language recognition model for lip language pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user;
In this embodiment, the lip language recognition model is built on an end-to-end neural network and recognizes the lip feature sequence as pronunciation phonemes. The network takes lip feature sequences as input and pronunciation phonemes as training targets. The lip language recognition model is a seq2seq model comprising an encoder and a decoder; the model prediction result is obtained through network mapping, and the loss function between the prediction result and the target result is calculated to judge whether the model has been trained sufficiently.
Optionally, step 104 specifically includes:
inputting the lip feature sequence into an encoder of the lip recognition model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
and inputting the second pronunciation vector into a decoder of the lip language recognition model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user.
In this alternative embodiment, the lip language recognition model is a seq2seq model including an encoder and a decoder. A seq2seq model simply generates an output sequence y from an input sequence x. seq2seq has many applications, such as translation, document summarization and question-answering systems: in translation, the input sequence is the text to be translated and the output sequence is the translated text; in a question-answering system, the input sequence is the question posed and the output sequence is the answer. The encoder converts the input sequence into a fixed-length vector, and the decoder converts that fixed-length vector into the output sequence.
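A minimal sketch of such an encoder-decoder, using GRU layers as the RNNs (the feature dimension, hidden size and phoneme vocabulary size are placeholder assumptions), might look as follows:

import torch
import torch.nn as nn

class LipSeq2Seq(nn.Module):
    """Encoder-decoder sketch: lip feature sequence -> pronunciation phoneme logits."""
    def __init__(self, feat_dim=3, hidden=256, n_phonemes=70, layers=2):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, lip_seq):                          # lip_seq: (batch, time, feat_dim)
        enc_out, enc_state = self.encoder(lip_seq)       # pronunciation encoding
        dec_out, _ = self.decoder(enc_out, enc_state)    # pronunciation mapping, seeded by the encoder state
        return self.out(dec_out)                         # per-step phoneme logits

# Usage: greedy decoding of the pronunciation phoneme sequence.
# logits = LipSeq2Seq()(torch.randn(1, 75, 3))
# phoneme_ids = logits.argmax(dim=-1)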
105. Converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
In this embodiment, the statistical language model is a BiLSTM model formed by combining a forward LSTM (Long Short-Term Memory) network and a backward LSTM network, and it can calculate how reasonable a natural language sentence is. For example, the common way to phrase a question in Chinese is "have you eaten"; if the natural language sentence obtained by phoneme sequence conversion is a scrambled variant such as "eaten you have", its score under the statistical language model is lower than the score of the common expression.
Optionally, step 105 specifically includes:
inputting the natural sentences into the forward LSTM network in a word sequence in a forward order for network calculation to obtain first prediction results of the natural sentences;
inputting the natural sentences into the reverse LSTM network in the reverse order of word order for network calculation to obtain second prediction results of the natural sentences;
and calculating the mean value of the first prediction result and the second prediction result to obtain the corresponding scores of the natural sentences, and taking the natural sentence with the highest score as the target natural sentence.
In this alternative embodiment, the statistical language model may be regarded as a two-layer neural network. The first layer reads the sentence from the left, starting at the first word of the sentence; the second layer reads the sentence from the right, starting at the last word, and performs the same processing as the first layer in the reverse direction. Finally, the average of the two results is calculated to obtain the score of the natural language sentence.
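A sketch of this two-direction scoring is shown below; the vocabulary size, hidden size and the simple linear scoring head are assumptions, since this embodiment does not fix them:

import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Score a candidate sentence with a forward LSTM and a backward LSTM (sketch)."""
    def __init__(self, vocab_size=5000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.fwd_lstm = nn.LSTM(emb, hidden, batch_first=True)   # reads the sentence left to right
        self.bwd_lstm = nn.LSTM(emb, hidden, batch_first=True)   # reads the reversed sentence
        self.head = nn.Linear(hidden, 1)

    def forward(self, word_ids):                                  # word_ids: (batch, seq_len)
        x = self.embed(word_ids)
        fwd_out, _ = self.fwd_lstm(x)                             # first prediction result
        bwd_out, _ = self.bwd_lstm(torch.flip(x, dims=[1]))       # second prediction result
        fwd_score = self.head(fwd_out[:, -1])
        bwd_score = self.head(bwd_out[:, -1])
        return (fwd_score + bwd_score) / 2                        # mean of the two results

# The candidate natural sentence with the highest score is taken as the target natural sentence.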
Optionally, step 105 further includes:
inquiring a preset phoneme pronunciation mapping table by taking each pronunciation phoneme in the pronunciation phoneme sequence as an inquiry keyword to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the phoneme IDs to obtain a plurality of words corresponding to the phonemes;
and combining the words corresponding to the phonemes according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a character format.
In this alternative embodiment, the pronunciation unit is determined by the pronunciation; for Chinese, for example, a phoneme refers to an initial consonant, a final (vowel), and so on. Using the phoneme pronunciation mapping table and the word mapping table, the predicted phoneme sequence can be converted into phoneme IDs, and the phonemes can then be converted into the corresponding words. The process can be understood as looking up a dictionary: the ID is queried from the recognized Chinese pinyin, the ID can be regarded as a page number of the dictionary, and the corresponding words are found through that page number.
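The dictionary-lookup idea can be sketched as follows; the table entries are purely hypothetical examples, since the real mapping tables are part of the deployed dictionary:

# Hypothetical mapping tables (illustrative only).
PHONEME_TO_ID = {"ni": 1, "hao": 2}                 # pronunciation phoneme -> phoneme ID ("page number")
ID_TO_WORDS = {1: ["你", "泥"], 2: ["好", "号"]}      # phoneme ID -> candidate words on that "page"

def phonemes_to_sentences(phoneme_seq):
    """Query the phoneme pronunciation table, then the word table, and combine in sequence order."""
    candidates = [[]]
    for phoneme in phoneme_seq:
        phoneme_id = PHONEME_TO_ID[phoneme]          # look up the "page number"
        words = ID_TO_WORDS[phoneme_id]              # look up the candidate words
        candidates = [prefix + [w] for prefix in candidates for w in words]
    return ["".join(c) for c in candidates]          # candidate natural sentences in text format

# phonemes_to_sentences(["ni", "hao"]) yields several candidates (e.g. "你好", "你号", ...),
# which are then ranked by the statistical language model described above.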
106. And carrying out audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting.
In this embodiment, the voice broadcasting system comprises two modules, a speech synthesis module and a voice broadcasting module. The speech synthesis module converts text into audio through deep neural network technology, so that the text can "speak". Existing speech synthesis technologies at home and abroad provide corresponding interfaces and can meet the requirements of different scenarios and languages, such as Mandarin Chinese, English, Japanese and Korean. The voice broadcasting module broadcasts the synthesized audio stream; it mainly solves the communication obstacle of users who cannot read or write. There are many mature voice players on the market, and it is only necessary to feed the synthesized audio to the player. Alternatively, a voice broadcasting chip based on a card reader can realize the voice broadcasting function, and many chip manufacturers can customize such a chip according to the content in the card reader.
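As one illustrative way to link the synthesized text to a player (the embodiment does not prescribe a specific engine), an offline text-to-speech library such as pyttsx3 could be used:

import pyttsx3

def broadcast_sentence(text):
    """Convert the target natural sentence into audio and broadcast it (illustrative sketch)."""
    engine = pyttsx3.init()        # initialize an offline text-to-speech engine
    engine.say(text)               # queue the lip language sentence for speaking
    engine.runAndWait()            # synthesize and play the audio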
In this embodiment, before broadcasting, the target natural language sentence may also be translated to meet the needs of users of various languages. The system may therefore include a machine translation module, which removes the communication obstacles between users of different languages, or of different dialects of the same language. At present, enterprises at home and abroad with well-developed machine translation technology expose Chinese translation interfaces for external use, and providers abroad such as Google also offer translation models for a very large number of languages; the interfaces these enterprises provide basically cover the commonly used languages. For special cases, a customized translation module can also be used.
In the embodiment of the invention, in order to let a lip language user who cannot convey ideas through sound break the communication barrier and be voiced in real time, the facial animation of the lip language user is collected in real time, continuous multi-frame face images of the user are extracted from the animation, a number of face key points are identified by a face recognition model, the lips of the lip language user are located through these key points, features in the lip images are then extracted and expressed as a lip feature sequence, the sequence is input into the lip language recognition model, and the pronunciation phonemes corresponding to the sequence are identified. For a Chinese lip language user, the pronunciation phonemes are pinyin consisting of initials and finals. The pinyin sequence is then converted into natural language sentences of different combinations, which are scored by a statistical language model to evaluate their reasonableness and fluency; the target natural language sentence is selected and finally played through a player. The embodiment of the invention can recognize lip language expression sentences from the lip language image data collected in real time and broadcast them, so that silent lip language can be voiced.
Referring to fig. 2, a second embodiment of the lip language identification method based on image identification according to the embodiment of the present invention includes:
201. obtaining a plurality of lip region image samples with pronunciation phoneme labels;
202. extracting lip characteristic sequences corresponding to the lip region image samples and taking the lip characteristic sequences as training samples;
203. initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder and a decoder, the encoder comprising a number of layers of a first RNN network, the decoder comprising a number of layers of a second RNN network;
204. inputting the training samples into each first RNN of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
205. inputting the first pronunciation vector into each second RNN of the decoder to perform pronunciation mapping to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
206. calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
207. judging whether the end-to-end neural network model converges according to the model loss value;
208. and if the end-to-end neural network model converges, taking the end-to-end neural network model as a lip language recognition model, otherwise, continuously inputting the pronunciation phoneme prediction result to the end-to-end neural network model in a reverse direction, and updating the network parameters of the end-to-end neural network model until the end-to-end neural network model converges to obtain the lip language recognition model.
This embodiment describes the training process of the lip language recognition model. The lip language recognition model is a seq2seq model including an encoder and a decoder, both of which contain several layers of RNNs (Recurrent Neural Networks). An RNN is a type of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain.
In this embodiment, the encoder of the lip language recognition model is responsible for compressing the input sequence into a vector of specified length, which can be regarded as the semantics of the sequence. The semantic vector C can be obtained by directly using the hidden state of the last input, by transforming the last hidden state, or by transforming all hidden states of the input sequence. The decoder is responsible for generating the specified sequence from the semantic vector: the semantic vector obtained by the encoder is used as the initial state and fed into the decoder RNN to obtain the output sequence. In this mode, the output at the previous time step is used as the input at the current time step, and the semantic vector C participates only as the initial state; the subsequent computations no longer involve C.
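Step 206 can be realized, for example, with PyTorch's built-in CTC loss; the tensor shapes below are assumptions made for illustration:

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)

def ctc_model_loss(phoneme_logits, target_phonemes, input_lengths, target_lengths):
    """phoneme_logits: (batch, time, n_phonemes) raw decoder outputs; targets are phoneme IDs."""
    log_probs = phoneme_logits.log_softmax(dim=-1).transpose(0, 1)   # CTC expects (time, batch, classes)
    return ctc(log_probs, target_phonemes, input_lengths, target_lengths)

# If the model loss value indicates convergence, training stops; otherwise the error is
# back-propagated and the network parameters of the end-to-end model are updated (steps 207-208).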
In this embodiment, the RNN learns the probability distribution over the semantics of the lip language images and then makes predictions. To obtain this probability distribution, a softmax activation function is used in the output layer of the RNN to get the probability of each class. Softmax is very widely used in machine learning and deep learning, especially for multi-class (C > 2) problems, where the final output units of the classifier need to be processed numerically by the Softmax function. The Softmax function is defined as follows:
S_i = \frac{e^{v_i}}{\sum_{j=1}^{C} e^{v_j}}

where v_i denotes the output of the preceding-stage output unit of the classifier, i is the category index, C is the total number of categories, and S_i represents the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax translates the outputs of multiple classes into relative probabilities, which are easier to understand and compare.
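For reference, a numerically stable implementation of this Softmax function is:

import numpy as np

def softmax(v):
    """Softmax over the classifier outputs v (last axis indexes the C categories)."""
    v = v - v.max(axis=-1, keepdims=True)       # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum(axis=-1, keepdims=True)    # S_i = e^{v_i} / sum_j e^{v_j}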
209. Collecting multiple frames of face images of a lip language user in real time;
210. sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
211. sequentially extracting the features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
212. inputting the lip feature sequence into a preset lip language recognition model for lip language pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user;
213. converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
214. and carrying out audio conversion on the target natural sentence to obtain lip language pronunciation and broadcasting.
In the embodiment of the invention, in order to improve the accuracy of lip language recognition, a lip language recognition neural network model is established and trained. The training samples can be any videos containing complete facial speech, labeled either manually or by directly entering the spoken dialogue. The training process is that of an end-to-end neural network model: raw data is fed to the input end, label data is obtained at the output end, and the error is calculated and back-propagated for training, so that a trained lip language recognition model is obtained. The embodiment of the invention can complete the training of the lip language recognition model and improve the accuracy of lip language recognition.
With reference to fig. 3, the lip language identification device based on image recognition in the embodiment of the present invention is described as follows, and the first embodiment of the lip language identification device based on image recognition in the embodiment of the present invention includes:
the image acquisition module 301 is used for acquiring multi-frame face images of the lip language users in real time;
a lip positioning module 302, configured to perform key point detection and lip region positioning on each face image in sequence to obtain a lip region image corresponding to each face image;
a feature extraction module 303, configured to perform feature extraction on the lip region images corresponding to the face images in sequence to obtain a lip feature sequence of the lip language user;
the sequence recognition module 304 is configured to input the lip feature sequence into a preset lip language recognition model to perform lip language pronunciation recognition, and output a pronunciation phoneme sequence corresponding to the lip language user;
a sentence scoring module 305, configured to convert the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and score the plurality of natural sentences through a preset statistical language model to obtain target natural sentences;
and the lip language broadcasting module 306 is configured to perform audio conversion on the target natural sentence to obtain lip language pronunciation and broadcast the lip language pronunciation.
Optionally, the lip positioning module 302 is specifically configured to:
sequentially inputting the face image data into a face recognition model for key point detection to obtain face key points in each face image;
determining the key points of the mouth corners in the face images according to the marking information corresponding to the key points of the faces;
and determining the lip region corresponding to each face image according to the mouth corner key points of the face image, and taking a screenshot to obtain the lip region image corresponding to each face image.
Optionally, the feature extraction module 303 is specifically configured to:
aligning the lip region image corresponding to each face image with a preset standard mouth image;
calculating the offset and the rotation factor of each lip region image relative to a standard mouth image to obtain a lip feature vector corresponding to each lip region image;
and sequentially splicing the lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
Optionally, the sequence identification module 304 is specifically configured to:
inputting the lip feature sequence into an encoder of the lip recognition model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
and inputting the second pronunciation vector into a decoder of the lip language recognition model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user.
Optionally, the statistical language model comprises a forward LSTM network and a reverse LSTM network, and the statement scoring module 305 is specifically configured to:
inputting the natural sentences into the forward LSTM network in a word sequence in a forward order for network calculation to obtain first prediction results of the natural sentences;
inputting the natural sentences into the reverse LSTM network in the reverse order of word order for network calculation to obtain second prediction results of the natural sentences;
and calculating the mean value of the first prediction result and the second prediction result to obtain the corresponding scores of the natural sentences, and taking the natural sentence with the highest score as the target natural sentence.
Optionally, the statement scoring module 305 is further configured to:
inquiring a preset phoneme pronunciation mapping table by taking each pronunciation phoneme in the pronunciation phoneme sequence as an inquiry keyword to obtain a phoneme ID corresponding to each pronunciation phoneme;
inquiring a preset word mapping table according to the phoneme IDs to obtain a plurality of words corresponding to the phonemes;
and combining the words corresponding to the phonemes according to the arrangement sequence of the pronunciation phoneme sequence to obtain a plurality of natural language sentences in a character format.
In the embodiment of the invention, in order to let a lip language user who cannot convey ideas through sound break the communication barrier and be voiced in real time, the facial animation of the lip language user is collected in real time, continuous multi-frame face images of the user are extracted from the animation, a number of face key points are identified by a face recognition model, the lips of the lip language user are located through these key points, features in the lip images are then extracted and expressed as a lip feature sequence, the sequence is input into the lip language recognition model, and the pronunciation phonemes corresponding to the sequence are identified. For a Chinese lip language user, the pronunciation phonemes are pinyin consisting of initials and finals. The pinyin sequence is then converted into natural language sentences of different combinations, which are scored by a statistical language model to evaluate their reasonableness and fluency; the target natural language sentence is selected and finally played through a player. The embodiment of the invention can recognize lip language expression sentences from the lip language image data collected in real time and broadcast them, so that silent lip language can be voiced.
Referring to fig. 4, a second embodiment of the lip language recognition apparatus based on image recognition according to the embodiment of the present invention includes:
the image acquisition module 301 is used for acquiring multi-frame face images of the lip language users in real time;
a lip positioning module 302, configured to perform key point detection and lip region positioning on each face image in sequence to obtain a lip region image corresponding to each face image;
a feature extraction module 303, configured to perform feature extraction on the lip region images corresponding to the face images in sequence to obtain a lip feature sequence of the lip language user;
the sequence recognition module 304 is configured to input the lip feature sequence into a preset lip language recognition model to perform lip language pronunciation recognition, and output a pronunciation phoneme sequence corresponding to the lip language user;
a sentence scoring module 305, configured to convert the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and score the plurality of natural sentences through a preset statistical language model to obtain target natural sentences;
and the lip language broadcasting module 306 is configured to perform audio conversion on the target natural sentence to obtain lip language pronunciation and broadcast the lip language pronunciation.
Optionally, the lip language recognition apparatus based on image recognition further includes:
a sample obtaining module 307, configured to obtain a plurality of lip region image samples with pronunciation phoneme labels, extract the lip feature sequences corresponding to the lip region image samples, and take the lip feature sequences as training samples;
a model prediction module 308 configured to initialize an end-to-end neural network model with initial network parameters, the end-to-end neural network model including: an encoder and a decoder, the encoder comprising a number of layers of a first RNN network, the decoder comprising a number of layers of a second RNN network; inputting the training samples into each first RNN of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples; inputting the first pronunciation vector into each second RNN of the decoder to perform pronunciation mapping to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
a loss calculating module 309, configured to calculate a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample, so as to obtain a model loss value;
a model generating module 310, configured to determine whether the end-to-end neural network model converges according to the model loss value; if the end-to-end neural network model converges, take the end-to-end neural network model as the lip language recognition model; otherwise, feed the pronunciation phoneme prediction result back into the end-to-end neural network model in the reverse direction and update the network parameters of the end-to-end neural network model until it converges, so as to obtain the lip language recognition model.
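A compact training-loop sketch corresponding to the model prediction, loss calculation and model generation modules is given below (PyTorch, not part of the original disclosure). Only the CTC loss and the converge-or-continue-training logic follow the description above; the optimizer, learning rate and the loss-difference convergence criterion are assumptions, and `model` stands for any network mapping lip-feature batches to per-frame phoneme logits, such as the encoder-decoder sketched earlier.

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, num_epochs=50, tol=1e-3):
    """`loader` is assumed to yield (feats, feat_lens, targets, target_lens),
    where feats is a (B, T, D) lip-feature batch and targets are phoneme ids."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev = float("inf")
    for epoch in range(num_epochs):
        total = 0.0
        for feats, feat_lens, targets, target_lens in loader:
            log_probs = model(feats).log_softmax(dim=-1).permute(1, 0, 2)  # (T, B, C)
            loss = ctc(log_probs, targets, feat_lens, target_lens)         # model loss value
            optim.zero_grad()
            loss.backward()          # back-propagate the prediction error
            optim.step()             # update the network parameters
            total += loss.item()
        # Convergence check on the epoch loss (the exact criterion is an assumption).
        if abs(prev - total) < tol:
            return model             # converged: use as the lip language recognition model
        prev = total
    return model
```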
In the embodiment of the invention, in order to improve the accuracy of lip language recognition, a lip language recognition neural network model is established and trained. The training samples may be taken from any video containing a complete view of the face while speaking, and are labeled either manually or by directly entering the spoken dialogue. The training follows the usual end-to-end scheme: the end-to-end neural network model receives raw data at its input, produces label data at its output, and is trained backwards from the calculated error, so that a trained lip language recognition model is obtained. The embodiment of the invention can thus complete the training of the lip language recognition model and improve the accuracy of lip language recognition.
Fig. 3 and 4 above describe the lip language recognition device based on image recognition in the embodiment of the present invention in detail from the perspective of modular functional entities; the lip language recognition device based on image recognition in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of an image recognition-based lip language recognition device 500 according to an embodiment of the present invention. The device 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations on the image recognition-based lip language recognition device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the image recognition-based lip language recognition device 500.
The image recognition-based lip language recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will appreciate that the structure shown in Fig. 5 does not constitute a limitation of the image recognition-based lip language recognition device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The invention also provides a lip language identification device based on image identification, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the lip language identification method based on image identification in the above embodiments.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the image recognition-based lip language identification method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A lip language identification method based on image identification is characterized in that the lip language identification method based on image identification comprises the following steps:
collecting multi-frame face images of a lip language user in real time;
sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
sequentially extracting the features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
inputting the lip feature sequence into a preset lip language recognition model for lip language pronunciation recognition, and outputting a pronunciation phoneme sequence corresponding to the lip language user;
converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format, and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
and performing audio conversion on the target natural sentence to obtain a lip language pronunciation and broadcasting the lip language pronunciation.
2. The lip language identification method based on image identification according to claim 1, wherein the sequentially performing key point detection and lip region positioning on the face images to obtain the lip region images corresponding to the face images comprises:
sequentially inputting each face image into a face recognition model for key point detection to obtain the face key points in each face image;
determining the key points of the mouth corners in the face images according to the marking information corresponding to the key points of the faces;
and determining the lip region corresponding to each face image according to the mouth corner key points of each face image, and cropping the lip region to obtain the lip region image corresponding to each face image.
3. The lip language identification method based on image identification according to claim 1 or 2, wherein the sequentially performing feature extraction on the lip region images corresponding to the face images to obtain the lip feature sequence of the lip language user comprises:
aligning the lip region image corresponding to each face image with a preset standard mouth image;
calculating the offset and the rotation factor of each lip region image relative to a standard mouth image to obtain a lip feature vector corresponding to each lip region image;
and sequentially splicing the lip feature vectors corresponding to the lip region images according to the acquisition time sequence of the face images to obtain the lip feature sequence of the lip language user.
4. The lip language identification method based on image identification according to claim 1, wherein before the collecting of the multi-frame face images of the lip language user in real time, the method further comprises:
obtaining a plurality of lip region image samples with pronunciation phoneme labels;
extracting the lip feature sequences corresponding to the lip region image samples and taking the lip feature sequences as training samples;
initializing an end-to-end neural network model with initial network parameters, the end-to-end neural network model comprising: an encoder and a decoder, the encoder comprising a number of layers of a first RNN network, the decoder comprising a number of layers of a second RNN network;
inputting the training samples into each first RNN of the encoder to perform pronunciation encoding to obtain first pronunciation vectors corresponding to the training samples;
inputting the first pronunciation vector into each second RNN of the decoder to perform pronunciation mapping to obtain pronunciation phoneme prediction results corresponding to each first pronunciation vector;
calculating a CTC loss function of the end-to-end neural network model according to the pronunciation phoneme prediction result and the training sample to obtain a model loss value;
judging whether the end-to-end neural network model converges according to the model loss value;
and if the end-to-end neural network model converges, taking the end-to-end neural network model as the lip language recognition model; otherwise, feeding the pronunciation phoneme prediction result back into the end-to-end neural network model in the reverse direction, and updating the network parameters of the end-to-end neural network model until it converges, so as to obtain the lip language recognition model.
5. The lip language recognition method based on image recognition according to claim 4, wherein the inputting the lip feature sequence into a preset lip language recognition model for lip language pronunciation recognition, and the outputting the pronunciation phoneme sequence corresponding to the lip language user comprises:
inputting the lip feature sequence into an encoder of the lip language recognition model for pronunciation encoding to obtain a second pronunciation vector corresponding to the lip feature sequence;
and inputting the second pronunciation vector into a decoder of the lip language recognition model for pronunciation mapping to obtain a pronunciation phoneme sequence corresponding to the lip language user.
6. The lip language recognition method based on image recognition according to claim 1, wherein the statistical language model comprises a forward LSTM network and a reverse LSTM network; and the step of converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format and scoring the natural sentences through a preset statistical language model to obtain a target natural sentence comprises the following steps:
inputting each natural sentence into the forward LSTM network in forward word order for network calculation to obtain a first prediction result of each natural sentence;
inputting each natural sentence into the reverse LSTM network in reverse word order for network calculation to obtain a second prediction result of each natural sentence;
and calculating the mean value of the first prediction result and the second prediction result to obtain the score of each natural sentence, and taking the natural sentence with the highest score as the target natural sentence.
7. The lip language recognition method based on image recognition according to claim 1 or 6, wherein the converting the pronunciation phoneme sequence into a plurality of natural sentences in a text format comprises:
querying a preset phoneme pronunciation mapping table with each pronunciation phoneme in the pronunciation phoneme sequence as the query keyword to obtain a phoneme ID corresponding to each pronunciation phoneme;
querying a preset word mapping table according to the phoneme IDs to obtain a plurality of words corresponding to each phoneme;
and combining the words corresponding to the phonemes according to the arrangement order of the pronunciation phoneme sequence to obtain a plurality of natural sentences in a text format.
8. A lip language recognition device based on image recognition is characterized in that the lip language recognition device based on image recognition comprises:
the image acquisition module is used for acquiring multi-frame face images of the lip language users in real time;
the lip positioning module is used for sequentially carrying out key point detection and lip region positioning on each face image to obtain a lip region image corresponding to each face image;
the feature extraction module is used for sequentially extracting features of the lip region images corresponding to the face images to obtain a lip feature sequence of the lip language user;
the sequence recognition module is used for inputting the lip feature sequence into a preset lip language recognition model to perform lip language pronunciation recognition and outputting a pronunciation phoneme sequence corresponding to the lip language user;
the sentence scoring module is used for converting the pronunciation phoneme sequence into a plurality of natural sentences in a character format and scoring the natural sentences through a preset statistical language model to obtain target natural sentences;
and the lip language broadcasting module is used for performing audio conversion on the target natural sentence to obtain a lip language pronunciation and broadcasting the lip language pronunciation.
9. A lip language recognition device based on image recognition, characterized in that the lip language recognition device based on image recognition comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the image recognition based lip recognition device to perform the image recognition based lip recognition method according to any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the lip language recognition method based on image recognition according to any one of claims 1 to 7.
CN202011635782.8A 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification Active CN112784696B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635782.8A CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635782.8A CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Publications (2)

Publication Number Publication Date
CN112784696A true CN112784696A (en) 2021-05-11
CN112784696B CN112784696B (en) 2024-05-10

Family

ID=75754891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635782.8A Active CN112784696B (en) 2020-12-31 2020-12-31 Lip language identification method, device, equipment and storage medium based on image identification

Country Status (1)

Country Link
CN (1) CN112784696B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN112041924A (en) * 2018-05-18 2020-12-04 渊慧科技有限公司 Visual speech recognition by phoneme prediction
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
US20200349923A1 (en) * 2019-05-03 2020-11-05 Google Llc Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435421A (en) * 2021-08-26 2021-09-24 湖南大学 Cross-modal attention enhancement-based lip language identification method and system
CN114596517A (en) * 2022-01-12 2022-06-07 北京云辰信通科技有限公司 Visual language identification method and related equipment
CN114842846A (en) * 2022-04-21 2022-08-02 歌尔股份有限公司 Method and device for controlling head-mounted equipment and computer readable storage medium
WO2023208134A1 (en) * 2022-04-29 2023-11-02 上海商汤智能科技有限公司 Image processing method and apparatus, model generation method and apparatus, vehicle, storage medium, and computer program product
CN116189271A (en) * 2023-04-20 2023-05-30 深圳曼瑞德科技有限公司 Data processing method and system based on intelligent watch identification lip language
CN116524418A (en) * 2023-07-03 2023-08-01 平安银行股份有限公司 Face and mouth recognition method, device and system and storage medium
CN117292437A (en) * 2023-10-13 2023-12-26 山东睿芯半导体科技有限公司 Lip language identification method, device, chip and terminal
CN117292437B (en) * 2023-10-13 2024-03-01 山东睿芯半导体科技有限公司 Lip language identification method, device, chip and terminal
CN117671796A (en) * 2023-12-07 2024-03-08 中国人民解放军陆军第九五八医院 Knee joint function degeneration gait pattern feature recognition method and system

Also Published As

Publication number Publication date
CN112784696B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN108766414B (en) Method, apparatus, device and computer-readable storage medium for speech translation
JP3848319B2 (en) Information processing method and information processing apparatus
CN104050160B (en) Interpreter's method and apparatus that a kind of machine is blended with human translation
CN113539240B (en) Animation generation method, device, electronic equipment and storage medium
CN111785275A (en) Voice recognition method and device
CN110472548B (en) Video continuous sign language recognition method and system based on grammar classifier
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN113642536B (en) Data processing method, computer device and readable storage medium
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN111797265A (en) Photographing naming method and system based on multi-mode technology
JP2023155209A (en) video translation platform
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN115408488A (en) Segmentation method and system for novel scene text
CN115083392A (en) Method, device, equipment and storage medium for acquiring customer service coping strategy
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN114281948A (en) Summary determination method and related equipment thereof
CN112084788B (en) Automatic labeling method and system for implicit emotion tendencies of image captions
Reddy et al. Indian sign language generation from live audio or text for tamil
CN114048319B (en) Humor text classification method, device, equipment and medium based on attention mechanism
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
Choudhury et al. Review of Various Machine Learning and Deep Learning Techniques for Audio Visual Automatic Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant