CN117292437B - Lip language identification method, device, chip and terminal - Google Patents

Lip language identification method, device, chip and terminal

Info

Publication number
CN117292437B
CN117292437B (application CN202311327121.2A)
Authority
CN
China
Prior art keywords
face
face image
image
representing
ambiguity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311327121.2A
Other languages
Chinese (zh)
Other versions
CN117292437A (en)
Inventor
王汉波
郭军
柯武生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Ruixin Semiconductor Technology Co ltd
Original Assignee
Shandong Ruixin Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Ruixin Semiconductor Technology Co ltd filed Critical Shandong Ruixin Semiconductor Technology Co ltd
Priority to CN202311327121.2A priority Critical patent/CN117292437B/en
Publication of CN117292437A publication Critical patent/CN117292437A/en
Application granted granted Critical
Publication of CN117292437B publication Critical patent/CN117292437B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The embodiment of the invention discloses a lip language identification method, device, chip and terminal. The method includes: acquiring a first face image; blurring the first face image according to a preset blurring algorithm to obtain a second face image; calculating the face ambiguity of each blurred second face image; calculating the face ambiguity change rate based on adjacent second face images; screening the second face images that meet a preset change-rate requirement as third face images; extracting a Mel spectrogram of the third face image; and inputting the Mel spectrogram into a WaveNet vocoder, which synthesizes visual speech to realize lip language recognition. In this scheme, face blur detection and the face ambiguity are introduced at the face detection stage as parameters for judging whether an image frame can serve as training data for lip-reading modeling, and blurred frames in the video are screened out, thereby improving the effect of visual Chinese speech synthesis.

Description

Lip language identification method, device, chip and terminal
Technical Field
The invention relates to the technical field of computer machine learning and artificial intelligence, in particular to a lip language identification method, a lip language identification device, a lip language identification chip and a lip language identification terminal.
Background
Lip language plays a vital role in human communication and speech understanding. Studies have shown that unaided human lip reading is poor: even hearing-impaired people achieve less than 30% accuracy. A good lip recognition technology can therefore improve hearing aids and help acquire speech information in silent, security-sensitive and noisy environments, which gives it great practical value and makes it a field of growing interest. However, existing lip recognition technology suffers from poor recognition because the facial lip-region features of different speakers differ for the same mouth shape, and because motion blur may occur during video recording.
Disclosure of Invention
Accordingly, a lip language identification method, device, chip and terminal are provided, mainly to solve the problem of poor Chinese lip language recognition; the effectiveness and feasibility of the method and device are verified through experiments.
In a first aspect, a method for identifying a lip language is provided, including:
acquiring a first face image;
performing fuzzy processing on the first face image according to a preset fuzzy algorithm to obtain a second face image;
based on the second face images after the blurring process, calculating respective face blurring degree;
Calculating a face ambiguity rate based on the adjacent second face images;
screening a second face image meeting the requirement of a preset change rate to serve as a third face image;
extracting a Mel spectrogram of the third face image;
the Mel spectrogram is input into a WaveNet vocoder, and visual voice is synthesized by the WaveNet vocoder to realize lip language recognition.
Optionally, the blurring processing of the first face image according to the preset blurring algorithm includes:
the first face image is blurred using the following formula:
g(x_fu, y_fu) = f(x_fu, y_fu) ⊗ h(x_fu, y_fu) + η(x_fu, y_fu)
where (x_fu, y_fu) are the pixel point coordinates, η(x_fu, y_fu) represents the additive noise, ⊗ represents the convolution process, f(x_fu, y_fu) represents the original input image, h(x_fu, y_fu) represents the blur function, and g(x_fu, y_fu) represents the blurred image.
Optionally, the calculating the respective face ambiguity includes:
the respective face ambiguities are calculated using the following formula:
where CI denotes the face ambiguity, J denotes the upper limit of the resolution level, and PQ_j denotes the blur quality factor at resolution level j;
PQ_j can be defined by the following formula:
where PNB_j(x_fu, y_fu) denotes the blur (fuzzy) state of pixel (x_fu, y_fu), and the sum of PNB_j(x_fu, y_fu) over all pixels gives the total number of blurred pixels of the image;
PNE_j(x_fu, y_fu) denotes the clear state of pixel (x_fu, y_fu), and the sum of PNE_j(x_fu, y_fu) over all pixels gives the total number of clear pixels of the image.
Optionally, the calculating the face ambiguity change rate includes:
the face ambiguity rate is calculated using the following formula:
where the initial-image sequence number and the end-image sequence number delimit the frames over which the face ambiguity change rate is calculated, ξ denotes the perceptible edge break value constant, whose value is 0.63, CJ denotes the face ambiguity change rate, and CI denotes the face ambiguity.
Optionally, the screening the second face image meeting the preset change rate requirement includes:
comparing the face ambiguity change rate of the second face image with a preset change rate threshold range; wherein the preset rate of change threshold range comprises [0, 2%) and [30%,1];
if the face ambiguity change rate of the second face image is within the range of [0, 2%) or [30%,1], judging that the second face image meets the requirement of the preset change rate, and taking the second face image as a third face image; otherwise, judging that the preset change rate requirement is not met so as to delete the second face image.
Optionally, the extracting the mel spectrogram of the third face image includes:
improving the text-to-speech synthesis model Tacotron2, which is based on LSTM units and Attention units, into a visual speech synthesis model, and extracting the Mel spectrogram of the third face image;
The visual speech synthesis model comprises a visual feature extraction module of multichannel attention, a time sequence feature extraction module of a bidirectional LSTM unit, a semantic feature extraction module of position sensitive attention, a Mel frequency spectrum decoder and a WaveNet vocoder.
Optionally, the method for identifying the lip language further comprises the following steps:
inputting the visual voice output by the WaveNet vocoder into a visual voice recognition model to convert the visual voice into text output by using the visual voice recognition model;
the visual speech recognition model is formed by connecting a Mel frequency spectrum encoder improved by a VGG network with a CTC algorithm.
In a second aspect, there is provided a lip language recognition apparatus, comprising:
the acquisition module is used for acquiring a first face image;
the blurring processing module is used for blurring processing the first face image according to a preset blurring algorithm to obtain a second face image;
the first computing module is used for computing the respective face ambiguity based on the blurred second face images;
the second calculation module is used for calculating the face ambiguity change rate based on the adjacent second face images;
the screening module is used for screening the second face images meeting the requirement of the preset change rate and taking the second face images as third face images;
The feature extraction module is used for extracting a Mel spectrogram of the third face image;
and the recognition module is used for inputting the Mel spectrogram into a WaveNet vocoder, and synthesizing visual voice by using the WaveNet vocoder to realize lip language recognition.
In a third aspect, there is provided a chip comprising a first processor for calling and running a computer program from a memory, such that a device on which the chip is mounted performs the steps of the method of lip language identification as defined in any one of the preceding claims.
In a fourth aspect, there is provided a terminal comprising a memory, a second processor and a computer program stored in said memory and executable on said second processor, the second processor implementing the steps of the lip recognition method as described above when said computer program is executed.
The beneficial effects are that:
The lip language identification method, device, chip and terminal include: acquiring a first face image; blurring the first face image according to a preset blurring algorithm to obtain a second face image; calculating the face ambiguity of each blurred second face image; calculating the face ambiguity change rate based on adjacent second face images; screening the second face images that meet a preset change-rate requirement as third face images; extracting a Mel spectrogram of the third face image; and inputting the Mel spectrogram into a WaveNet vocoder, which synthesizes visual speech to realize lip language recognition. In this scheme, face blur detection and the face ambiguity are introduced at the face detection stage as parameters for judging whether an image frame can serve as training data for lip-reading modeling, and blurred frames in the video are screened out, thereby improving the effect of visual Chinese speech synthesis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a basic flow diagram of a method for identifying a lip language according to an embodiment of the present invention;
FIG. 2 is a schematic view of image blur in an embodiment of the present invention;
FIG. 3 is a basic block diagram of a visual speech synthesis model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a 3D-SENet module according to an embodiment of the invention;
FIG. 5 is a schematic diagram of an encoder structure of an L2W network model according to an embodiment of the present invention;
FIG. 6 is a basic block diagram of a visual speech recognition model according to an embodiment of the present invention;
fig. 7 is a basic structural block diagram of a lip language identity authentication system according to an embodiment of the present invention;
FIG. 8 is a basic block diagram of a system for generating Chinese subtitles according to an embodiment of the invention;
fig. 9 is a basic structural block diagram of a lip language recognition device according to an embodiment of the present invention;
fig. 10 is a basic structural block diagram of a chip in an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
Some of the flows described in the specification, the claims and the foregoing figures contain a plurality of operations occurring in a particular order, but it should be understood that these operations may be performed out of the described order or in parallel. Operation numbers such as 101 and 102 are merely used to distinguish the operations and do not by themselves represent any order of execution. In addition, the flows may include more or fewer operations, which may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, and the "first" and "second" items are not required to be of different types.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention based on the embodiments of the present invention.
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Referring specifically to fig. 1, fig. 1 is a schematic flow chart of a lip language recognition method according to the present embodiment.
As shown in fig. 1, a method for identifying a lip language includes:
s11, acquiring a first face image;
in the embodiment of the application, the GRID and CMLR databases can be used to obtain related face images as training data sets for model training.
The datasets are first acquired and preprocessed. They include the Chinese sentence-level public lip-reading dataset CMLR (Chinese Mandarin Lip Reading) and the English sentence-level public lip-reading dataset GRID.
The CMLR dataset is a large Chinese sentence-level audiovisual bimodal corpus collected by the Visual Intelligence and Pattern Analysis (VIPA) group of Zhejiang University, aimed at facilitating research on visual speech recognition. The dataset was built from the national news program "News Broadcast" aired between June 2009 and June 2018 and contains high-quality speech video, audio, and Chinese-character-level annotations.
The GRID dataset is a large English sentence-level audiovisual bimodal corpus published by Cooke et al. in 2006 in the Journal of the Acoustical Society of America. The dataset collected 1000 sentence-level speech videos, audio, and word-level labels structured as "command + color + preposition + letter + digit + adverb".
The data preprocessing may then be performed as follows:
both GRID and CMLR datasets are 25 frame rate raw video, and for convenience of subsequent operations, the embodiments of the present application use a ffmpeg tool to merge, clip, and de-frame the video. The original video is combined into a complete video segment larger than 3s, the overlong video is cut into video segments with each segment larger than 3 seconds, and finally each data is processed into a picture data set larger than 75 pictures through frame splitting.
On the other hand, the videos in the datasets are raw recordings whose scenes are a laboratory or a broadcasting studio. Therefore, the embodiments of the present application first use the face detector provided by the face_recognition library to remove the recording background and crop the person's face region as the region of interest. The 68 located face key points are then used in an affine transformation to adjust the face pose. Finally, the face region is normalized to a preset size, typically set to 48×48 or 96×96.
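A sketch of the face cropping and alignment step is given below. It assumes the face_recognition library (which provides the 68-landmark detector mentioned above) and simplifies the 68-point affine adjustment to an eye-line rotation; the function name and the 96×96 default are illustrative.

```python
# Sketch of the face ROI step: detect the face, roughly normalise the pose using the eye line,
# crop the face region, and resize it to the preset size. This is a simplification of the
# 68-landmark affine adjustment described above.
from typing import Optional
import cv2
import numpy as np
import face_recognition

def crop_and_align_face(image_path: str, size: int = 96) -> Optional[np.ndarray]:
    image = face_recognition.load_image_file(image_path)           # RGB ndarray
    boxes = face_recognition.face_locations(image)                  # [(top, right, bottom, left), ...]
    if not boxes:
        return None
    top, right, bottom, left = boxes[0]
    landmarks = face_recognition.face_landmarks(image, [boxes[0]])[0]
    # Rotate so the eye line is horizontal (rough pose normalisation).
    l_eye = np.mean(landmarks["left_eye"], axis=0)
    r_eye = np.mean(landmarks["right_eye"], axis=0)
    angle = np.degrees(np.arctan2(r_eye[1] - l_eye[1], r_eye[0] - l_eye[0]))
    center = ((left + right) / 2.0, (top + bottom) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))
    face = aligned[top:bottom, left:right]                          # crop the region of interest
    return cv2.resize(face, (size, size))                           # normalise to 96x96 (or 48x48)
```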
S12, blurring processing is carried out on the first face image according to a preset blurring algorithm, and a second face image is obtained;
A blurred image can be understood as the result of convolving a clear image with a blur function and adding additive noise; edge areas are often hard to distinguish in a blurred image. The first face image can be blurred using the following formula:
g(x_fu, y_fu) = f(x_fu, y_fu) ⊗ h(x_fu, y_fu) + η(x_fu, y_fu)
where (x_fu, y_fu) are the pixel point coordinates, η(x_fu, y_fu) is the additive noise, ⊗ is the convolution process, f(x_fu, y_fu) is the original input image, h(x_fu, y_fu) is the blur function, and g(x_fu, y_fu) is the blurred image. The blur function h differs for different usage scenarios; in dynamic face recognition the target object is often in motion, so the blur mainly manifests as motion blur and focus blur (see Fig. 2), and the face ambiguity change rate is discussed mainly for these two kinds of blur.
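For illustration, the following sketch applies the degradation model g = f ⊗ h + η with either a motion-blur kernel or a Gaussian (defocus) approximation using OpenCV; the kernel sizes and noise level are assumed values.

```python
# Sketch of the blur model g = f (*) h + eta for the two blur types discussed
# (motion blur and focus/defocus blur); kernel sizes and noise level are illustrative assumptions.
import cv2
import numpy as np

def motion_blur_kernel(length: int = 9) -> np.ndarray:
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length                             # horizontal motion streak
    return kernel

def degrade(face: np.ndarray, mode: str = "motion", noise_sigma: float = 2.0) -> np.ndarray:
    if mode == "motion":
        blurred = cv2.filter2D(face, -1, motion_blur_kernel())        # f (*) h with a motion kernel
    else:
        blurred = cv2.GaussianBlur(face, (9, 9), sigmaX=3.0)          # defocus approximation
    eta = np.random.normal(0.0, noise_sigma, face.shape)              # additive noise
    return np.clip(blurred.astype(np.float64) + eta, 0, 255).astype(np.uint8)
```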
S13, calculating respective face fuzziness based on each second face image after the fuzzy processing;
specifically, firstly, a face ambiguity model is constructed:
The discriminator and the generator are optimized according to the loss function of the generative adversarial network, which is:
min_{G_en} max_{D_ix} V(D_ix, G_en) = E_{x_g~P_r}[log D_ix(x_g)] + E_{x~P_g}[log(1 - D_ix(x))]
where x_g is a real sample, P_r is the real data distribution, P_g is the generated data distribution, G_en is the generator, D_ix is the discriminator, V(·,·) is the loss function, D_ix(x_g) is the output of the discriminator, the max operator corresponds to maximizing the discriminator's output of the objective, and the min operator corresponds to minimizing it for the generator.
This formula is a minimax optimization problem, and optimizing the objective function has two steps: first the discriminator is optimized, then the generator, so the loss function can be split into:
optimizing the discriminator:
max_{D_ix} E_{x_g~P_r}[log D_ix(x_g)] + E_{x~P_g}[log(1 - D_ix(x))]
optimizing the generator:
min_{G_en} E_{x~P_g}[log(1 - D_ix(x))]
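A minimal PyTorch sketch of this two-step alternating optimization is shown below; the generator and discriminator architectures, and the non-saturating form of the generator update, are assumptions for illustration rather than the patent's exact training code.

```python
# Minimal sketch of the alternating GAN optimisation described above
# (the generator/discriminator networks G and D are placeholders).
import torch
import torch.nn as nn

def gan_step(G, D, opt_G, opt_D, x_real, z):
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(x_real.size(0), 1, device=x_real.device)
    zeros = torch.zeros(x_real.size(0), 1, device=x_real.device)

    # Step 1: optimise the discriminator D_ix (maximise log D(x) + log(1 - D(G(z)))).
    opt_D.zero_grad()
    d_loss = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
    d_loss.backward()
    opt_D.step()

    # Step 2: optimise the generator G_en (non-saturating variant of minimising log(1 - D(G(z)))).
    opt_G.zero_grad()
    g_loss = bce(D(G(z)), ones)
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```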
s14, calculating the change rate of the face ambiguity based on the adjacent second face images;
In an optional embodiment of the present application, the face ambiguity change rate CJ is introduced, and the processing manner is determined according to the range in which CJ falls, as shown in Table 1 below:
TABLE 1
Face ambiguity change rate range    Processing result
0 ≤ |CJ| < 2%                       Proceed to the next processing step
2% ≤ |CJ| < 30%                     The blurred frames must be removed
30% ≤ |CJ| ≤ 1                      Proceed to the next processing step
The face ambiguity change rate CJ can be expressed by the following formula:
where the initial-picture sequence number and the end-picture sequence number delimit the frames over which the ambiguity change rate is calculated, ξ denotes the perceptible edge break value constant, which can be taken as 0.63, the remaining index is a loop variable over the frames, CJ denotes the face ambiguity change rate, and CI denotes the face ambiguity, which can be defined by the following formula:
where J denotes the upper limit of the resolution level; the score obtained by calculating the ambiguity of a face image with the above formula lies in the range [0,1]. PQ_j denotes the blur quality factor at resolution level j and can be defined by the following formula:
where PNB_j(x_fu, y_fu) denotes the blur state of pixel (x_fu, y_fu), and the edge (clear) pixel formula is:
PNE_j(x_fu, y_fu) = 1 - PNB_j(x_fu, y_fu)
where PNE_j(x_fu, y_fu) denotes the clear state of pixel (x_fu, y_fu). The total number of blurred pixels (the sum of PNB_j) and the number of edge pixels (the sum of PNE_j, i.e., the total number of clear pixels) are calculated first.
Finally, the face ambiguity change rate of the acquired continuous frames is judged, and if 2% ≤ |CJ| < 30%, the corresponding continuous frame fragments are removed. Since such frames occur at the beginning of the videos in the dataset, this has little impact on processing efficiency.
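The frame-screening rule of Table 1 can be sketched as follows. Because the exact CI and PQ_j formulas are given as figures in the patent, this sketch substitutes a normalized Laplacian-variance score for the face ambiguity CI; only the 2% and 30% thresholds are taken from the text.

```python
# Sketch of the Table 1 screening rule; the blur score is an assumed stand-in for CI.
from typing import List
import cv2
import numpy as np

def blur_degree(gray: np.ndarray, max_var: float = 1000.0) -> float:
    """Approximate face ambiguity CI in [0, 1]; higher means blurrier (assumption)."""
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return float(np.clip(1.0 - sharpness / max_var, 0.0, 1.0))

def keep_frames(frames: List[np.ndarray]) -> List[np.ndarray]:
    ci = [blur_degree(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)) for f in frames]
    kept = [frames[0]]
    for i in range(1, len(frames)):
        cj = ci[i] - ci[i - 1]                 # ambiguity change rate between adjacent frames
        if 0.02 <= abs(cj) < 0.30:             # 2% <= |CJ| < 30%  ->  discard the blurred frame
            continue
        kept.append(frames[i])                 # otherwise proceed to the next processing step
    return kept
```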
S15, screening a second face image meeting the requirement of a preset change rate, and taking the second face image as a third face image;
and aiming at the processed data set, removing the fuzzy frame fragments in the data set by introducing face ambiguity detection.
S16, extracting a Mel spectrogram of the third face image;
in step S16, extracting the mel spectrogram of the third face image mainly includes the following steps:
at present, most of lip language recognition algorithms based on deep learning, such as WLAS, CSSMCM, CHLipNet, CHSLP-VP and the like, are algorithm models built by taking an LSTM (Long short-term memory) unit and an Attention (position sensitive Attention mechanism) unit as basic units, wherein the LSTM unit can extract the characteristics of upper and lower text information with Long time sequence, so that the corresponding relation between lip movement information and voice content is better constructed; the Attention unit can make the decoder pay more Attention to the lip-motion part features that should be paid to reduce the influence of visual confusion on the model.
S17, inputting the Mel spectrogram into a WaveNet vocoder, and synthesizing visual voice by using the WaveNet vocoder to realize lip language recognition.
The WaveNet vocoder is a deep neural network capable of generating the original audio waveform, which is a complete probabilistic autoregressive model that predicts the probability distribution of the current audio sample based on all samples that have been previously generated.
In the embodiment of the present application, the text-to-speech synthesis model Tacotron2, based on LSTM units and Attention units, is improved into a visual speech synthesis model, realizing the conversion from the human visual modality to the audible modality. The structure of the visual speech synthesis model is shown in Fig. 3:
the visual speech synthesis model mainly comprises five modules, namely a visual feature extraction module of multichannel attention, a time sequence feature extraction module of a bidirectional LSTM unit, a semantic feature extraction module of position sensitive attention, a Mel frequency spectrum decoder and a WaveNet vocoder.
First, the screened face images are divided into n segments and sent to a three-dimensional refined classification module that combines a ResNet (deep residual) network structure with an attention mechanism (i.e., the multichannel-attention visual feature extraction module) to extract short-time-series visual features.
Second, the short-time-series visual features are sent to the timing feature extraction module of the bidirectional LSTM unit to encode long-time-series context features, while a location attention mechanism (the location-sensitive-attention semantic feature extraction module) is combined to realize a hybrid attention mechanism joining content information and position information.
Then, the context features and the hybrid attention weights are sent to the Mel spectrum decoder, which decodes frame by frame to synthesize the Mel spectrogram (Mel spectra). Finally, the Mel spectrogram is sent to the WaveNet vocoder to synthesize visual speech.
With continued reference to Fig. 3: considering only lip information during lip recognition loses too much of the face's contextual information, while adding the whole face undoubtedly increases the computation and the network load. Therefore, the embodiment of the present application designs a short-time visual feature extraction module, 3D-SENet, based on the SE (Squeeze-and-Excitation) module in the visual speech synthesis model. Through the channel attention mechanism of the SE module, the short-time-series video frames gain multichannel attention, so the model can focus on the partial regions between adjacent frames that deserve more attention; this helps the convolutional neural network weigh the importance of different feature information and screen useful information from the face video.
The specific structure of the visual feature extraction module 3D-SENet is shown in Fig. 4; it mainly comprises three important steps: three-dimensional feature compression, feature excitation, and feature remapping.
In this module the face video is divided into m groups, each group containing n consecutive face images. The n×W×H×C feature obtained through 3D convolution is first compressed into an n×1×1×C space by the feature compression operation; the compressed feature map then undergoes a feature excitation operation of two fully connected layers; finally, after a Sigmoid activation function, it is multiplied pixel by pixel with the original feature matrix. The process can be represented by the following formulas:
E_1 = Conv1D(avgpool(F))
E_2 = Sigmoid(W_2 · ReLU(W_1 · E_1))
E_3 = E_2 · F
where W_1 and W_2 denote the two fully connected layers and F = {F_1, F_2, ..., F_n} is the original feature matrix of the adjacent frames. Compressing F yields the global feature map E_1 shared by the adjacent frames; the excitation operation on E_1 then captures the relation between adjacent frames to obtain the activation value E_2; finally E_2 is multiplied by F to obtain the feature map E_3 carrying the adjacent-frame weight coefficients.
As shown in Fig. 4, the visual feature extraction module 3D-SENet has three important steps, feature compression, feature excitation, and feature remapping, described as follows:
(1) Feature compression: let the original feature dimensions be W×H×C, where W is the width, H the height, and C the number of channels. The main task of feature compression is to compress each two-dimensional channel matrix into a one-dimensional feature vector, so that this vector obtains the global field of view of the original two-dimensional matrix and the perception area of the receptive field is enhanced. Feature compression is generally implemented with global average pooling, which encodes the entire spatial feature of one channel into a single global feature, as in the following formula:
S_c = F_sq(f_c) = (1 / (W × H)) Σ_i Σ_j f_c(i, j)
where S_c is the feature vector obtained after the compression operation, F_sq denotes the average pooling function, and f_c(i, j) is the value of the feature map F at coordinates (i, j).
(2) Feature excitation: this operation is similar to the gating functions in RNNs (recurrent neural networks). It generates a weight for each feature channel through the parameters of the fully connected layers, allowing the model to learn the relationship between feature channels. The feature vector S produced by feature compression is sent through a two-layer fully connected block with a ReLU (nonlinear) layer and then converted into normalized weights by a Sigmoid activation function, as in the following formula:
E = Sigmoid(W_2 · ReLU(W_1 · S))
To reduce model complexity and improve generalization, a two-layer fully connected structure is used: the first layer W_1 reduces the dimensionality, and after the ReLU layer the second layer W_2 restores the original dimensionality.
(3) Feature remapping: this operation remaps the learned normalized channel weights onto the feature map. The mapping is completed by multiplying the channel weights E produced by feature excitation with the feature map channel by channel, as in the following formula:
x_c = E_c · f_c
where x_c is the final feature map, whose size remains W×H×C with no dimensional change.
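The three steps above can be sketched as a PyTorch module operating on a clip tensor of shape (batch, C, n, H, W); the Conv1D kernel size and the reduction ratio r are assumed hyperparameters.

```python
# Sketch of the 3D-SENet block (feature compression -> excitation -> remapping) for a clip
# tensor of shape (B, C, n, H, W); kernel size and reduction ratio r are assumptions.
import torch
import torch.nn as nn

class SENet3D(nn.Module):
    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.temporal_mix = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # E1 = Conv1D(avgpool(F))
        self.w1 = nn.Linear(channels, channels // r)    # W1: dimensionality reduction
        self.w2 = nn.Linear(channels // r, channels)    # W2: restore the original dimension

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, n, h, w = f.shape
        s = f.mean(dim=(3, 4))                          # feature compression: (B, C, n, H, W) -> (B, C, n)
        e1 = self.temporal_mix(s)                       # E1: global descriptor shared by adjacent frames
        e2 = torch.sigmoid(self.w2(torch.relu(self.w1(e1.transpose(1, 2)))))  # E2 = Sigmoid(W2 ReLU(W1 E1))
        e2 = e2.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)                   # reshape to (B, C, n, 1, 1)
        return f * e2                                   # E3 = E2 * F  (feature remapping)
```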
The 3D-SENet-based visual speech synthesis model mainly consists of a video encoder and a Mel spectrogram decoder. The multichannel-attention visual feature extraction module and the timing feature extraction module of the bidirectional LSTM unit are combined to form the video encoder part of the visual speech synthesis model L2W. A flow chart of this part is shown in Fig. 5: the boxed part on the left is the newly added multichannel-attention face spatial visual feature extraction module, and the right part of Fig. 5 is the timing feature extraction module of the bidirectional LSTM unit.
Let a group contain m segments, each segment being a face video sequence of n frames of face images I. The video encoder first feeds the face image sequence I into a 3D-Conv BN-ReLU block to extract visual features, then sends the feature map through multiple (at least 4) layers of 3D-SENet modules to learn multichannel attention weights, outputting a feature vector F = {F_1, F_2, ..., F_m}, which is finally passed to the bidirectional LSTM unit to extract context features. In the encoder computation, the short-time-series visual features of image I are extracted by the 3D-Conv BN-ReLU block and the multilayer 3D-SENet modules, and the Bi-LSTM encoder generates the corresponding state vectors. Experiments show that adding the 3D-SENet module lets the model better attend to the channels and regions that deserve attention and extract visual features more effectively, thereby improving the correspondence between images and sound waves.
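A minimal sketch of this encoder pipeline (3D-Conv BN-ReLU stem, stacked 3D-SENet blocks, bidirectional LSTM) is given below, reusing the SENet3D block sketched earlier; channel widths, strides and hidden sizes are illustrative assumptions.

```python
# Minimal sketch of the L2W video encoder: 3D-Conv/BN/ReLU stem, 4 stacked 3D-SENet blocks
# (matching "at least 4 layers"), and a bidirectional LSTM over the frame axis.
import torch
import torch.nn as nn

class L2WEncoder(nn.Module):
    def __init__(self, channels: int = 64, hidden: int = 256, senet_layers: int = 4):
        super().__init__()
        self.stem = nn.Sequential(                       # 3D-Conv BN-ReLU visual front end
            nn.Conv3d(3, channels, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        self.senet = nn.Sequential(*[SENet3D(channels) for _ in range(senet_layers)])
        self.blstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, n, H, W) face image sequence I
        f = self.senet(self.stem(clip))                  # short-time visual features with channel attention
        f = f.mean(dim=(3, 4)).transpose(1, 2)           # pool spatially -> (B, n, C)
        h, _ = self.blstm(f)                             # Bi-LSTM context encoding -> (B, n, 2*hidden)
        return h
```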
In order to adapt to the complex changes of the voiceprint, the location-sensitive attention mechanism used in the embodiment of the present application marks, with a superscript, the video or speech segment to which a feature belongs and computes attention weight vectors for different segments at different moments, so that the decoder can better attend to the features that should be noticed. The process is represented by the following formula:
e_{i,j} = w^T tanh(W·s_i + V·h_j + U·f_{i,j} + b)
where s_i is the state vector of the decoder during visual speech synthesis (fed by the Mel spectrogram frame produced by the PreNet, a module composed of two fully connected layers), h_j is the encoder state vector generated by the bidirectional LSTM unit, f_{i,j} is the location feature obtained by accumulating the previous attention weights and convolving them with F, and W, V, U and b are parameters to be trained. The location-sensitive attention assigns weights to the features encoded at different times, making the Mel spectrogram generated when the decoder predicts the audio closer to the real one.
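The energy computation above can be sketched as follows, following the standard location-sensitive attention form; the projection dimensions and the location-convolution kernel size are assumptions.

```python
# Sketch of the location-sensitive attention score e_{i,j} = w^T tanh(W s_i + V h_j + U f_{i,j} + b),
# where f is obtained by convolving the cumulative attention weights; dimensions are assumptions.
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    def __init__(self, dec_dim=1024, enc_dim=512, attn_dim=128, loc_filters=32, loc_kernel=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)      # projects decoder state s_i
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)      # projects encoder outputs h_j
        self.U = nn.Linear(loc_filters, attn_dim, bias=False)  # projects location features f_{i,j}
        self.loc_conv = nn.Conv1d(1, loc_filters, loc_kernel, padding=loc_kernel // 2, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=True)             # scoring vector w (its bias plays the role of b)

    def forward(self, s, h, cum_weights):
        # s: (B, dec_dim), h: (B, T, enc_dim), cum_weights: (B, T) accumulated attention weights
        f = self.loc_conv(cum_weights.unsqueeze(1)).transpose(1, 2)     # (B, T, loc_filters)
        e = self.w(torch.tanh(self.W(s).unsqueeze(1) + self.V(h) + self.U(f))).squeeze(-1)
        return torch.softmax(e, dim=-1)                                  # attention weights over T
```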
In this embodiment of the present application, the screened data are shuffled and divided into training, test and validation sets at a ratio of 8:1:1. To keep the configuration identical, training is performed on a single V100 card with 16 GB of video memory; the optimizer defaults to Adam (Adaptive Moment Estimation) with an initial learning rate of 10^-3. Because the GRID videos are short, the hidden layer size is usually halved to prevent overfitting. With the batch size set to 32, at least 30,000 training iterations are needed for the model to synthesize similar Mel spectrograms, and the GRID dataset is trained for 50,000 iterations.
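The stated training configuration (8:1:1 split, Adam, initial learning rate 1e-3, batch size 32) can be set up roughly as below; `model` and `dataset` are placeholders for the L2W model and the screened face-clip dataset.

```python
# Sketch of the training configuration described above; model and dataset are placeholders.
import torch
from torch.utils.data import random_split, DataLoader

def make_training_setup(model, dataset, batch_size=32):
    n = len(dataset)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    train_set, test_set, val_set = random_split(dataset, [n_train, n_test, n - n_train - n_test])
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)   # shuffled training data
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)             # Adam, initial learning rate 10^-3
    return loader, test_set, val_set, optimizer
```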
Three objective evaluation methods are selected in the embodiment of the present application: the Perceptual Evaluation of Speech Quality (PESQ) algorithm based on speech distortion and noise influence, the Short-Time Objective Intelligibility (STOI) measure based on the human auditory perception system, and the Mel Cepstral Distance (MCD) based on spectrogram similarity. On the other hand, to verify the machine recognizability of the synthesized visual speech, the embodiment of the present application performs recognition verification of the synthesized visual speech with the iFLYTEK and Baidu speech recognition engines, and the comparison results are evaluated by the Character Accuracy Rate (CAR). The visual speech synthesis model adopted in the embodiment of the present application (L2W, Lip to Wav) achieves comparable perceived speech quality while being optimal on the evaluation indexes of the other dimensions.
The network structure design of the visual speech synthesis model above realizes the conversion from the visual modality to the auditory modality. In an optional embodiment of the present application, the auditory modality can further be converted into a text modality to produce text output, for application scenarios such as subtitle generation.
In an alternative embodiment of the present application, the visual speech output by the WaveNet vocoder is input to the visual speech recognition model W2T (Wav to Text) to convert the visual speech to text output using the visual speech recognition model; the visual speech recognition model is formed by connecting a modified Mel frequency spectrum encoder of a VGG network (belonging to a convolutional neural network) with a CTC algorithm.
The VGG network is an existing neural network whose basic idea is to reuse simple base network modules to build a deep learning model. In the embodiment of the present application, the input of the neural network is a 255×255 RGB image; the front end of the network has five units, each formed by two or three convolutional layers followed by a pooling layer, and the end of the network consists of three fully connected layers and a cross-entropy loss. The first four pooling stages of the model span thirteen convolutional layers in total, which together with the last three fully connected layers gives sixteen layers, so it is also known as the VGG16 neural network; the VGG16 architecture is a very large-scale deep convolutional neural network.
In the embodiment of the present application, the Mel spectrogram is taken as the direct input of the visual speech recognition model W2T, which allows the visual speech synthesis model L2W and the visual speech recognition model W2T to be trained independently. In the W2T model structure, the encoder part follows the VGG network structure, which has the best image recognition effect, and the decoder part adopts the CTC loss commonly used in speech recognition. The whole model maps deep convolutional features into the label space corresponding to the CTC loss, realizing an end-to-end structure that directly transcribes a Mel spectrogram into Chinese characters.
The structure of the W2T model is shown in Fig. 6, where the VGG-based improved Mel spectrum encoder is inside the dashed box. The Mel spectrogram is used as the feature input of the visual speech recognition model, and the Mel spectrum encoder extracts multi-scale image information to obtain the depth feature sequence M_F of the synthesized Mel spectrogram sequence; the feature sequence M_F is then sent to the CTC algorithm model to find the most accurate mapping to the dictionary Y.
CTC (Connectionist Temporal Classification) is an algorithm commonly used in fields such as speech recognition and text recognition to solve the problems of unequal length and misalignment between input and output sequences. CTC can find the best alignment without knowing the relationship between input and output. CTC finds the accurate mapping between M_F and the dictionary Y by maximizing the posterior probability of Y given M_F, shown in the following formula:
P(Y | M_F) = Σ_{A ∈ B^{-1}(Y)} Π_{t=1..T} p_t(a_t | M_F)
where P(Y|M_F) denotes the posterior probability of Y, A ranges over the alignments that collapse to Y, and t indexes the time steps. For example, the output dictionary corresponding to the label "hello" is {'ni ε', 'hao ε'}, where 'ε' denotes the label for homophone text. The lengths of the input and output sequences are also required when computing the loss, to eliminate the effect of padding.
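A sketch of training the encoder output against the dictionary with CTC loss is shown below, using PyTorch's nn.CTCLoss; the vocabulary size, blank index and tensor shapes are assumptions.

```python
# Sketch of the CTC objective: the encoder output M_F (log-probabilities over the dictionary Y
# plus a blank symbol) is trained with CTC loss, which marginalises over all alignments.
import torch
import torch.nn as nn

def ctc_loss_example(log_probs, targets, input_lengths, target_lengths, blank=0):
    # log_probs: (T, B, V) log-softmax outputs of the Mel spectrum encoder
    # targets: (B, S) character indices; the lengths exclude padding so it does not affect the loss
    ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
    return ctc(log_probs, targets, input_lengths, target_lengths)

# Usage sketch with dummy shapes:
T, B, V, S = 100, 4, 3000, 20
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)
targets = torch.randint(1, V, (B, S))
loss = ctc_loss_example(log_probs, targets,
                        input_lengths=torch.full((B,), T, dtype=torch.long),
                        target_lengths=torch.full((B,), S, dtype=torch.long))
```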
Based on the method for identifying the lip language provided in the above embodiment, the method can be applied to a lip language identity authentication system, and is shown in fig. 7:
The lip language identity authentication system mainly comprises four parts: video input, visual speech synthesis, Mel cepstral distance calculation, and identity authentication prediction. Visual speech synthesis plays the dominant role in the system and provides the key basis for authentication. The Mel cepstral distance is calculated in the manner described above and is not repeated here. Identity authentication prediction compares the Mel cepstral distance between the predicted Mel spectrogram and the original (reference) Mel spectrogram and judges whether the distance is smaller than a set threshold: if so, the two are considered matched, i.e., they belong to the same identity; otherwise they are not matched, i.e., they are judged to belong to different identities. The lip language identity authentication function is thereby realized.
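The authentication decision can be sketched as follows: compute the Mel cepstral distance between the predicted and reference Mel cepstra and compare it with a preset threshold; the threshold value of 8.0 and the use of the standard MCD scaling constant are assumptions.

```python
# Sketch of the identity-authentication decision based on the Mel cepstral distance (MCD).
import numpy as np

def mel_cepstral_distance(mc_pred: np.ndarray, mc_ref: np.ndarray) -> float:
    # mc_pred, mc_ref: (T, D) frame-aligned Mel cepstral coefficients (0th coefficient excluded)
    diff = mc_pred - mc_ref
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def same_identity(mc_pred: np.ndarray, mc_ref: np.ndarray, threshold: float = 8.0) -> bool:
    return mel_cepstral_distance(mc_pred, mc_ref) < threshold   # below threshold -> same identity
```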
In an alternative embodiment of the present application, the method for identifying lips provided in the above embodiment may be applied to a chinese caption generating system, and is shown in fig. 8:
the Chinese caption generating system based on Chinese lip language recognition mainly comprises four parts of video input, visual speech synthesis, visual speech recognition and caption generation. Wherein visual speech synthesis and visual speech recognition dominate the subtitle generation system.
In order to solve the technical problems, the embodiment of the invention also provides a lip language identification device. Referring specifically to fig. 9, fig. 9 is a basic block diagram of a lip language recognition device based on deep learning according to the present embodiment, including:
an acquiring module 81, configured to acquire a first face image;
the blurring processing module 82 is configured to blur the first face image according to a preset blurring algorithm to obtain a second face image;
a first calculating module 83, configured to calculate respective face ambiguities based on the blurred second face images;
a second calculation module 84, configured to calculate a face ambiguity rate based on the adjacent second face images;
the screening module 85 is configured to screen the second face image that meets the requirement of the preset change rate, as a third face image;
the feature extraction module 86 is configured to extract a mel spectrogram of the third face image;
the recognition module 87 is used for inputting the mel spectrogram into a WaveNet vocoder, and synthesizing visual voice by using the WaveNet vocoder to realize lip language recognition.
Specific functions of each module can be referred to the related description of the above-mentioned lip language identification method, and based on the description of the above-mentioned lip language identification method, those skilled in the art can flexibly determine the specific implementation functions and implementation principles of the related modules, which are not described herein again.
In order to solve the above technical problems, the embodiment of the present invention further provides a chip, where the chip may be a general-purpose processor or a special-purpose processor. The chip comprises a first processor for supporting the terminal to perform the above related steps, for example, to invoke and run a computer program from a memory, so that a device on which the chip is mounted performs the above related steps, to implement the lip language recognition method in the above embodiments.
Optionally, in some examples, the chip further includes a transceiver, where the transceiver is controlled by the processor, and is configured to support the terminal to perform the related steps, so as to implement the method for identifying a lip language in the foregoing embodiments.
Optionally, the chip may further comprise a storage medium.
It should be noted that the chip may be implemented using the following circuits or devices: one or more field programmable gate arrays (FPGA), programmable logic devices (PLD), controllers, state machines, gate logic, discrete hardware components, or any other suitable circuit or combination of circuits capable of performing the various functions described throughout this application.
The invention also provides a terminal comprising a memory, a second processor and a computer program stored in the memory and operable on the second processor, wherein the second processor implements the steps of the lip language recognition method described in the above embodiment when executing the computer program.
Referring specifically to Fig. 10, Fig. 10 is a basic block diagram illustrating a terminal that includes a processor, a nonvolatile storage medium, a memory, and a network interface connected by a system bus. The nonvolatile storage medium of the terminal stores an operating system, a database and computer readable instructions; the database can store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement a lip language identification method. The processor of the terminal provides the computing and control capabilities that support the operation of the entire terminal. The memory of the terminal may store computer readable instructions that, when executed by the processor, cause the processor to perform the lip language identification method. The network interface of the terminal is used to connect to and communicate with other terminals. It will be appreciated by those skilled in the art that the structures shown in the drawings are block diagrams of only some of the structures associated with the solution of the present application and do not limit the terminals to which the solution may be applied; a particular terminal may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
As used herein, a "terminal" or "terminal device" includes both a device of a wireless signal receiver having no transmitting capability and a device of receiving and transmitting hardware having electronic devices capable of performing two-way communication over a two-way communication link, as will be appreciated by those skilled in the art. Such an electronic device may include: a cellular or other communication device having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service, personal communication system) that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant ) that can include a radio frequency receiver, pager, internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System ) receiver; a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio frequency receiver. As used herein, "terminal," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion, to operate at any other location(s) on earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, a network access terminal, and a music/video playing terminal, for example, may be a PDA, a MID (Mobile Internet Device ), and/or a mobile phone with a music/video playing function, and may also be a smart tv, a set top box, and other devices.
The invention also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method for lip language identification according to any one of the embodiments described above.
The present embodiment also provides a computer program which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the above-described lip language identification method; in some cases at least one of the steps shown or described may be performed in a different order than that described in the above embodiments.
The present embodiment also provides a computer program product comprising computer readable means having stored thereon a computer program as shown above. The computer readable means in this embodiment may comprise a computer readable storage medium as shown above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (8)

1. A method for identifying a lip language, comprising:
acquiring a first face image;
performing fuzzy processing on the first face image according to a preset fuzzy algorithm to obtain a second face image;
based on the second face images after the blurring process, calculating respective face blurring degree;
calculating a face ambiguity rate based on the adjacent second face images;
screening a second face image meeting the requirement of a preset change rate to serve as a third face image;
extracting a Mel spectrogram of the third face image;
inputting the Mel spectrogram into a WaveNet vocoder, and synthesizing visual voice by using the WaveNet vocoder to realize lip language recognition;
the calculating of the respective face ambiguities comprises:
the respective face ambiguities are calculated using the following formula:
CI represents the face ambiguity, J represents the upper limit of the resolution level, and PQ_j represents the blur quality factor at resolution level j;
wherein PQ_j is defined by the following formula:
wherein PNB_j(x_fu, y_fu) represents the blur state of the pixel point (x_fu, y_fu), and the sum of PNB_j(x_fu, y_fu) over all pixels represents the total number of blurred pixels of the image;
PNE_j(x_fu, y_fu) represents the clear state of the pixel point (x_fu, y_fu), and the sum of PNE_j(x_fu, y_fu) over all pixels represents the total number of clear pixels of the image;
the calculating the face ambiguity change rate comprises the following steps:
The face ambiguity rate is calculated using the following formula:
wherein the initial-image sequence number and the end-image sequence number delimit the frames over which the face ambiguity change rate is calculated, ξ represents the perceptible edge break value constant, the remaining index represents a loop variable over the frames, CJ represents the face ambiguity change rate, and CI represents the face ambiguity.
2. The method for recognizing lip according to claim 1, wherein the blurring the first face image according to a preset blurring algorithm comprises:
and carrying out fuzzy processing on the first face image by adopting the following formula:
g(x_fu, y_fu) = f(x_fu, y_fu) ⊗ h(x_fu, y_fu) + η(x_fu, y_fu)
wherein (x_fu, y_fu) denotes the pixel point coordinates, η(x_fu, y_fu) represents the additive noise, ⊗ represents the convolution process, f(x_fu, y_fu) represents the original input image, h(x_fu, y_fu) represents the blur function, and g(x_fu, y_fu) represents the blurred image.
3. The lip recognition method as set forth in claim 1, wherein the screening the second face image satisfying the preset change rate requirement as the third face image includes:
comparing the face ambiguity change rate of the second face image with a preset change rate threshold range; wherein the preset rate of change threshold range comprises [0, 2%) and [30%,1];
if the face ambiguity change rate of the second face image is within the range of [0, 2%) or [30%,1], judging that the second face image meets the requirement of the preset change rate, and taking the second face image as a third face image; otherwise, judging that the preset change rate requirement is not met so as to delete the second face image.
4. The lip recognition method of any one of claims 1-3, wherein the extracting a mel-frequency spectrogram of the third face image comprises:
improving the text-to-speech synthesis model Tacotron2, which is based on LSTM units and Attention units, into a visual speech synthesis model, and extracting the Mel spectrogram of the third face image;
the visual speech synthesis model comprises a visual feature extraction module of multichannel attention, a time sequence feature extraction module of a bidirectional LSTM unit, a semantic feature extraction module of position sensitive attention, a Mel frequency spectrum decoder and a WaveNet vocoder.
5. The lip language identification method of any one of claims 1 to 3, wherein the lip language identification method further comprises:
inputting the visual voice output by the WaveNet vocoder into a visual voice recognition model to convert the visual voice into text output by using the visual voice recognition model;
the visual speech recognition model is formed by connecting a Mel frequency spectrum encoder improved by a VGG network with a CTC algorithm.
6. A lip language recognition apparatus, comprising:
the acquisition module is used for acquiring a first face image;
the blurring processing module is used for blurring processing the first face image according to a preset blurring algorithm to obtain a second face image;
The first computing module is used for computing the respective face ambiguity based on the blurred second face images;
the second calculation module is used for calculating the face ambiguity change rate based on the adjacent second face images;
the screening module is used for screening the second face images meeting the requirement of the preset change rate and taking the second face images as third face images;
the feature extraction module is used for extracting a Mel spectrogram of the third face image;
the recognition module is used for inputting the Mel spectrogram into a WaveNet vocoder, and synthesizing visual voice by using the WaveNet vocoder to realize lip language recognition;
the first calculating module is further configured to calculate respective face ambiguities by adopting the following formula:
CI represents the face ambiguity, J represents the upper limit of the resolution level, and PQ_j represents the blur quality factor at resolution level j;
wherein PQ_j is defined by the following formula:
wherein PNB_j(x_fu, y_fu) represents the blur state of the pixel point (x_fu, y_fu), and the sum of PNB_j(x_fu, y_fu) over all pixels represents the total number of blurred pixels of the image;
PNE_j(x_fu, y_fu) represents the clear state of the pixel point (x_fu, y_fu), and the sum of PNE_j(x_fu, y_fu) over all pixels represents the total number of clear pixels of the image;
the second calculating module is further configured to calculate a face ambiguity rate according to the following formula:
wherein the initial-image sequence number and the end-image sequence number delimit the frames over which the face ambiguity change rate is calculated, ξ represents the perceptible edge break value constant, the remaining index represents a loop variable over the frames, CJ represents the face ambiguity change rate, and CI represents the face ambiguity.
7. A chip, comprising: a first processor for recalling and running a computer program from memory, causing a device on which the chip is mounted to perform the steps of the lip language recognition method of any one of claims 1 to 5.
8. A terminal, comprising: a memory, a second processor and a computer program stored in the memory and executable on the second processor, characterized in that the second processor implements the steps of the lip language recognition method according to any one of claims 1 to 5 when executing the computer program.
CN202311327121.2A 2023-10-13 2023-10-13 Lip language identification method, device, chip and terminal Active CN117292437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311327121.2A CN117292437B (en) 2023-10-13 2023-10-13 Lip language identification method, device, chip and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311327121.2A CN117292437B (en) 2023-10-13 2023-10-13 Lip language identification method, device, chip and terminal

Publications (2)

Publication Number Publication Date
CN117292437A CN117292437A (en) 2023-12-26
CN117292437B true CN117292437B (en) 2024-03-01

Family

ID=89238851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311327121.2A Active CN117292437B (en) 2023-10-13 2023-10-13 Lip language identification method, device, chip and terminal

Country Status (1)

Country Link
CN (1) CN117292437B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN112784696A (en) * 2020-12-31 2021-05-11 平安科技(深圳)有限公司 Lip language identification method, device, equipment and storage medium based on image identification
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network

Also Published As

Publication number Publication date
CN117292437A (en) 2023-12-26

Similar Documents

Publication Publication Date Title
CN111325817B (en) Virtual character scene video generation method, terminal equipment and medium
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN111370020A (en) Method, system, device and storage medium for converting voice into lip shape
CN1860504A (en) System and method for audio-visual content synthesis
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN109817223A (en) Phoneme notation method and device based on audio-frequency fingerprint
Braga et al. Best of both worlds: Multi-task audio-visual automatic speech recognition and active speaker detection
CN112259086A (en) Speech conversion method based on spectrogram synthesis
CN117292437B (en) Lip language identification method, device, chip and terminal
Yang et al. Approaching optimal embedding in audio steganography with GAN
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN113450824B (en) Voice lip reading method and system based on multi-scale video feature fusion
ElMaghraby et al. Noise-robust speech recognition system based on multimodal audio-visual approach using different deep learning classification techniques
Li et al. Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN.
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
Mohammadi et al. Speech recognition system based on machine learning in persian language
Shen Application of transfer learning algorithm and real time speech detection in music education platform
CN115116470A (en) Audio processing method and device, computer equipment and storage medium
Ma et al. Dynamic Sign Language Recognition Based on Improved Residual-LSTM Network
Deshmukh et al. Vision based Lip Reading System using Deep Learning
Jeon et al. Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems
CN117152317B (en) Optimization method for digital human interface control
Kim et al. Deep video inpainting guided by audio-visual self-supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant