CN111898420A - Lip language recognition system - Google Patents

Lip language recognition system

Info

Publication number
CN111898420A
CN111898420A
Authority
CN
China
Prior art keywords
video
lip
lip language
human
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010556817.2A
Other languages
Chinese (zh)
Inventor
鲁远耀
李宏波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010556817.2A priority Critical patent/CN111898420A/en
Publication of CN111898420A publication Critical patent/CN111898420A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language recognition system comprising a human-computer interaction interface and an algorithm module, the two being connected through a signal-slot mechanism. The human-computer interaction interface is used to acquire a video to be recognized; the algorithm module performs lip language recognition on the video to obtain a recognition result; and the human-computer interaction interface further displays the recognition result and presents the recognition process in time order. The lip language recognition system designed by the invention makes it easier to observe and analyze the quality of the model's performance and every link in the chain from the original video to the final recognition result, so that the model and algorithm can be improved and optimized.

Description

Lip language recognition system
Technical Field
The invention relates to the field of lip language identification, in particular to a lip language identification system.
Background
Lip language recognition computationally infers the content a speaker is expressing from the dynamic shape changes of the speaker's lips. Lip language recognition can solve the task of recognizing what a speaker expresses in noisy environments, or even in scenes without any audio capture device, so it can be applied in many scenarios across different fields; automatic lip reading can be widely used in virtual reality systems, information security, speech recognition, driver assistance systems and other fields. With the development of the Internet of Things and 5G technology, lip language recognition will also find applications in smart homes, intelligent driving, intelligent communication and similar fields.
Traditional lip language recognition is based on image processing and pattern recognition to go from a sequence of images to a recognition result, so the recognition task can be divided into lip region localization, feature extraction from the lip region of interest, and recognition of lip language content. Among these, feature extraction and classifier design are the key and difficult points of the recognition process. Commonly used transform methods include Principal Component Analysis (PCA), Discrete Cosine Transform (DCT), Singular Value Decomposition (SVD) and Independent Component Analysis (ICA) from signal processing. Recognition of lip language content is mainly completed by manually designed classifiers, which can generally be divided into Hidden Markov Models (HMM), Artificial Neural Networks (ANN), Template Matching (TM), Support Vector Machines (SVM) and the like. Current research on lip language recognition focuses mainly on how to realize recognition and how to improve the accuracy of existing recognition pipelines. During the recognition process, researchers cannot see intuitively which stage causes large recognition errors; they usually need a large number of experiments to locate the error-producing stage by elimination before the cause can be addressed. This process is not only inefficient, but also fails to provide an intuitive view of the lip language recognition process.
Disclosure of Invention
In order to solve the above-mentioned deficiency existing in the prior art, the invention provides a lip language identification system, comprising: a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal slot;
the human-computer interaction interface is used for acquiring a video to be identified;
the algorithm module is used for carrying out lip language identification on the video to be identified to obtain a lip language identification result;
and the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process according to time sequence.
Preferably, the algorithm module comprises:
the fixed frame extraction submodule is used for extracting a video frame to be processed from a video to be identified based on a semi-random fixed frame extraction strategy;
the segmentation submodule is used for segmenting the lip image from the processed video frame to obtain a lip data set;
and the recognition submodule is used for recognizing each lip image in the lip data set based on the designed model to obtain a lip language recognition result.
Preferably, the identifier module includes:
the characteristic extraction unit is used for carrying out characteristic extraction on the lip images to obtain image characteristics, and carrying out slicing operation on the convolutional-layer lip images and the image characteristics to obtain visualized convolutional-layer lip images and visualized high-dimensional image characteristics;
the time sequence feature extraction unit is used for extracting time sequence features from the image features to obtain sequence features, and performing slicing operation on the sequence features to obtain visual sequence features;
and the classification unit is used for classifying the extracted time sequence characteristics to obtain a lip language identification result.
Preferably, the feature extraction unit is a convolutional neural network CNN.
Preferably, the time sequence feature extraction unit is a recurrent neural network RNN.
Preferably, the classification unit is a softmax classifier.
Preferably, the human-computer interaction interface includes:
selecting a video option for acquiring a video to be identified when triggered;
the identification video option is used for carrying out lip language identification on the video to be identified when triggered, so as to obtain an identification result;
a visualization option for displaying the lip language recognition process and the lip language recognition result based on a configuration file set by visualization requirements when triggered;
the display content comprises video frames to be recognized, lip images obtained by segmenting the video frames, visualized convolutional-layer lip images, visualized high-dimensional image features, visualized sequence features and/or at least one recognition result corresponding to the video to be recognized.
Preferably, the fixed frame extraction sub-module is specifically configured to:
determining a fixed frame number to be extracted based on a priori condition;
dividing the video to be identified into a plurality of area blocks according to the total number of video frames;
wherein the frame range covered by each area block is equalized as far as possible.
Preferably, the video to be identified includes:
and acquiring the video to be identified for the same target object based on at least one acquisition device.
Preferably, the man-machine interface is designed and built through a PyQt5 framework.
Compared with the prior art, the invention has the beneficial effects that:
the lip language recognition system provided by the invention comprises a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal slot; the human-computer interaction interface is used for acquiring a video to be identified; the algorithm module is used for carrying out lip language identification on the video to be identified to obtain a lip language identification result; the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process according to time sequence, so that the performance of the algorithm module can be better observed and analyzed and each link from the video to be recognized to the final recognition process can be better observed, and the algorithm module is improved and optimized.
Drawings
FIG. 1 is a schematic diagram of a lip language identification system according to the present invention;
FIG. 2 is a schematic diagram of the architecture and algorithm flow of the lip language identification system of the present invention;
FIG. 3 is a functional diagram of the lip language identification system according to the present invention;
FIG. 4 is a schematic diagram of a model architecture of the lip reading recognition system according to the present invention;
FIG. 5 is a diagram of the LSTM basic unit structure in the present invention;
FIG. 6 is a schematic diagram illustrating the probability obtained by the SOFTMAX classifier of the present invention;
FIG. 7 is a human-computer interface of the video lip language recognition system of the present invention;
FIG. 8 is a schematic diagram illustrating selection of a selected video in an operating human-machine interface according to the present invention;
FIG. 9 is a schematic diagram of a system for identifying video results display in accordance with the present invention;
FIG. 10 is a diagram illustrating a visualization result on a human-computer interface according to the present invention;
FIG. 11 is a graph illustrating Loss curves at different times in the present invention;
FIG. 12 is a graph illustrating accuracy curves at different times in accordance with the present invention;
FIG. 13 is a schematic diagram of each independent digital pronunciation video clip Recall according to the present invention.
Detailed Description
For a better understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings and examples.
The application scenarios of lip language recognition algorithms are wide, for example: comparing models in scientific research, assisting pronunciation training, compensating for the limitations of auditory speech recognition, and supporting the further development of biometric and XR technologies. At present, research on lip language recognition has not produced an engineered lip language feature recognition system, nor is there an analysis tool for feature visualization. To overcome the current lack of a research tool that can analyze image features and video time-series features, the invention provides a lip language recognition system that offers computer vision researchers a well-visualized view of the feature engineering involved: it allows the feature engineering of the convolution stage to be observed and supervised in depth, and also shows how the time-series features of the recurrent neural network change, bringing great convenience to subsequent researchers in feature engineering design and enabling more effective feature extraction.
As shown in fig. 1, the present invention provides a lip language identification system, which includes: a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal slot;
the human-computer interaction interface is used for acquiring a video to be identified;
the algorithm module is used for carrying out lip language identification on the video to be identified to obtain a lip language identification result;
and the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process according to time sequence.
The lip language recognition system uses the PyQt5 framework to design and build the human-computer interaction interface, and the algorithm module uses Qt multithreading to handle information processing and computation. The requirements of the lip language recognition system are defined as visualizing image features, visualizing time-series features, and displaying the other stages of the process; in addition, the Top-3 accuracy is provided so that words with similar, easily confused pronunciations can be analyzed and the data and scheme optimized accordingly, preventing such risk points from affecting the progress and results of lip language recognition research. In this embodiment, an exception handling mechanism is designed for crashes, exceptions and other situations that affect the normal operation of the program: exceptions are caught, their causes and details are raised and recorded, the cause is located quickly, and repair efficiency is improved.
In this technical scheme, a highly cohesive, loosely coupled and efficient lip language recognition system is constructed in the form of modularized interface code. The lip language recognition system runs on platforms such as Ubuntu, Mac and Windows, can recognize lip language in videos, and is particularly effective on isolated digit pronunciations.
Fig. 2 shows the overall recognition flow of the system. The front-end human-computer interaction interface and the back-end algorithm module are connected through signal slots, and loading the video to be recognized and retrieving the display content are realized through interface calls. The video to be recognized passes through the decomposition algorithm and then through the designed algorithm model, and the final result is transmitted over the signal slot and fed back to each front-end component for display. Fig. 7 shows the human-computer interaction interface of the lip language recognition system: it includes the options to select a video, to visualize, and to recognize the video, a display box for the lip language recognition result, a display box for the Top-3 accuracy, a face feature display area, a mouth feature display area, display areas for convolutional layer 1 and convolutional layer 2, an image feature display area, and a sequence feature display area. A visualization result for one embodiment is shown in fig. 10.
All feature processes displayed on the front-end human-computer interaction interface can be configured through configuration files, meeting the goal of a highly available, highly customizable and general system design. The lip language recognition system provided by the invention not only helps beginners understand the convolution computation process, but also offers researchers new ideas and methods for handling the feature engineering of lip language recognition.
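For readers unfamiliar with the signal-slot mechanism, the sketch below shows, under stated assumptions, how a PyQt5 front end can hand a video off to a back-end Qt thread and receive the result through a signal. The class names, signal name and file path are illustrative and are not taken from the patent.

```python
# Minimal PyQt5 sketch: UI button -> worker thread -> result signal -> UI label.
import sys
from PyQt5.QtCore import QThread, pyqtSignal
from PyQt5.QtWidgets import QApplication, QLabel, QPushButton, QVBoxLayout, QWidget


class RecognitionWorker(QThread):
    """Back-end algorithm thread; emits the result back to the UI via a signal."""
    result_ready = pyqtSignal(str)               # carries the predicted word

    def __init__(self, video_path):
        super().__init__()
        self.video_path = video_path

    def run(self):
        # Placeholder for frame extraction, lip segmentation and CNN-LSTM inference.
        prediction = "Six"                       # hypothetical result
        self.result_ready.emit(prediction)


class MainWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.label = QLabel("Result: -")
        button = QPushButton("Recognise video")
        button.clicked.connect(self.start_recognition)       # UI signal -> slot
        layout = QVBoxLayout(self)
        layout.addWidget(button)
        layout.addWidget(self.label)

    def start_recognition(self):
        self.worker = RecognitionWorker("test/six.avi")       # illustrative path
        self.worker.result_ready.connect(                     # worker signal -> UI slot
            lambda word: self.label.setText(f"Result: {word}"))
        self.worker.start()


if __name__ == "__main__":
    app = QApplication(sys.argv)
    win = MainWindow()
    win.show()
    sys.exit(app.exec_())
```

Keeping the inference inside a QThread and returning only a signal mirrors the loose coupling described above: the front end never blocks on the back-end model.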
The lip language recognition system provided by the invention is highly cohesive, loosely coupled, highly available and highly customizable. The front end and the back end are designed as modular interfaces, so a partial error cannot paralyze the whole system and force a rebuild, and the system's model interface supports iterative updates. As research on lip language recognition deepens, the system can continue to be used simply by replacing the model parameters and the inference procedure; moreover, the back-end model only exposes an interface that returns results to the front end, giving the system high separability and high usability. Both the image feature and time-series feature visualizations are highly customizable and can display any desired intermediate result. As shown in fig. 3, the content displayed on the human-computer interaction interface covers intermediate feature extraction, lip localization and other functions, and easily confused similar pronunciations can be analyzed, laying a solid foundation for subsequent research.
The lip language recognition system provided by the invention is divided into six parts: the human-computer interaction interface (also called the user interface), lip segmentation, CNN feature extraction, RNN time-series feature extraction, fully connected classification, and result display.
The user interface is based on the Python version of the Qt5 framework (PyQt5), and the variables produced by algorithm inference are passed between front end and back end through signal slots. The method used for lip segmentation is the Dlib 68-point face landmark library, which locates the position of the lips well; the segmentation operation is then performed using the following formulas (1) and (2).
[Equations (1) and (2) are reproduced only as images in the original document; they define the mouth bounding box from the quantities below.]
The lip center position is calculated from the lip landmark points and denoted (x_0, y_0); let w and h denote the width and height of the mouth image, and let L_1 and L_2 denote the left/right and upper/lower dividing lines surrounding the mouth, respectively. The bounding box of the mouth is calculated from these quantities according to formulas (1) and (2).
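A minimal sketch of this segmentation step is given below, assuming Dlib's 68-point face landmark model (mouth points 48-67) as stated above. The 224x224 output size follows the later description, while the function name and the padding factor are illustrative assumptions rather than the patent's exact formulas (1) and (2).

```python
# Sketch: locate the mouth with Dlib landmarks and crop a box around the lip centre.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def crop_mouth(frame, pad=1.2, out_size=224):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])

    x0, y0 = pts.mean(axis=0)                      # lip centre (x_0, y_0)
    w = pts[:, 0].max() - pts[:, 0].min()          # mouth width
    h = pts[:, 1].max() - pts[:, 1].min()          # mouth height

    # Dividing lines L_1 (left/right) and L_2 (top/bottom) around the centre.
    x1, x2 = int(x0 - pad * w / 2), int(x0 + pad * w / 2)
    y1, y2 = int(y0 - pad * h / 2), int(y0 + pad * h / 2)
    mouth = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(mouth, (out_size, out_size))  # standard 224x224 CNN input
```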
The whole video is lip-read by extracting a fixed number of frames; the frame-extraction algorithm is given by formulas (3) and (4):
[Equations (3) and (4) are reproduced only as images in the original document.]
Here x denotes the number of blocks the video v is divided into, A_n denotes the n-th area block, and F denotes the index of the frame finally taken from each block.
In this embodiment, a Semi-random Fixed Frame Extraction Strategy (SFFES) is adopted to extract fixed frames from the video. Extensive experiments show that the SFFES designed by the invention is flexible, has low time complexity and is resistant to interference. The idea of the strategy is to divide the video into area blocks according to the total number of video frames. Specifically, given that the number of fixed frames to be extracted under the known prior condition is n, the ranges of the area blocks are made as equal as possible; the number x of remaining frames is no greater than the number n of area blocks, and the ranges of the first x area blocks are each widened by 1. At this point the range of every area block has completed its semi-random allocation.
This calculation is robust, computationally efficient and produces highly consistent feature vectors, and it removes redundant information well.
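The SFFES description above maps directly onto a short routine. Below is a minimal sketch under the stated block-partition rule; the function name and the seed argument are illustrative, and drawing one frame uniformly at random per block is an assumption about how the "semi-random" draw is made.

```python
# Sketch of semi-random fixed-frame extraction: n near-equal blocks, one frame per block.
import random


def semi_random_fixed_frames(total_frames, n=10, seed=None):
    """Return n frame indices drawn semi-randomly from a video with total_frames frames."""
    rng = random.Random(seed)
    base, x = divmod(total_frames, n)      # base block size, x leftover frames
    indices, start = [], 0
    for block in range(n):
        size = base + (1 if block < x else 0)   # first x blocks are 1 frame larger
        indices.append(rng.randrange(start, start + size))
        start += size
    return indices


# Example: pick 10 frames from a 27-frame pronunciation clip.
print(semi_random_fixed_frames(27, n=10, seed=0))
```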
After mouth segmentation is completed, the resulting raw lip data set is processed into standard 224x224-pixel images. The overall structure of the model is shown in fig. 4. The model designed in this embodiment is only one example that completes a basic lip-reading recognition function; a model for lip language recognition can be designed as needed in practical applications. In the architecture designed in this embodiment, feature extraction is performed by VGG16, generating a 4096x1 feature vector at each time step; the sequence length is 10, so the feature vector of the whole action has size 4096x10. The features are then fed into an LSTM for time-series feature extraction, and finally classification is performed by two fully connected layers and softmax on the output of the last LSTM unit. The Top-3 accuracy is obtained by taking the three largest probability values of the final softmax result, and the intermediate feature-extraction results and visualization images are obtained by slicing the intermediate-layer outputs of the model.
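As a companion to this description, the following is a hedged PyTorch sketch of a VGG16 + LSTM + fully connected classifier of the kind described. The 4096-dimensional per-frame feature, the 10-frame sequence and the 10 digit classes come from the text; the hidden sizes, class and function names, and the use of torchvision are illustrative assumptions, not the patented model.

```python
# Sketch: per-frame VGG16 features -> LSTM over 10 frames -> FC layers -> class logits.
import torch
import torch.nn as nn
from torchvision import models


class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, hidden=256):
        super().__init__()
        vgg = models.vgg16(weights=None)
        # Drop the final 1000-way layer so each frame yields a 4096-d vector.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.frame_encoder = vgg
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                        nn.Linear(128, num_classes))

    def forward(self, clips):                                 # clips: (batch, 10, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.frame_encoder(clips.flatten(0, 1))       # (b*t, 4096)
        feats = feats.view(b, t, -1)                          # (b, 10, 4096)
        _, (h_n, _) = self.lstm(feats)                        # hidden state of last time step
        return self.classifier(h_n[-1])                       # class logits


logits = LipReadingNet()(torch.randn(1, 10, 3, 224, 224))
top3 = torch.topk(torch.softmax(logits, dim=1), k=3)          # Top-3 probabilities
```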
Because a conventional RNN suffers from the long-term dependency problem, the amount of retained information gradually decays as time increases during iteration, so outputs separated by long time intervals have very little influence on the hidden-layer input at the current moment. Conventional RNNs are therefore only suitable for short time series; when the sequence spans a long interval, an RNN has difficulty expressing the implicit correlation between sequence elements.
To address the gradient vanishing and gradient explosion that easily occur in the hidden layer when an RNN processes long sequences, this embodiment adopts the Long Short-Term Memory (LSTM) structure, which is designed to handle the loss of information under long-term sequence dependencies. LSTM stores history information by introducing memory cells, and controls the addition and removal of information in the network through three gate structures: the input gate, the forget gate and the output gate. It remembers the relevant information that needs to be kept across long sequences and forgets partially useless information, in order to better discover and exploit long-term dependencies in sequence data (e.g. video, audio and text). Fig. 5 shows the operations performed within a single LSTM cell, where x_t denotes the input vector of the network node at time t, h_t the output vector at time t, and i_t, f_t, o_t and c_t the input gate, forget gate, output gate and memory cell at time t, respectively.
The input gate, forget gate, memory cell and output gate inside the LSTM cell will be described separately below:
(1) Input gate: this gate controls the input of node information. The input consists of two parts: a sigmoid activation function first determines which new information should be admitted, and a tanh function then produces the candidate information to be stored in the cell. The output of the input gate i_t and the candidate information g_t are given by:
i_t = σ(U_i x_t + W_i h_{t-1} + b_i)   (5)
g_t = tanh(U_g x_t + W_g h_{t-1} + b_g)   (6)
where U_i, W_i and b_i denote the weights and bias of the input gate, U_g, W_g and b_g denote the weights and bias of the candidate state, σ denotes the sigmoid activation function, and tanh denotes the hyperbolic tangent activation function.
(2) Forget gate: this gate controls which information the current LSTM cell discards. The sigmoid activation function outputs a value between 0 and 1: the closer the value is to 1, the more useful information the node at the current time contains and the more information is retained for the next time step; the closer the value is to 0, the less useful information the node contains and the more information is discarded. The forget gate f_t is given by:
f_t = σ(U_f x_t + W_f h_{t-1} + b_f)   (7)
where U_f, W_f and b_f denote the weights and bias of the forget gate, and σ denotes the sigmoid activation function.
(3) Memory cell: this unit stores state information and updates the state. The memory cell c_t is given by:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (8)
where ⊙ denotes the Hadamard (element-wise) product.
(4) Output gate: this gate controls the output of node information. A sigmoid function first determines which information should be output, giving the initial output value o_t; the tanh function then squashes c_t into the interval (-1, 1), and the result is multiplied element-wise by o_t to produce the LSTM cell output h_t. Thus h_t is determined jointly by o_t and the memory cell c_t. The expressions for o_t and h_t are:
o_t = σ(U_o x_t + W_o h_{t-1} + b_o)   (9)
h_t = o_t ⊙ tanh(c_t)   (10)
where U_o, W_o and b_o denote the weights and bias of the output gate, respectively.
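Equations (5)-(10) can be checked numerically with a few lines of code. The following NumPy sketch implements one LSTM time step exactly as written above; the tensor dimensions, zero-initialized biases and random weights are illustrative assumptions.

```python
# Sketch: one LSTM cell update following equations (5)-(10).
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update. p holds the U, W, b parameters for each gate."""
    i_t = sigmoid(p["Ui"] @ x_t + p["Wi"] @ h_prev + p["bi"])   # input gate, eq. (5)
    g_t = np.tanh(p["Ug"] @ x_t + p["Wg"] @ h_prev + p["bg"])   # candidate state, eq. (6)
    f_t = sigmoid(p["Uf"] @ x_t + p["Wf"] @ h_prev + p["bf"])   # forget gate, eq. (7)
    c_t = f_t * c_prev + i_t * g_t                              # memory cell, eq. (8)
    o_t = sigmoid(p["Uo"] @ x_t + p["Wo"] @ h_prev + p["bo"])   # output gate, eq. (9)
    h_t = o_t * np.tanh(c_t)                                    # hidden output, eq. (10)
    return h_t, c_t


# Tiny example with a 4-dimensional input and a 3-unit hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {}
for gate in "igfo":
    params[f"U{gate}"] = rng.standard_normal((d_h, d_in)) * 0.1
    params[f"W{gate}"] = rng.standard_normal((d_h, d_h)) * 0.1
    params[f"b{gate}"] = np.zeros(d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), params)
```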
Logistic regression, SVM and the like generally solve two-class problems; multi-class problems can also be handled by combining several binary classifiers. From a mathematical point of view, softmax is the better choice for mutually exclusive classes, whereas non-mutually-exclusive classes are better handled by combined classifiers such as logistic regression or SVM. With softmax, the probability that a sample x belongs to class j can be expressed as:
P(y = j | x; w) = exp(w_j^T x) / Σ_{k=1}^{K} exp(w_k^T x)   (11)
where j = 1, 2, ..., K. The loss function of softmax can be expressed as:
L(w) = -Σ_{k=1}^{K} p_k log P(y = k | x; w)   (12)
where K is the number of classes, p ∈ {0,1}^K is the one-hot label vector, and w denotes the network weights.
Generally, after the input features pass through the feature processing layers, their probability distribution is obtained by a softmax classifier. As shown in fig. 6, the probabilities of the three categories are [0.88, 0.10, 0.02], which sum to 1; the categories are mutually exclusive events.
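The softmax probability in equation (11) and the Top-3 readout used by the interface can be illustrated with a small numerical example. The logits below are made-up values for the ten digit classes, not results from the patent's experiments.

```python
# Sketch: softmax over ten digit classes and a Top-3 readout.
import numpy as np


def softmax(scores):
    e = np.exp(scores - scores.max())     # subtract the max for numerical stability
    return e / e.sum()


labels = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
logits = np.array([2.1, -0.3, 0.2, -1.0, 0.4, 0.1, 5.3, -0.8, 0.0, -0.5])
probs = softmax(logits)                   # probabilities sum to 1
top3 = np.argsort(probs)[::-1][:3]        # indices of the three largest probabilities
for idx in top3:
    print(f"{labels[idx]}: {probs[idx]:.2%}")
```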
As shown in fig. 7, the human-computer interaction interface of the lip language recognition system first provides display of the Top-3 recognition results and their probabilities; second, it displays the video frames to be recognized; it then displays the feature extraction and time-series feature extraction stages of the lip-reading recognition process, i.e. it offers CNN feature visualization and RNN feature visualization, through which the change of the feature vectors during feature extraction can be observed; finally, the lip segmentation process of the video to be recognized can be viewed, which makes the system a useful tool for analyzing problems encountered during model verification. In this embodiment, the CNN feature visualization includes the lip images of each convolutional layer and the high-dimensional image features, and the RNN feature visualization includes the visualized sequence features.
In this embodiment, videos of the same target object are captured from the same angle with different camera devices. The lip language recognition system provided by this application recognizes the videos captured by the different cameras and displays the captured videos, the lip segmentation process and the recognition results on the human-computer interaction interface. On this basis it can be determined whether the lip regions segmented from different cameras are inaccurate or offset, and by comparing the recognition results and capture accuracy of the different cameras and ranking them, researchers are given a basis for selecting camera capture devices.
This embodiment also provides a way to analyze and predict the various problems a model may encounter when the same camera captures the same target object from different angles. The lip language recognition system provided by this application recognizes the videos captured by the same camera at different angles and displays the captured videos, the lip segmentation process and the recognition results on the human-computer interaction interface. On this basis, the accuracy of the recognition results can be determined by comparing the displayed results of videos shot from different angles, and the shooting angle most favorable for recognition can be determined through the lip language recognition system.
The invention provides a specific embodiment in which the lip language recognition system is used to obtain test results and those results are analyzed, as follows:
the data set in this example was built up in a voice database, containing 10 independent english numerical pronunciations (from 0 to 9) with 6 different target objects (3 males and 3 females). The number of pronunciations per target object per word is up to 100. Videos are collected from the frontal perspective of each target object that may sit naturally without any action. The original size of each frame picture is 1920 x 1080 resolution, about 25 frames per second. To accurately locate the beginning and end of each pronunciation unit, each pronunciation word is separated using audio as an aid, each word lasting approximately 1 second. Then, each isolated word video is further extracted to a fixed length of 10 frames, and after processing each video frame, a 224 × 224 pixel image is obtained as a standard value for CNN model input.
The lip language recognition system aims to recognize the content of lip language in video. Its main functions are the functional description of the human-computer interaction interface, the display of the Top-3 accuracy of the prediction result, and the visualization of the algorithm inference process. The Top-3 accuracy obtained during prediction makes it possible to observe which pronunciation videos in the inference results are similar and easily confused. In the visualization, the feature changes and learning behaviour of the CNN and RNN at each stage can be observed intuitively. The system makes it easier to observe the intermediate stages of the model during inference and prediction, facilitating subsequent in-depth research and optimization of the algorithm model.
As shown in fig. 7, the human-computer interaction interface appears after the lip language recognition system is started. It has three function buttons: select video, visualize, and recognize video. To the right of the function buttons is the final result of the recognized video, and next to that is the Top-3 accuracy and result box; the prediction result is displayed on an LCD-style numeric widget. The intermediate steps of the algorithm inference are displayed below the interface: the fixed-length video frames extracted, the localization and segmentation of the mouth position, and the CNN visualization and dynamic RNN visualization for each time step.
As shown in fig. 8, after entering the human-computer interaction interface, clicking the "select video" button pops up a folder selection dialog. The folder containing the video to be recognized is selected; it defaults to the test folder under the current project, and otherwise to the current folder. After the video to be recognized is selected and confirmed, the folder window closes automatically, fixed-length frames are extracted, and the background progress is shown through the back-end window.
Taking a video of the digit pronunciation "Six" as an example: after the video is loaded, prediction and visualization of the inference process can be carried out, and clicking "recognize video" starts recognition. As shown in fig. 9, the recognition result of the video is the digit pronunciation "Six"; in the Top-3 list the probability of "Six" is 95.96%, the probability of "Zero" is 2.13%, and the probability of "Five" is 0.63%. The Top-3 probabilities do not sum to 1 because the classes are mutually exclusive and every class receives a probability value representing its likelihood.
As shown in fig. 10, triggering the "visualization" option of the human-computer interaction interface displays the result after about 1 second of waiting. Through the visualization, the output of the intermediate convolutional feature-extraction layers can be inspected, and the changes of the image features and time-series features can be observed, so the causes of poor recognition in the fused neural network can be analyzed.
The video frames, convolutional-layer lip images, high-dimensional image features and/or visualized sequence features displayed on the human-computer interaction interface form a continuous process that advances in time order. For example, in fig. 10, the first row shows the fixed video frames extracted from the video to be recognized; the second row shows the lip regions segmented from the frames in the first row; the third, fourth and fifth rows show the intermediate stages of feature extraction on the lip regions (the third row being the lip images of convolutional layer 1, the fourth row the lip images of convolutional layer 2, and the fifth row the high-dimensional lip image features); and the sixth row shows the sequence features produced during time-series feature extraction. The lip language recognition process taking place inside the model can thus be observed through the visual display. Displaying the recognition processes of different models helps determine efficiently which model has higher recognition accuracy, and displaying the recognition results of the same model under different parameters helps determine the model parameters quickly.
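The text does not spell out how these intermediate "slices" are captured. The following is a hedged PyTorch sketch that uses forward hooks to collect the feature maps of chosen convolutional layers for display; the use of torchvision's VGG16 and the layer names "features.0" and "features.2" are assumptions for illustration, not the patented code.

```python
# Sketch: capture intermediate convolutional feature maps with forward hooks.
import torch
from torchvision import models


def register_feature_taps(model, layer_names):
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name] = output.detach().cpu()   # keep a copy for display
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            module.register_forward_hook(make_hook(name))
    return captured


# Example: tap the first two convolutional layers of a VGG16 frame encoder.
model = models.vgg16(weights=None)
taps = register_feature_taps(model, {"features.0", "features.2"})
model(torch.randn(1, 3, 224, 224))
print({k: v.shape for k, v in taps.items()})         # feature maps ready to visualise
```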
The effect of the activation function can be seen clearly in the visualizations of the image features and the sequence features: it reduces the feature dimensionality for subsequent computation and thereby lowers the demand on computing power. Meanwhile, during RNN time-series inference, the features are indistinct at the start and become sharply polarized towards the end, showing that the model used by the system fits well and can meet the requirement of recognizing lip-read pronunciations visually. The lip language recognition system displays the intermediate stages of both the CNN and RNN models and completes the lip segmentation work, so it can serve well as background and support for applying deep learning theory to lip language recognition.
The specific display content and display mode on the human-computer interaction interface can be set in the configuration file according to the visualization requirements. In addition, the human-computer interaction interface offers switching between Chinese and English, giving the system better interaction and greater convenience for more users.
After the interactive interface and system functions are completed, attention must be paid to whether the model converges during training. When the hyperparameter values are set improperly or the model is constructed unreasonably, the model is hard to converge and therefore does not work. The loss function curves of the training set and the test set at different stages are therefore recorded during training; from these curves it can be determined whether the proposed algorithm model is able to learn the characteristics of the data set, and convergence can thus be evaluated.
To observe the trend of the loss, 70 epochs of the experiment were recorded, with results logged once per epoch. Fig. 11 shows the Loss function curve at different stages, where 1 epoch denotes one pass of training over the whole data set; generally, most data sets converge after about 10 epochs, and continued training may cause overfitting.
As can be seen from fig. 11, the model stabilizes after about 15 epochs, at which point it has reached its optimal solution; it continues to oscillate during subsequent training, indicating that the limit of what the model can learn has been reached. Because the parameters are updated and iterated on the training set, the training-set loss is somewhat smaller than the validation-set loss, and both losses gradually converge as the number of training iterations grows, showing that the data set has no anomalies and that the model performs well on lip language recognition. This verifies that both the data set and the model are workable.
This embodiment also proposes an attention-based CNN-LSTM model, whose performance was further tested on the test set after the model was verified to have converged. Fig. 12 shows the recognition accuracy curve, recorded once per iteration, with the ordinate being the recognition rate (%). To measure the improvement in performance, the plain CNN-LSTM network model was compared in the experiments, so the gain from introducing the attention mechanism can be obtained through controlled variables. The overall training trend matches the trend of the Loss function and stabilizes after about 15 epochs, showing that the parameters keep approaching the model's optimal solution as they are updated during training and reach it at that point. Meanwhile, the accuracy of the baseline CNN-LSTM model is clearly inferior to that of this network, showing that the attention mechanism understands the important key frames of the video well and suppresses the noise of the image sequence. The attention mechanism therefore brings good improvement and robustness to model performance.
It can be seen from fig. 12 that the overall performance of the attention-based method in this embodiment is better than that of the plain fused neural network, and the accuracy increases markedly as training continues. Because the method must learn the distribution of attention weights, the accuracy fluctuates noticeably at the beginning of training, indicating that the attention weights have not yet been learned and the model parameters still need training. Finally, at about 15 epochs, the method essentially completes training, and its overall performance exceeds that of the general fused neural network.
The experimental results show that the attention-based CNN-LSTM model outperforms the plain CNN-LSTM model on every digit pronunciation. Fig. 13 compares the results (Recall) for each isolated pronunciation: "Two", "Four" and "Nine" are improved significantly, indicating that noise is more likely to appear in videos of monosyllabic words and degrade recognition, and that model performance improves greatly once video noise is reduced. The pronunciations "Five" and "One", which involve complicated lip movements, show no obvious improvement, indicating that their video noise and spatio-temporal features are difficult to account for and to learn. The lip movement of "Zero" is small and tongue movement is the key factor in its pronunciation, so it is difficult for the model to predict.
The lip language recognition system provided by the invention makes it easier to observe and analyze the quality of the model's performance and every link in the chain from the original video to the final recognition result, so that the model and algorithm can be improved and optimized. The lip language recognition system therefore has important practical significance. On the other hand, the invention approaches the development of lip language recognition from the perspective of lightweight models: by reducing the convolutional and fully connected structures, the computation of the model is greatly reduced, which in turn reduces the dependence on the GPU and lowers hardware cost and requirements.
Through the above specific embodiment, the effectiveness of the algorithm model was verified with test experiments on the test data set. The experimental results show that the video lip language recognition system designed by the invention is highly feasible and usable, that the proposed CNN- and attention-based RNN fused neural network model is efficient and feasible, and that the proposed algorithm achieves higher accuracy than other algorithm models in prediction and recognition.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments; any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention fall within the scope of the claims of the present application.

Claims (10)

1. A lip language identification system, comprising: a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal slot;
the human-computer interaction interface is used for acquiring a video to be identified;
the algorithm module is used for carrying out lip language identification on the video to be identified to obtain a lip language identification result;
and the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process according to time sequence.
2. The system of claim 1, wherein the algorithm module comprises:
the fixed frame extraction submodule is used for extracting a video frame to be processed from a video to be identified based on a semi-random fixed frame extraction strategy;
the segmentation submodule is used for segmenting the lip image from the processed video frame to obtain a lip data set;
and the recognition submodule is used for recognizing each lip image in the lip data set based on the designed model to obtain a lip language recognition result.
3. The system of claim 2, wherein the identification submodule comprises:
the characteristic extraction unit is used for carrying out characteristic extraction on the lip images to obtain image characteristics, and carrying out slicing operation on the convolutional-layer lip images and the image characteristics to obtain visualized convolutional-layer lip images and visualized high-dimensional image characteristics;
the time sequence feature extraction unit is used for extracting time sequence features from the image features to obtain sequence features, and performing slicing operation on the sequence features to obtain visual sequence features;
and the classification unit is used for classifying the extracted time sequence characteristics to obtain a lip language identification result.
4. The system of claim 3, wherein the feature extraction unit is a Convolutional Neural Network (CNN).
5. The system of claim 3, wherein the timing feature extraction unit is a Recurrent Neural Network (RNN).
6. The system of claim 3, wherein the classification unit is a softmax classifier.
7. The system of claim 3, wherein the human-machine interface comprises:
selecting a video option for acquiring a video to be identified when triggered;
the identification video option is used for carrying out lip language identification on the video to be identified when triggered, so as to obtain an identification result;
a visualization option for displaying the lip language recognition process and the lip language recognition result based on a configuration file set by visualization requirements when triggered;
the display content comprises video frames to be recognized, lip images obtained by segmenting the video frames, visualized convolutional-layer lip images, visualized high-dimensional image features, visualized sequence features and/or at least one recognition result corresponding to the video to be recognized.
8. The system of claim 2, wherein the fixed frame decimation sub-module is specifically configured to:
determining a fixed frame number to be extracted based on a priori condition;
dividing the video to be identified into a plurality of area blocks according to the total number of video frames;
wherein the frame range covered by each area block is equalized as far as possible.
9. The system of claim 2, wherein the video to be identified comprises:
and acquiring the video to be identified for the same target object based on at least one acquisition device.
10. The system of claim 1, wherein the human-machine interface is designed and constructed through a PyQt5 framework.
CN202010556817.2A 2020-06-17 2020-06-17 Lip language recognition system Pending CN111898420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556817.2A CN111898420A (en) 2020-06-17 2020-06-17 Lip language recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556817.2A CN111898420A (en) 2020-06-17 2020-06-17 Lip language recognition system

Publications (1)

Publication Number Publication Date
CN111898420A true CN111898420A (en) 2020-11-06

Family

ID=73206793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556817.2A Pending CN111898420A (en) 2020-06-17 2020-06-17 Lip language recognition system

Country Status (1)

Country Link
CN (1) CN111898420A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN112861791A (en) * 2021-03-11 2021-05-28 河北工业大学 Lip language identification method combining graph neural network and multi-feature fusion
CN115050092A (en) * 2022-05-20 2022-09-13 宁波明家智能科技有限公司 Lip reading algorithm and system for intelligent driving

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004549A (en) * 2010-11-22 2011-04-06 北京理工大学 Automatic lip language identification system suitable for Chinese language
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
US20190318754A1 (en) * 2018-04-16 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
CN110427809A (en) * 2019-06-21 2019-11-08 平安科技(深圳)有限公司 Lip reading recognition methods, device, electronic equipment and medium based on deep learning
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004549A (en) * 2010-11-22 2011-04-06 北京理工大学 Automatic lip language identification system suitable for Chinese language
US20190318754A1 (en) * 2018-04-16 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110427809A (en) * 2019-06-21 2019-11-08 平安科技(深圳)有限公司 Lip reading recognition methods, device, electronic equipment and medium based on deep learning
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUANYAO LU等: "Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory", 《APPLIED SCIENCES》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN112861791A (en) * 2021-03-11 2021-05-28 河北工业大学 Lip language identification method combining graph neural network and multi-feature fusion
CN115050092A (en) * 2022-05-20 2022-09-13 宁波明家智能科技有限公司 Lip reading algorithm and system for intelligent driving

Similar Documents

Publication Publication Date Title
Elmezain et al. A hidden markov model-based continuous gesture recognition system for hand motion trajectory
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
EP1934941B1 (en) Bi-directional tracking using trajectory segment analysis
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
CN111898420A (en) Lip language recognition system
CN110795990B (en) Gesture recognition method for underwater equipment
CN110555387B (en) Behavior identification method based on space-time volume of local joint point track in skeleton sequence
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
Fang et al. Survey on the application of deep reinforcement learning in image processing
CN111523378A (en) Human behavior prediction method based on deep learning
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN113111968A (en) Image recognition model training method and device, electronic equipment and readable storage medium
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
Zhang et al. Multi-modal emotion recognition based on deep learning in speech, video and text
CN112270246A (en) Video behavior identification method and device, storage medium and electronic equipment
Noulas et al. On-line multi-modal speaker diarization
CN117137435B (en) Rehabilitation action recognition method and system based on multi-mode information fusion
CN117765432A (en) Motion boundary prediction-based middle school physical and chemical life experiment motion detection method
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN116363712A (en) Palmprint palm vein recognition method based on modal informativity evaluation strategy
Cao et al. Separable-programming based probabilistic-iteration and restriction-resolving correlation filter for robust real-time visual tracking
Sheth Three-stream network for enriched Action Recognition
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN113065662A (en) Data processing method, self-learning system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination