CN113822125B - Processing method and device of lip language recognition model, computer equipment and storage medium - Google Patents

Processing method and device of lip language recognition model, computer equipment and storage medium

Info

Publication number
CN113822125B
CN113822125B (application number CN202110703815.6A)
Authority
CN
China
Prior art keywords
model
student
training
audio
video
Prior art date
Legal status
Active
Application number
CN202110703815.6A
Other languages
Chinese (zh)
Other versions
CN113822125A (en)
Inventor
何盛烽
任苏成
孙子荀
邓大付
王巨宏
刘婷婷
Current Assignee
South China University of Technology SCUT
Tencent Technology Shenzhen Co Ltd
Original Assignee
South China University of Technology SCUT
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT, Tencent Technology Shenzhen Co Ltd filed Critical South China University of Technology SCUT
Priority to CN202110703815.6A
Publication of CN113822125A
Application granted
Publication of CN113822125B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a processing method and apparatus, computer device, and storage medium for a lip language recognition model. The method relates to artificial intelligence computer vision technology. The whole distillation process is divided into a student training stage and a master training stage that alternate. In the master training stage, a temporary training sample is used to update the student model obtained from the previous round of alternate training, and the resulting temporary student model feeds back the current learning state to the master model through a verification sample, guiding the master model to adaptively adjust its teaching knowledge according to this feedback. In addition, the master model is supervised by a master training sample, and its teaching content is adjusted through the master recognition loss determined on that sample. The student model is trained in the student training stage, and the lip language recognition model is obtained from the student model after multiple iterations. With this scheme, the teaching content can be flexibly adjusted while the accuracy of the master model's teaching knowledge is improved, which improves the knowledge distillation effect.

Description

Processing method and device of lip language recognition model, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for processing a lip language recognition model, a computer device, and a storage medium.
Background
Lip language recognition aims to predict speaking content from silent lip or face videos. This visual task is usually addressed by having a student model learn the ability of lip language recognition from a trained teacher model through knowledge distillation.
Knowledge distillation can transfer knowledge from a teacher model to a student model. At present, however, the teacher model is usually pre-trained and is not trained according to the student model's current ability on the lip language recognition task. Because the needs of the student model are ignored, the teacher model often lacks flexibility in adjusting its teaching knowledge and cannot dynamically adjust the teaching content as the student model develops, which affects the knowledge distillation effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for processing a lip recognition model, which can improve the effect of guiding a student model to learn lip recognition.
A method of processing a lip language recognition model, the method comprising:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
Determining a temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on temporary training samples obtained from the training samples, and updating the student model based on the temporary student loss to obtain a temporary student model;
Determining a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and determining a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample;
obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training sample to obtain a current alternate training updated student model;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
In one embodiment, the determining the learning difficulty coefficient corresponding to each training sample in the training samples includes:
Processing a video frame sequence in each training sample through a pre-trained video teaching aid network to obtain video confidence coefficients of lip language prediction categories of each training sample;
Processing audio signals in each training sample through a pre-trained audio teaching aid network to obtain audio confidence coefficients of lip language prediction categories of each training sample;
And fusing the video confidence coefficient and the audio confidence coefficient to obtain category confidence coefficient of each training sample, and determining a learning difficulty coefficient corresponding to each training sample according to the category confidence coefficient.
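As an illustrative, non-limiting sketch of this embodiment, the following Python (PyTorch-style) code fuses the confidences of the two pre-trained teaching aid networks into a category confidence and maps it to a learning difficulty coefficient; the fusion weight alpha and the choice of defining difficulty as one minus the fused confidence of the labelled category are assumptions made only for illustration.

    import torch

    def learning_difficulty(video_net, audio_net, video_frames, audio_signal,
                            label_idx, alpha=0.5):
        """Estimate a learning difficulty coefficient for one training sample.

        video_net / audio_net: pre-trained teaching aid networks returning class logits.
        alpha: assumed fusion weight between the video and audio confidences.
        """
        with torch.no_grad():
            # Video confidence of each lip language prediction category.
            p_video = torch.softmax(video_net(video_frames), dim=-1)
            # Audio confidence of each lip language prediction category.
            p_audio = torch.softmax(audio_net(audio_signal), dim=-1)
        # Fuse the two confidences into the category confidence of the sample.
        p_fused = alpha * p_video + (1.0 - alpha) * p_audio
        # Assumed mapping: the more confident the teaching aid networks are
        # about the labelled category, the easier the sample.
        return 1.0 - p_fused[..., label_idx]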
In one embodiment, the method further comprises:
According to the current iteration times, determining the number of target samples required by the current alternate training, wherein the number of target samples gradually increases along with the iteration times;
and acquiring the training samples of the target sample number for current alternate training.
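A minimal sketch of how the target sample number might grow with the iteration count; the linear schedule and its bounds are assumptions and are not specified by this embodiment.

    def target_sample_count(iteration, total_iterations, min_samples, max_samples):
        # Assumed linear schedule: the number of training samples used for the
        # current alternate training grows with the iteration count.
        ratio = min(iteration / max(total_iterations, 1), 1.0)
        return int(min_samples + ratio * (max_samples - min_samples))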
In one embodiment, the method further comprises:
Acquiring a video frame sequence to be identified;
inputting the video frame sequence to be identified into the trained lip language identification model;
And processing the video frame sequence to be identified through a video processing network in the lip language identification model, and outputting speaking content corresponding to a speaker in the video frame sequence to be identified.
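The following hedged usage sketch illustrates this inference embodiment, assuming a trained lip_recognition_model callable that maps a video frame sequence to per-category scores; the tensor shape and the id_to_word mapping are illustrative assumptions.

    import torch

    def recognize_speech(lip_recognition_model, frame_sequence, id_to_word):
        """frame_sequence: tensor holding the video frame sequence to be
        identified, e.g. of shape (1, 1, T, H, W)."""
        lip_recognition_model.eval()
        with torch.no_grad():
            # The video processing network in the model maps frames to category scores.
            scores = lip_recognition_model(frame_sequence)
        # Speaking content corresponding to the speaker in the sequence (word level).
        return id_to_word[int(scores.argmax(dim=-1))]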
A processing apparatus of a lip language recognition model, the apparatus comprising:
The sample acquisition module is used for acquiring training samples and acquiring a student model and a master model updated by the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
The temporary student model acquisition module is used for determining temporary student loss according to results obtained by respectively carrying out lip language recognition on temporary training samples acquired from the training samples according to the student model and the master model, and updating the student model based on the temporary student loss to acquire a temporary student model;
The master model training module is used for determining a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and determining a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample; and for obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training samples to obtain a current alternate training updated student model;
And the iteration module is used for returning the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training to continue the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
Determining temporary student loss according to results obtained by respectively carrying out lip language recognition on temporary training samples obtained from the training samples according to the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
Determining a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and determining a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample;
obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training sample to obtain a current alternate training updated student model;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
Determining temporary student loss according to results obtained by respectively carrying out lip language recognition on temporary training samples obtained from the training samples according to the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
Determining a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and determining a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample;
obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training sample to obtain a current alternate training updated student model;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the method of processing a lip language recognition model as described above.
Compared with the traditional approach of guiding the student model to learn with a pre-trained teacher model, the processing method, apparatus, computer device, and storage medium for the lip language recognition model not only train the student model but also train the model that guides the student model to learn, which is called a master model, so that the whole distillation process is divided into a student training stage and a master training stage that alternate.
Specifically, in the master training stage, the student model updated by the previous round of alternate training is updated again using a temporary training sample to obtain a temporary student model, which serves as an auxiliary model and is updated continuously. The temporary student model feeds back the current learning state to the master model through a verification sample; that is, the student feedback loss guides the master model to adaptively adjust its teaching knowledge according to feedback from the current lip language recognition task. In addition, the master model is supervised by a master training sample, and its teaching content is adjusted through the master recognition loss determined on that sample. That is, the supervision information in the training process of the master model includes two parts: a student feedback loss reflecting the current learning state of the student model, and a master recognition loss reflecting the current teaching ability of the master model. Adjusting the master model according to these two losses improves the accuracy of its teaching knowledge while allowing the teaching content to be adjusted flexibly and dynamically, which improves the overall knowledge distillation effect. Therefore, after the master model updated by the current alternate training is obtained, it is used together with the training samples to train the student model updated by the previous alternate training in the student training stage, and after multiple iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
A method of processing a lip language recognition model, the method comprising:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
Constructing a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by a video processing network in the master model performing lip language recognition on the student training sample, a second lip language recognition result obtained by an audio processing network in the master model performing lip language recognition on the student training sample, and a third lip language recognition result obtained by an audiovisual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross-modal fusion loss;
after updating the student model updated by the previous alternate training according to the student loss, obtaining a student model updated by the current alternate training, and carrying out model training on the master model updated by the previous alternate training based on the student model updated by the current alternate training and the training sample to obtain the master model updated by the current alternate training;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A processing apparatus of a lip language recognition model, the apparatus comprising:
The sample acquisition module is used for acquiring training samples and acquiring a student model and a master model updated by the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
The label loss construction module is used for carrying out lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
The cross-modal fusion loss construction module is used for constructing a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by a video processing network in the master model performing lip language recognition on the student training sample, a second lip language recognition result obtained by an audio processing network in the master model performing lip language recognition on the student training sample, and a third lip language recognition result obtained by an audiovisual processing network in the master model based on the video frame sequence and the audio signal;
the student model updating module is used for determining student loss according to the cross entropy loss and the cross-modal fusion loss; after updating the student model updated by the previous alternate training according to the student loss, obtaining a student model updated by the current alternate training, and carrying out model training on the master model updated by the previous alternate training based on the student model updated by the current alternate training and the training sample to obtain the master model updated by the current alternate training;
And the iteration module is used for returning the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training to continue the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
Constructing a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by a video processing network in the master model performing lip language recognition on the student training sample, a second lip language recognition result obtained by an audio processing network in the master model performing lip language recognition on the student training sample, and a third lip language recognition result obtained by an audiovisual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross-modal fusion loss;
after updating the student model updated by the previous alternate training according to the student loss, obtaining a student model updated by the current alternate training, and carrying out model training on the master model updated by the previous alternate training based on the student model updated by the current alternate training and the training sample to obtain the master model updated by the current alternate training;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
Constructing a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by a video processing network in the master model performing lip language recognition on the student training sample, a second lip language recognition result obtained by an audio processing network in the master model performing lip language recognition on the student training sample, and a third lip language recognition result obtained by an audiovisual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross-modal fusion loss;
after updating the student model updated by the previous alternate training according to the student loss, obtaining a student model updated by the current alternate training, and carrying out model training on the master model updated by the previous alternate training based on the student model updated by the current alternate training and the training sample to obtain the master model updated by the current alternate training;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the method of processing a lip language recognition model as described above.
Compared with the traditional approach of guiding the student model to learn with a pre-trained teacher model, the processing method, apparatus, computer device, and storage medium for the lip language recognition model not only train the student model but also train the model that guides the student model to learn, which is called a master model, so that the whole distillation process is divided into a student training stage and a master training stage that alternate.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of the student training sample. In addition, the video processing network in the master model extracts knowledge of the video modality from the student training sample, the audio processing network of the master model extracts knowledge of the audio modality, and the audiovisual processing network of the master model extracts the combined audio-visual knowledge of the student training sample. The cross-modal fusion loss obtained by fusing the knowledge of these three modalities enables the student model to learn the ability to mine multi-modal information from the master model. Training the student model jointly with the cross entropy loss and the cross-modal fusion loss greatly improves its learning effect. After the student model updated by the current alternate training is obtained, it is used together with the training samples to train the master model updated by the previous alternate training in the master training stage, and after multiple iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
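As a hedged sketch of the student training stage described above, the code below combines the cross entropy loss on the label data with a cross-modal fusion loss built from the video, audio, and audio-visual recognition results of the master model; the equal fusion weights, the KL-divergence form of the fusion term, and the temperature are assumptions for illustration only.

    import torch
    import torch.nn.functional as F

    def student_loss(student_logits, label, video_logits, audio_logits, av_logits,
                     temperature=1.0, lambda_fusion=1.0):
        # Cross entropy between the student recognition result and the label data.
        ce = F.cross_entropy(student_logits, label)
        # Fuse the knowledge of the three modalities from the master model
        # (equal weights are an assumption for this sketch).
        fused = (video_logits + audio_logits + av_logits) / 3.0
        # Cross-modal fusion loss: make the student match the fused soft targets
        # (a KL-divergence distillation form is assumed here).
        fusion = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(fused.detach() / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        # Student loss determined from the cross entropy and cross-modal fusion losses.
        return ce + lambda_fusion * fusion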
Drawings
FIG. 1 is an application environment diagram of a method for processing a lip language recognition model in one embodiment;
FIG. 2 is a flow chart of a method for processing a lip language recognition model in one embodiment;
FIG. 3 is a schematic diagram of a model framework for training a master model in the master training stage in one embodiment;
FIG. 4 is a schematic diagram of a network architecture of video streams in one embodiment;
FIG. 5 is a schematic diagram of a network structure of an audio stream in one embodiment;
FIG. 6 is a schematic diagram of a network architecture of a combination of video and audio streams in a sentence-level lip language recognition scenario in one embodiment;
FIG. 7 is a schematic flow chart of lip recognition of training samples by a teacher model in one embodiment;
FIG. 8 is a flow diagram of a student model updated by current alternate training in one embodiment;
FIG. 9 is a flow diagram of determining student loss in one embodiment;
FIG. 10 is a flow diagram of constructing cross-modal fusion losses in one embodiment;
FIG. 11 is a schematic diagram of a model framework for training a student model during a student training phase in one embodiment;
FIG. 12 is a flow diagram of determining temporary student loss in one embodiment;
FIG. 13 is a schematic diagram of a network structure for alternately training a master model and a student model in one embodiment;
FIG. 14 is a flowchart of a method for processing a lip language recognition model in an embodiment;
FIG. 15 is a flowchart of a method for processing a lip language recognition model in another embodiment;
FIG. 16 is a block diagram showing a configuration of a processing apparatus of the lip language recognition model in one embodiment;
FIG. 17 is a block diagram showing a configuration of a processing apparatus of a lip language recognition model in another embodiment;
FIG. 18 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the processing method of the lip language recognition model provided by the application, training of the lip language recognition model and lip language recognition itself are realized using computer vision and machine learning techniques within artificial intelligence (AI).
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, track, and measure targets, and further processes the resulting graphics into images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition. It can be understood that the present application performs lip language recognition from a video frame sequence to be processed, which belongs to the video semantic understanding branch of computer vision.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various areas of artificial intelligence. The artificial neural network is an important machine learning technique with broad application prospects in fields such as system identification, pattern recognition, and intelligent control. It will be appreciated that the present application trains and uses a lip language recognition model through machine learning techniques. The video frame sequences of the present application, which include faces or lips, may be stored on a blockchain network to prevent theft.
The processing method of the lip language recognition model provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may acquire training samples and acquire a student model and a master model updated by the previous alternate training, where each training sample includes a video frame sequence and a corresponding audio signal; determine a temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on temporary training samples obtained from the training samples, and update the student model based on the temporary student loss to obtain a temporary student model; determine a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and determine a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample; obtain a current alternate training updated master model according to the student feedback loss and the master recognition loss, and perform model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training samples to obtain a current alternate training updated student model; and, based on the student model and master model updated by the current alternate training, return to the step of acquiring the student model and master model updated by the previous alternate training to continue the alternate training, and obtain the lip language recognition model from the student model updated when training stops.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a method for processing a lip language recognition model is provided, and the method is applied to the computer device (the terminal 102 or the server 104) in fig. 1 for illustration, and includes the following steps:
Step 202, acquiring training samples and acquiring a student model and a master model updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal.
Lip language recognition refers to the process of recognizing the speaking content of a speaker from silent lip videos or face videos. In the related art, the ability to perform lip language recognition on silent video is usually learned, by means of knowledge distillation, from a teacher model pre-trained with audio signals. In that setting the student model must learn knowledge of another modality from the pre-trained teacher model: audio knowledge and video knowledge are cross-modal, and the potential modal differences between cross-modal data can prevent the student model from learning accurate video knowledge, affecting its lip language recognition performance. Therefore, in the embodiment of the application, each training sample includes a video frame sequence and a corresponding audio signal, so that the master model can understand the knowledge of the video modality, the knowledge of the audio modality, and their combined audio-visual knowledge, compensating for the inherent modal differences between them, and the student model can learn this modal knowledge from the master model.
In an embodiment of the present application, each training sample includes a sequence of video frames and an audio signal. The audio signal is denoted X_A and the video frame sequence is denoted X_V, and the speech content of the audio signal corresponds to the lip content of the video frame sequence; for example, a training sample may correspond to the word "me". The audio signal may be the original waveform in the time domain, and the video frame sequence may be obtained by sampling the original video signal at a preset sampling rate, for example 25 fps. The computer device may also align the audio signal with the video frame sequence, for example so that each audio signal has a length of 1.16 seconds and the corresponding video frame sequence has a length of 29 frames.
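The following sketch mirrors the example in this paragraph (25 fps video, 29 frames, a 1.16-second audio signal); the 16 kHz audio sampling rate and the container class are assumptions used only to illustrate how X_A and X_V may be kept aligned.

    from dataclasses import dataclass
    import torch

    @dataclass
    class LipSample:
        video_frames: torch.Tensor  # X_V: (T, H, W), e.g. T = 29 frames sampled at 25 fps
        audio_signal: torch.Tensor  # X_A: raw time-domain waveform, e.g. 1.16 s of audio
        label: torch.Tensor         # tag data for the lip content, e.g. the word "me"

    def check_alignment(sample: LipSample, fps: int = 25, sample_rate: int = 16000) -> bool:
        # The speech content of X_A should span the same time as X_V (29 / 25 = 1.16 s).
        video_seconds = sample.video_frames.shape[0] / fps
        audio_seconds = sample.audio_signal.shape[-1] / sample_rate
        return abs(video_seconds - audio_seconds) < 1.0 / fps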
Each training sample also includes label data corresponding thereto, the label data representing lip content corresponding to each training sample. The lip recognition can be divided into two application scenarios, one is word-level lip recognition, and the other is sentence-level lip recognition, and when the sentence-level lip recognition is performed, each word is predicted in sequence and then connected to obtain a predicted sentence.
In the word-level lip language recognition scenario, the tag data for each word U ∈ R^K may be represented by a one-hot vector of length K, where K is the vocabulary size, for example 500. The computer device may construct training samples for word-level lip language recognition using a word-level dataset.
In the sentence-level lip language recognition scenario, the tag data for each character Z_q ∈ {R^K | q = 1, 2, ..., Q} in a sentence may be represented by a one-hot vector, where Q is the length of the sentence and Z_q is the q-th character in the sentence. For example, the character set size may be set to 40, comprising 26 letters, 10 numbers, and 4 special marks (space, keyboard, EOS, and punctuation marks); the label data corresponding to each sentence is then a Q×40 vector matrix. For example, "we" corresponds to "wo men", whose label data is a vector matrix composed of the one-hot vectors of 5 letters and 1 space. The computer device may construct training samples for sentence-level lip language recognition using a sentence-level dataset.
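A small sketch of the tag data described above: a one-hot vector of length K for word-level recognition and a Q×40 matrix of one-hot character vectors for sentence-level recognition. The concrete ordering of the assumed 40-symbol character set, and the placeholders used for its special marks, are illustrative assumptions.

    import numpy as np

    def word_label(word_index: int, vocab_size: int = 500) -> np.ndarray:
        # Word level: one-hot vector U in R^K with K the vocabulary size.
        u = np.zeros(vocab_size, dtype=np.float32)
        u[word_index] = 1.0
        return u

    # Sentence level: assumed 40-symbol character set (26 letters, 10 digits and
    # 4 special marks represented here by illustrative placeholders).
    ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789") + [" ", "<eos>", "<punct>", "<pad>"]

    def sentence_label(sentence: str) -> np.ndarray:
        # Each character Z_q becomes a one-hot row, giving a Q x 40 matrix,
        # e.g. "wo men" -> 5 letters plus 1 space -> a 6 x 40 matrix.
        z = np.zeros((len(sentence), len(ALPHABET)), dtype=np.float32)
        for q, ch in enumerate(sentence):
            z[q, ALPHABET.index(ch)] = 1.0
        return z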
In one embodiment, a computer device obtains an original video, determines a lip region by detecting a face region in the original video, and clips the original video centered on the lip region to obtain a sequence of video frames. In addition, the computer equipment can also perform random rotation and scaling treatment on the cut lip area to obtain a richer training sample.
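A hedged sketch of this preprocessing, assuming an external face/lip detector that returns a lip-centre per frame; the crop size, rotation range, and scaling range are illustrative assumptions.

    import random
    import numpy as np
    import cv2  # assumed available for cropping and affine transforms

    def crop_lip_sequence(frames, lip_centres, out_size=88):
        """frames: list of H x W images; lip_centres: per-frame (cx, cy) lip centres
        returned by a face/lip detector (assumed)."""
        crops = []
        for img, (cx, cy) in zip(frames, lip_centres):
            half = out_size // 2
            # Crop the original video centred on the detected lip region.
            crop = img[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
            crops.append(cv2.resize(crop, (out_size, out_size)))
        return np.stack(crops)

    def augment(crops, max_angle=10.0, scale_range=(0.9, 1.1)):
        # Random rotation and scaling of the cropped lip region (assumed ranges).
        angle = random.uniform(-max_angle, max_angle)
        scale = random.uniform(*scale_range)
        h, w = crops.shape[1:3]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        return np.stack([cv2.warpAffine(c, m, (w, h)) for c in crops])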
The purpose of knowledge distillation is to transfer knowledge from a teacher model (Teacher) to a student model (Student). In the related art on lip language recognition, the student model mostly extracts knowledge from a pre-trained teacher model to learn lip language recognition; however, since the teacher model is pre-trained, its teaching content cannot be flexibly and dynamically adjusted according to the current learning state of the student model. For this reason, instead of using a pre-trained teacher model, embodiments of the present application design a trainable network whose teaching content can be adjusted dynamically, called a master model (Master). During training, the master model and the student model are trained alternately. In the master training stage, the model parameters of the student model are fixed and not updated; the master model is optimized both under the supervision of the label data of the training samples and according to the temporary feedback of the student model updated by the previous alternate training. In the student training stage, the model parameters of the master model are fixed and not updated; the student model learns from the master model updated by the previous alternate training the ability to extract cross-modal knowledge from the training samples, and is optimized under the supervision of the label data of the training samples.
Specifically, when performing the current alternate training, the computer device acquires the student model and the master model updated by the previous alternate training and continues the alternate training on this basis. For example, in the current alternate training, during the master training stage the computer device acquires 10 mini-batches of training samples, each containing 30 samples, to iterate the master model 10 times; at the end of the 10th iteration the updated master model is obtained. Similarly, in the student training stage the computer device acquires another 10 mini-batches of training samples to iterate the student model 10 times, and the updated student model is obtained at the end of the 10th iteration. The alternate training then continues in the same way. It should be noted that, because the training alternates, the order in which the master model and the student model are trained is not limited.
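The following Python-style outline sketches the alternation in this example (10 mini-batches of 30 samples per stage); update_master_step and update_student_step are placeholders for the master and student updates detailed in the later steps, and the loader interface is an assumption.

    def alternate_training(student, master, loader, rounds,
                           batches_per_stage=10, batch_size=30,
                           update_master_step=None, update_student_step=None):
        # One round = one master training stage followed by one student training
        # stage (the order of the two stages is not limited by the method).
        for _ in range(rounds):
            # Master training stage: student parameters are fixed.
            for batch in loader.sample_batches(batches_per_stage, batch_size):
                update_master_step(master, student, batch)
            # Student training stage: master parameters are fixed.
            for batch in loader.sample_batches(batches_per_stage, batch_size):
                update_student_step(student, master, batch)
        return student, master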
It should be understood that "the student model and master model updated by the previous alternate training" describes the student model and master model obtained after the preceding round of alternate training; "previous" and "current" are relative concepts. For example, after the current alternate training is performed using the student model and master model updated by the previous alternate training, the resulting student model and master model updated by the current alternate training serve as the new "student model and master model updated by the previous alternate training" for the next round of alternate training, and that next round becomes the new current alternate training.
And 204, determining temporary student loss according to results obtained by respectively performing lip language recognition on temporary training samples obtained from the training samples according to the student model and the master model, and updating the student model based on the temporary student loss to obtain the temporary student model.
In the related art, the teacher model is usually pre-trained and is not trained according to the student's current lip language recognition ability; the learning needs of the student model are ignored, so the teacher model often lacks flexibility when adjusting its teaching knowledge. To this end, during the master training stage, the computer device uses one or more temporary training samples to temporarily update the student model updated by the previous alternate training, obtaining a temporary student model (Temporary Student); the lip language recognition ability of this temporary student model is used to feed the current learning state of the student model back to the master model.
Specifically, the computer device may obtain a temporary training sample from the training samples and predict it with both the student model and the master model updated by the previous alternate training to obtain their respective prediction results. When the student model is updated to obtain the temporary student model, the master model is not updated; the label data of the temporary training sample and the prediction result of the master model serve as the basis for updating the student model. It will be appreciated that the temporary student model is obtained by updating the student model updated by the previous alternate training, so the temporary student model is continuously refreshed during the master training stage of each round of alternate training.
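A hedged sketch of the temporary student update described above: the temporary student loss combines supervision from the label data of the temporary training sample with a distillation term toward the master model's prediction (the KL form, the equal weighting, the single SGD step with the chosen learning rate, and the master's input signature are assumptions), and PyTorch is used only for illustration.

    import copy
    import torch
    import torch.nn.functional as F

    def build_temporary_student(student, master, frames, audio, label, lr=1e-3):
        # The temporary student starts from the student model updated by the
        # previous round of alternate training; the master is not updated here.
        temp_student = copy.deepcopy(student)
        optimizer = torch.optim.SGD(temp_student.parameters(), lr=lr)

        student_logits = temp_student(frames)
        with torch.no_grad():
            master_logits = master(frames, audio)  # assumed: master sees video and audio

        # Temporary student loss: label supervision plus distillation from the
        # master model's prediction (form and weights are assumptions).
        loss = F.cross_entropy(student_logits, label) + F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(master_logits, dim=-1),
            reduction="batchmean",
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return temp_student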
And 206, determining student feedback loss according to the result obtained by performing lip language identification on the verification sample obtained from the training sample and the label data of the verification sample by the temporary student model, and determining teacher identification loss according to the result obtained by performing lip language identification on the large teacher's instructions training sample obtained from the training sample and the label data of the large teacher's instructions training sample by the teacher model.
The verification sample is used to verify the current lip language recognition ability of the student model, and the learning state of the current student model can be determined from the student feedback loss constructed from the temporary student model's lip language recognition result on the verification sample and the tag data of the verification sample. Thus, when the master model is optimized based on the student feedback loss, it receives feedback from the student model, so that it can flexibly adjust its teaching content during optimization and improve its ability to transfer knowledge to the student model.
Specifically, the computer device may obtain a verification sample from the training samples, perform lip language recognition on it through the temporary student model to obtain a prediction result, and construct a cross entropy loss from this prediction result and the tag data of the verification sample as the student feedback loss. In addition, in order to improve the lip language recognition performance of the student model, the master model needs to extract more comprehensive teaching knowledge so that the student model can learn more comprehensive knowledge from it. For this purpose, the computer device further obtains a master training sample from the training samples, performs lip language recognition on it through the master model updated by the previous alternate training, and constructs the master recognition loss from the recognition result and the tag data of the master training sample.
That is, the supervision information in the training process of the master model comprises two parts, one part is the student feedback loss reflecting the current learning state of the student model, the other part is the master recognition loss reflecting the current teaching ability of the master model, and the updated master model is adjusted according to the two losses, so that the teaching knowledge accuracy of the master model can be improved, and meanwhile, the teaching content can be flexibly and dynamically adjusted, so that the whole knowledge distillation effect is improved.
In some embodiments, the verification sample used to verify the learning effect of the current student model may be the same training sample as the master training sample used to refine the teaching knowledge of the master model. In other embodiments, since the verification sample is used to verify the lip language recognition ability of the current student model, the verification sample may be a training sample from a validation set while the master training sample is a training sample from the training set, i.e. the verification sample and the master training sample are different training samples.
And step 208, obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on a student model updated by previous alternate training based on the current alternate training updated master model and the training sample to obtain a current alternate training updated student model.
Specifically, in the master training stage, the computer device performs gradient back-propagation through the student feedback loss and the master recognition loss to update the model parameters of the master model. After the master model updated by the current alternate training is obtained, the computer device continues, in the student training stage, to perform model training on the student model updated by the previous alternate training based on the master model updated by the current alternate training and the training samples, obtaining the student model updated by the current alternate training.
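Continuing the sketch, the master update in this step can be outlined as below, summing the student feedback loss and the master recognition loss before back-propagation; the unweighted sum and the optimizer are assumptions, and the second-order path by which the student feedback loss influences the master model through the temporary student update is omitted here for brevity.

    import torch
    import torch.nn.functional as F

    def update_master_step(master, temp_student, val_frames, val_label,
                           master_frames, master_audio, master_label, optimizer):
        # Student feedback loss: how well the temporary student recognises the
        # verification sample, reflecting the student's current learning state.
        feedback_loss = F.cross_entropy(temp_student(val_frames), val_label)

        # Master recognition loss: the master's own recognition of the master
        # training sample, reflecting its current teaching ability.
        recognition_loss = F.cross_entropy(
            master(master_frames, master_audio), master_label)

        # Gradient back-propagation through both losses; the optimizer is assumed
        # to hold the master model's parameters (meta-gradient path omitted).
        loss = feedback_loss + recognition_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()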
Step 210, based on the updated student model and master model of the current alternate training, returning to the step of obtaining the updated student model and master model of the previous alternate training to continue the alternate training, and obtaining the lip language recognition model according to the updated student model when the training is stopped.
Specifically, the process in which the computer device alternately trains the master model and the student model according to the preceding steps is called one iteration of the alternate training. The computer device may iterate multiple times, returning to the step of acquiring the student model and the master model updated by the previous alternate training to continue the alternate training until an iteration stop condition is met, and then obtains the lip language recognition model from the updated student model.
FIG. 3 is a schematic diagram of a model framework for training the master model in the master training stage of the alternate training in one embodiment. Referring to FIG. 3, after the student model and master model updated by the previous alternate training are obtained, the video frame sequence of a temporary training sample is input to the student model, the video frame sequence and the audio signal of the temporary training sample are both input to the master model, a temporary student loss is constructed from the output results of the two models, and the temporary student model is obtained after the student model is updated according to this loss. Then, the video frame sequence of the verification sample is input to the temporary student model and the student feedback loss is constructed from its output, while the video frame sequence and audio signal of the master training sample are input to the master model and the master recognition loss is constructed from its output. The model parameters of the master model are updated based on the student feedback loss and the master recognition loss.
Compared with the traditional approach of guiding the student model to learn with a pre-trained teacher model, the processing method of the lip language recognition model not only trains the student model but also trains the model that guides the student model to learn, called a master model, so that the whole distillation process is divided into a student training stage and a master training stage that alternate. Specifically, in the master training stage, the student model updated by the previous alternate training is updated again using a temporary training sample to obtain a temporary student model, which serves as an auxiliary model and is updated continuously. The temporary student model feeds back the current learning state to the master model through a verification sample; that is, the student feedback loss guides the master model to adaptively adjust its teaching knowledge according to feedback from the current lip language recognition task. In addition, the master model is supervised by a master training sample, and its teaching content is adjusted through the master recognition loss determined on that sample. After the master model updated by the current alternate training is obtained, it is used together with the training samples to train the student model updated by the previous alternate training in the student training stage, and after multiple iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
In one embodiment, the student model needs to learn, through model training, the ability to perform lip language recognition on silent video; the student model is therefore a video-stream-based model whose input is the video frame sequence of a training sample and whose output is the lip language recognition result. In order to improve the lip language recognition performance of the student model, the master model needs to extract more comprehensive teaching knowledge so that the student model can learn more comprehensive knowledge from it. In the embodiment of the application, the master model is therefore a model based on the combination of a video stream and an audio stream; it can extract more comprehensive knowledge from data of different modalities, compensating for the inherent modal differences between cross-modal data, and its input includes both the audio signal and the video frame sequence of a training sample.
The audio stream is an audio processing network that generates a prediction result based on the audio signal, the video stream is a video processing network that generates a prediction result based on the video signal, and the audio-visual combined stream aims to combine the audio signal and the video signal to generate a prediction result. The audio stream and the video stream each comprise a front-end feature extraction layer, a back-end feature mapping layer, and an output layer for classification. The combination of the video stream and the audio stream comprises the audio stream and the video stream, a vector concatenation layer, and an output layer for classification, where the vector concatenation layer obtains an audio-visual combined output vector from the output vectors generated at the back ends of the audio stream and the video stream.
In one embodiment, the feature extraction layer at the front end of the audio stream may use ResNet-18; since the audio signal lies in a 1-dimensional space, the computer device may replace the two-dimensional convolution kernels at the front end of the audio stream with one-dimensional convolutions and set the kernel size of the first-layer convolution according to the sampling rate of the audio signal. The feature mapping layer at the back end of the audio stream may use a temporal convolution or a Transformer sequence-to-sequence model (TM-Seq2Seq) in the word-level lip language recognition scenario, and TM-Seq2Seq in the sentence-level lip language recognition scenario.
In one embodiment, the feature extraction layer at the front end of the video stream may use ResNet-18, and since the video signal is an image signal that also includes a time dimension, the computer device may replace the first convolution layer at the front end of the video stream with a three-dimensional convolution. The feature mapping layer at the back end of the video stream may use a temporal convolution or a transformer sequence-to-sequence model (TM-Seq2Seq, including multi-head attention and feed-forward networks) in a word-level lip language recognition scenario, and TM-Seq2Seq in a sentence-level lip language recognition scenario.
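For readers who want a concrete picture of these front-end adaptations, the following is a minimal PyTorch sketch, assuming hypothetical module names and placeholder kernel sizes (the 1-D kernel stands in for "set according to the sampling rate", and the ResNet-18-style blocks that follow are omitted):

```python
import torch
import torch.nn as nn

class AudioFrontEnd(nn.Module):
    # Assumed adaptation: a 1-D convolution over the raw waveform replaces the 2-D kernels.
    def __init__(self, out_channels=64):
        super().__init__()
        self.conv1 = nn.Conv1d(1, out_channels, kernel_size=80, stride=4, padding=38)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        return self.act(self.bn(self.conv1(wav)))

class VideoFrontEnd(nn.Module):
    # Assumed adaptation: the first layer is a 3-D convolution over (time, H, W).
    def __init__(self, out_channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(1, out_channels, kernel_size=(5, 7, 7),
                               stride=(1, 2, 2), padding=(2, 3, 3))
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, frames):              # frames: (batch, 1, T, H, W)
        return self.act(self.bn(self.conv1(frames)))

audio_feat = AudioFrontEnd()(torch.randn(2, 1, 16000))
video_feat = VideoFrontEnd()(torch.randn(2, 1, 29, 88, 88))
print(audio_feat.shape, video_feat.shape)
```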
In one embodiment, a combination of the video stream and the audio stream is used to obtain a prediction of a merging feature derived from the audio stream and the video stream. The vector cascade layer in the combination of the video stream and the audio stream directly connects output vectors respectively generated at the rear ends of the audio stream and the video stream into a new vector in a word-level lip language identification scene; in the sentence-level lip language identification scene, the video coding vector and the audio coding vector are respectively obtained through the attention of the context information to the audio output vector and the video output vector and then are connected into a new audio-visual combination output vector.
As shown in fig. 4, which is a schematic diagram of the network structure of the video stream in one embodiment, the input is a video frame sequence, video features are obtained through the feature extraction layer at the front end, and a video output vector is then obtained by the TM-Seq2Seq-based back end.
Fig. 5 is a schematic diagram of the network structure of the audio stream in one embodiment. Referring to fig. 5, the input is an audio signal, audio features are obtained through the feature extraction layer (one-dimensional convolution) at the front end, and an audio output vector is obtained by the TM-Seq2Seq-based back end.
Fig. 6 is a schematic diagram of a network structure of a combination of video and audio streams in a sentence-level lip-recognition scenario in one embodiment. Referring to fig. 6, the network structure includes, in addition to the audio stream and the video stream as shown in fig. 4 and 5, an audiovisual processing network including a multi-headed attention encoding layer and a concatenation layer for obtaining an audiovisual combined output vector according to the attention of the context to the current output character, and an output layer for obtaining a lip recognition result according to the audiovisual combined output vector.
In one embodiment, the step of the student model performing lip language recognition on the training sample comprises: inputting a video frame sequence in a training sample into a student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to video features through a feature mapping layer of the student model; and obtaining a lip language identification result according to the video output vector through an output layer of the student model.
As mentioned above, the student model is a model based on a video stream, i.e. a model based on a video processing network. Referring to the network structure of fig. 4, the video stream includes a feature extraction layer, a feature mapping layer, and an output layer for classification. When the computer device needs to perform lip language recognition on a training sample through the student model, the video frame sequence in the training sample is input into the student model to obtain the corresponding lip language recognition result. In the word-level lip language recognition scenario, the lip language recognition result output by the video-stream-based student model is a K-dimensional vector, where K is the vocabulary size and each element represents the probability that the lip content of the video frame sequence is the corresponding word in the vocabulary. In the sentence-level lip language recognition scenario, the lip language recognition result output by the video-stream-based student model is a matrix rather than a single vector.
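A minimal sketch of this video-stream student forward pass is given below (PyTorch; the GRU back end is a stand-in for the temporal-convolution/TM-Seq2Seq feature mapping layer, and all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    # Video-stream-only student: feature extraction -> feature mapping -> classification.
    def __init__(self, feat_dim=256, vocab_size=500):
        super().__init__()
        self.front_end = nn.Sequential(              # stand-in for 3-D conv + ResNet-18
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # keep the time axis, pool space
        )
        self.back_end = nn.GRU(32, feat_dim, batch_first=True)  # stand-in mapping layer
        self.output = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames):                        # frames: (batch, 1, T, H, W)
        v = self.front_end(frames).squeeze(-1).squeeze(-1)       # (batch, 32, T)
        v, _ = self.back_end(v.transpose(1, 2))                  # (batch, T, feat_dim)
        video_vec = v.mean(dim=1)                                 # word-level pooling
        return self.output(video_vec).softmax(dim=-1)             # K-dim probabilities

probs = StudentModel()(torch.randn(2, 1, 29, 88, 88))
print(probs.shape)   # (2, 500)
```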
In one embodiment, as shown in fig. 7, the step of performing lip language recognition on the training sample by the master model includes:
step 702, a training sample is input into a master model.
As mentioned above, the master model is a model based on a combination of a video stream and an audio stream, as shown in fig. 6, which includes a vector concatenation layer and an output layer for classification in addition to the above-mentioned audio stream and video stream. In this embodiment, the master model includes a video processing network based on a video stream and an audio processing network based on an audio stream, and also includes an audio-visual processing network. When the computer equipment needs to carry out lip language recognition on the training sample through the master model, the video frame sequence and the audio signal in the training sample are input into the master model.
And step 704, processing the video frame sequence in the training sample through a video processing network in the master model to obtain a first lip language identification result.
The video processing network in the master model is the network structure based on the video stream; that is, the master model is a model built on both the audio processing network and the video processing network. The computer device inputs the video frame sequence in the training sample into the video processing network to obtain the first lip language recognition result. The first lip language recognition result is a recognition result obtained based on the video information of the training sample.
In one embodiment, processing, by a video processing network in a master model, a sequence of video frames in a training sample to obtain a first lip language recognition result includes: inputting a video frame sequence in the training sample into a video processing network of the master model; extracting video features corresponding to the video frame sequences through a feature extraction layer of the video processing network, obtaining video output vectors according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language recognition result according to the video output vectors through an output layer of the video processing network.
Specifically, the video processing network is a model based on a video stream, and referring to the network structure of fig. 4, the video stream includes a feature extraction layer, a feature mapping layer, and an output layer for classification. The computer equipment inputs the video frame sequence in the training sample into a video processing network, and the first lip language recognition result is obtained through the processing of a feature extraction layer, a feature mapping layer and an output layer of the video processing network in sequence.
And step 706, processing the audio signals in the training samples through an audio processing network in the master model to obtain a second lip language identification result.
The audio processing network in the master model is the network structure based on the audio stream. The computer device inputs the audio signal in the training sample into the audio processing network to obtain the second lip language recognition result. The second lip language recognition result is a recognition result obtained based on the audio information of the training sample.
In one embodiment, step 706 includes: inputting the audio signal in the training sample into an audio processing network of the master model; extracting audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtaining audio output vectors according to the audio features through a feature mapping layer of the audio processing network, and obtaining a second lip language recognition result according to the audio output vectors through an output layer of the audio processing network.
Specifically, the audio processing network is a model based on an audio stream, and referring to the network structure of fig. 5, the audio stream includes a feature extraction layer, a feature mapping layer, and an output layer for classification. The computer equipment inputs the audio signals in the training samples into an audio processing network, and the second lip language recognition result is obtained through the processing of a feature extraction layer, a feature mapping layer and an output layer of the audio processing network in sequence.
Step 708, obtaining, by the audio-visual processing network in the master model, an audio-visual combined output vector based on a video output vector obtained by the video processing network according to the video frame sequence and an audio output vector obtained by the audio processing network according to the audio signal, and obtaining a third lip recognition result based on the audio-visual combined output vector.
Wherein the audio-visual processing network in the master model is configured to obtain derived audio-visual combined output vectors based on the video output vectors and the audio output vectors. The computer equipment obtains an audio-visual combined output vector according to the video output vector output by the video processing network and the audio output vector output by the audio processing network, and obtains a third lip language recognition result according to the audio-visual combined output vector. The audio-visual combined output vector is a feature derived from the video output vector and the audio output vector that can reflect potential cross-modal knowledge between the video modality and the audio modality.
In one embodiment, when the student model is used for word-level lip recognition, step 708 includes: inputting the video output vector and the audio output vector into an audio-visual processing network of a master model; and cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
Specifically, the audio-visual processing network comprises a cascade layer and an output layer, in a word-level lip language identification scene, the computer equipment cascades the video output vector and the audio output vector through the cascade layer to obtain an audio-visual combined output vector, and then obtains a third lip language identification result through the output layer for classification.
In one embodiment, when the student model is used for sentence-level lip language recognition, step 708 includes: determining a feature vector of a preceding output character; inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence, and the audio output vector obtained by the audio processing network according to the audio signal into the audio-visual processing network of the master model; the multi-head attention coding layer of the audio-visual processing network is used for obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector; and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
Referring to the network structure of fig. 6, the computer device inputs the video frame sequence in the training sample into the video processing network, obtains the video output vector through the processing of the feature extraction layer and the feature mapping layer of the video processing network, and inputs the audio signal in the training sample into the audio processing network, and obtains the audio output vector through the processing of the feature extraction layer and the feature mapping layer of the audio processing network in sequence.
In order to exploit the influence of the previously output character on the current output character, the multi-head attention encoding layer of the audio-visual processing network further encodes the video output vector and the audio output vector using the feature vector of the previous character, producing a video encoding vector and an audio encoding vector; the video encoding vector and the audio encoding vector are concatenated through the concatenation layer of the audio-visual processing network to obtain the audio-visual combined output vector, and the third lip language recognition result is obtained from the audio-visual combined output vector through the output layer of the audio-visual processing network.
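The sentence-level fusion just described can be sketched roughly as follows (PyTorch; `nn.MultiheadAttention` is used here as a stand-in for the multi-head attention encoding layer, and the dimensions and module names are assumptions):

```python
import torch
import torch.nn as nn

class AVFusionHead(nn.Module):
    # Uses the previous output character's feature vector as the attention query over
    # the video and audio output vectors, then concatenates and classifies.
    def __init__(self, dim=256, heads=4, vocab_size=40):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.output = nn.Linear(2 * dim, vocab_size)

    def forward(self, prev_char, video_out, audio_out):
        # prev_char: (batch, 1, dim); video_out/audio_out: (batch, T, dim)
        v_enc, _ = self.attn_v(prev_char, video_out, video_out)   # video encoding vector
        a_enc, _ = self.attn_a(prev_char, audio_out, audio_out)   # audio encoding vector
        av = torch.cat([v_enc, a_enc], dim=-1).squeeze(1)         # audio-visual vector
        return self.output(av).softmax(dim=-1)                    # next-character scores

head = AVFusionHead()
out = head(torch.randn(2, 1, 256), torch.randn(2, 20, 256), torch.randn(2, 25, 256))
print(out.shape)   # (2, 40)
```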
With respect to step 204, the specific manner of determining the temporary student loss from the results obtained by the student model and the master model respectively performing lip language recognition on the temporary training sample obtained from the training samples is consistent with the manner of constructing the student loss for the student model in the student training stage of the alternate training, which is described in detail later.
Regarding student feedback loss in step 206, cross entropy loss may be used.
In one embodiment, determining the student feedback loss according to the result obtained by the temporary student model performing lip language recognition on the verification sample obtained from the training samples and the tag data of the verification sample comprises: inputting the video frame sequence in the verification sample into the temporary student model; extracting video features corresponding to the video frame sequence through the feature extraction layer of the temporary student model; obtaining a video output vector from the video features through the feature mapping layer of the temporary student model; obtaining a lip language recognition result from the video output vector through the output layer of the temporary student model; and constructing a cross entropy loss from the lip language recognition result and the tag data of the verification sample as the student feedback loss.
In an embodiment, the computer device obtains the student model updated by the previous alternate training and updates it again by using the temporary student loss determined on the temporary training sample, so as to obtain the temporary student model. The computer device may obtain the temporary student model by adopting the following one-step gradient update:

θ_ts = θ_s − α∇_θs L_s;

wherein L_s represents the temporary student loss determined using the temporary training sample, θ_s represents the model parameters of the student model updated by the previous alternate training, θ_ts represents the model parameters of the temporary student model, and α represents the learning rate.
After inputting the verification sample into the temporary student model, the computer device may construct the student feedback loss using the following formula:

L_ts = L_CE(y_1, f_ts(X_V; θ_ts));

where f_ts(X_V; θ_ts) represents the result obtained by the temporary student model f_ts performing lip language recognition on the video frame sequence of the verification sample, and y_1 represents the tag data of the verification sample.
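A simplified sketch of the temporary-student update and the student feedback loss (PyTorch ≥ 2.0; a one-step gradient update with learning rate `alpha` is assumed, and the student model is assumed to return unnormalized class scores):

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def student_feedback_loss(student, temp_student_loss, val_frames, val_labels, alpha=0.01):
    # Inner update theta_ts = theta_s - alpha * grad(L_s); create_graph=True keeps the
    # dependence of the temporary parameters on whatever produced temp_student_loss
    # (including the master model), so this feedback loss can later reach the master.
    names, params = zip(*student.named_parameters())
    grads = torch.autograd.grad(temp_student_loss, params, create_graph=True)
    temp_params = {n: p - alpha * g for n, p, g in zip(names, params, grads)}
    # Temporary student = the student architecture evaluated with the updated parameters;
    # the stored student weights themselves are left untouched.
    logits = functional_call(student, temp_params, (val_frames,))
    return F.cross_entropy(logits, val_labels)   # L_ts = CE(y1, f_ts(X_V))
```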
Regarding the master recognition loss in step 206, a cross entropy loss may also be used. In one embodiment, determining the master recognition loss from the results obtained by the master model performing lip language recognition on the master training sample among the training samples and the tag data of the master training sample includes: inputting the master training sample into the master model to obtain the corresponding first lip language recognition result, second lip language recognition result, and third lip language recognition result; determining a first cross entropy loss from the tag data of the master training sample and the first lip language recognition result, determining a second cross entropy loss from the tag data of the master training sample and the second lip language recognition result, determining a third cross entropy loss from the tag data of the master training sample and the third lip language recognition result, and fusing the first, second, and third cross entropy losses to obtain the master recognition loss.
For the embodiment of inputting the master training sample into the master model to obtain the corresponding first, second, and third lip language recognition results, reference may be made to the processing flow of performing lip language recognition on a training sample with the master model described in fig. 7, and to the foregoing detailed description of the master model based on the combination of the video stream and the audio stream.
Specifically, the computer device inputs the video frame sequence in the master training sample into the video processing network of the master model; video features corresponding to the video frame sequence are extracted through the feature extraction layer of the video processing network, a video output vector is obtained from the video features through the feature mapping layer of the video processing network, and the first lip language recognition result is obtained from the video output vector through the output layer of the video processing network. The computer device inputs the audio signal in the master training sample into the audio processing network of the master model, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio processing network, obtains an audio output vector from the audio features through the feature mapping layer of the audio processing network, and obtains the second lip language recognition result from the audio output vector through the output layer of the audio processing network. Through the audio-visual processing network in the master model, an audio-visual combined output vector is obtained based on the video output vector obtained by the video processing network from the video frame sequence and the audio output vector obtained by the audio processing network from the audio signal, and the third lip language recognition result is obtained based on the audio-visual combined output vector.
In one embodiment, the computer device may construct the master recognition loss using the following formula:

L_m = λ_m (L_CE(y_2, f_m(X_A, X_V; θ_AV)) + L_CE(y_2, f_m(X_A; θ_A)) + L_CE(y_2, f_m(X_V; θ_V)));

wherein λ_m represents a balance factor, f_m(X_A, X_V; θ_AV) represents the third lip language recognition result corresponding to the master training sample, f_m(X_A; θ_A) represents the second lip language recognition result corresponding to the master training sample, f_m(X_V; θ_V) represents the first lip language recognition result corresponding to the master training sample, and y_2 represents the tag data of the master training sample.
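Translated directly into code, the master recognition loss can be sketched as below (PyTorch; the three inputs are the master model's audio-visual, audio-only, and video-only outputs on the master training sample, assumed to be unnormalized scores):

```python
import torch.nn.functional as F

def master_recognition_loss(av_logits, a_logits, v_logits, labels, lambda_m=1.0):
    # L_m = lambda_m * ( CE(y2, audio-visual) + CE(y2, audio-only) + CE(y2, video-only) )
    return lambda_m * (F.cross_entropy(av_logits, labels)
                       + F.cross_entropy(a_logits, labels)
                       + F.cross_entropy(v_logits, labels))
```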
Then, the total optimization loss of the master model during the master training stage can be expressed by the following formula:

L_master = L_ts + λ_m (L_CE(y_2, f_m(X_A, X_V; θ_AV)) + L_CE(y_2, f_m(X_A; θ_A)) + L_CE(y_2, f_m(X_V; θ_V)))
Gradient back propagation is performed through the student feedback loss and the master recognition loss so as to update the model parameters of the master model and obtain the master model updated by the current alternate training.
Next, the optimization process of the student model in the student training phase of the alternate training is described.
In the student training stage, only model parameters of the student model are updated, and training targets comprise cross entropy loss and cross-modal fusion loss, wherein the cross entropy loss is used for improving classification accuracy of the student model, and the cross-modal fusion loss is used for matching output between the student and a master model, so that the student model learns from the master model to cross-modal knowledge.
In one embodiment, as shown in fig. 8, performing model training on the student model updated in the previous alternate training based on the master model updated in the current alternate training and the training samples in step 208, to obtain the student model updated in the current alternate training, includes:
Step 802, obtaining a student training sample from the training samples;
Step 804, determining the student loss according to the result obtained by the student model updated in the previous alternate training performing lip language recognition on the student training sample and the result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample.
Step 806, updating the updated student model of the previous alternate training according to the student loss, and obtaining the updated student model of the current alternate training.
In one embodiment, as shown in FIG. 9, step 804 includes:
Step 902, performing lip language recognition on the video frame sequence in the student training sample through the student model updated in the previous alternate training to obtain a student recognition result, and constructing a cross entropy loss according to the student recognition result and the tag data of the student training sample.
Specifically, the computer device may input a video frame sequence of a student training sample into a student model updated by previous alternate training, extract video features corresponding to the video frame sequence through a feature extraction layer of the student model, obtain a video output vector according to the video features through a feature mapping layer of the student model, and obtain a student recognition result according to the video output vector through an output layer of the student model.
In one embodiment, in the word-level lip language recognition scenario, after the student training sample is input into the student model to obtain the student recognition result, the corresponding cross entropy loss can be expressed by the following formulas:

y = [y_1, y_2, y_3, ..., y_K];
y' = [y'_1, y'_2, y'_3, ..., y'_K];
L_CE(y, y') = −Σ_{k=1}^{K} y_k log y'_k;

wherein y represents the tag data of the student training sample, K represents the vocabulary size, y' represents the student recognition result of the student model for the student training sample, which can be denoted as f_s(X_V; θ_s), and L_CE represents the cross entropy loss.
In the sentence-level lip language recognition scenario, the computer device may obtain the loss generated by each character in the sentence using the above formula, and obtain the cross entropy loss of the sentence according to the loss generated by all the characters.
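A possible sketch of this per-character accumulation (PyTorch; the logits tensor shape is an assumption):

```python
import torch.nn.functional as F

def sentence_cross_entropy(char_logits, char_labels):
    # char_logits: (batch, L, K) unnormalized scores; char_labels: (batch, L) indices.
    batch, length, vocab = char_logits.shape
    return F.cross_entropy(char_logits.reshape(batch * length, vocab),
                           char_labels.reshape(batch * length))
```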
Step 904, constructing a cross-modal fusion loss according to the student recognition result and the first lip language recognition result, the second lip language recognition result, and the third lip language recognition result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample.
In this embodiment, knowledge distillation from the speech modality to the video modality is valuable for lip language recognition, because distinct phoneme features and video features help avoid ambiguity. The master model therefore outputs different types of cross-modal knowledge, namely audio knowledge, video knowledge, and audio-visual knowledge, so as to further refine the teaching content and strengthen its guidance of the student model.
In particular, the computer device may input a sequence of video frames of a student training sample into a video processing network of a master model; extracting video features corresponding to the video frame sequences through a feature extraction layer of the video processing network, obtaining video output vectors according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language recognition result according to the video output vectors through an output layer of the video processing network. The computer equipment inputs the audio signals in the student training samples into an audio processing network of a master model, extracts audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtains audio output vectors according to the audio features through a feature mapping layer of the audio processing network, and obtains a second lip language recognition result according to the audio output vectors through an output layer of the audio processing network. And obtaining an audio-visual combined output vector based on a video output vector obtained by the video processing network according to the video frame sequence and an audio output vector obtained by the audio processing network according to the audio signal through an audio-visual processing network in the master model, and obtaining a third lip language recognition result based on the audio-visual combined output vector.
Then, the computer device can construct the cross-modal fusion loss according to the student recognition result output by the student model and the first lip language recognition result, the second lip language recognition result, and the third lip language recognition result output by the master model.
Further, due to inherent modal differences between video modal data and audio modal data, how to fuse cross-modal knowledge becomes a further problem to be solved when updating a student model. According to the embodiment of the application, two pre-training teaching aid networks, namely a video teaching aid network (tutorV) and an audio teaching aid network (tutorA), are introduced, video information and audio information which are respectively output are used as additional cross-modal guidance, the video information and the audio information are encoded into weighting coefficients, and the weighting coefficients are used as preference degrees of a student model on the video information and the audio information, so that students can self-balance the preference of learning on the video characteristics and the audio characteristics during training.
In one embodiment, as shown in FIG. 10, step 904 includes:
Step 1002, obtaining a video output vector corresponding to a video frame sequence in a training sample of a student through a pre-training video teaching network, and then encoding the video output vector into a video preference coefficient.
The video teaching aid network is a network based on video streams, the audio teaching aid network is a network based on audio streams, and parameters of the video teaching aid network and the audio teaching aid network are not updated in the process of alternately training a master model and a student model. The video teaching aid network is used for extracting video information of a video frame sequence in the training sample, and the audio teaching aid network is used for extracting audio information of an audio signal in the training sample. The information provided by both of them can be used to balance the knowledge of the different modalities.
Specifically, the computer device inputs a video frame sequence in a student training sample into a pre-trained video teaching aid network, extracts video features corresponding to the video frame sequence through a feature extraction layer of the video teaching aid network, obtains a video output vector according to the video features through a feature mapping layer of the video teaching aid network, and the video output vector obtained by the video teaching aid network can be marked as H V, codes the video output vector into a video preference coefficient and can be marked as W V.
Step 1004, obtaining an audio output vector corresponding to the audio signal in the student training sample through the pre-trained audio teaching aid network, and then encoding the audio output vector into an audio preference coefficient.

Similarly, the computer device inputs the audio signal in the student training sample into the pre-trained audio teaching aid network, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio teaching aid network, and obtains an audio output vector from the audio features through the feature mapping layer of the audio teaching aid network; the audio output vector obtained by the audio teaching aid network can be denoted as H_A, and it is encoded into the audio preference coefficient, which can be denoted as W_A.
Step 1006, determining a first focus loss according to the student identification result and the first lip language identification result, determining a second focus loss according to the student identification result and the second lip language identification result, and determining a third focus loss according to the student identification result and the third lip language identification result.
In this embodiment, in order for the student model to dynamically learn the cross-modal knowledge extracted by the master model, the learning effect of the student is balanced, and focus Loss (Focal Loss) is adopted to alleviate the problem of unbalanced difficulty of the training sample.
Step 1008, weighting the first focus loss according to the video preference coefficient, weighting the second focus loss according to the audio preference coefficient, and fusing with the third focus loss to obtain the cross-modal fusion loss.
In one embodiment, the computer device may employ the following formula as the cross-modal fusion loss:

L_DF = L_F(f_S(X_V; θ_S), f_m(X_A, X_V; θ_AV)) + W_A · L_F(f_S(X_V; θ_S), f_m(X_A; θ_A)) + W_V · L_F(f_S(X_V; θ_S), f_m(X_V; θ_V));

wherein L_F represents the focus loss, f_m(X_V; θ_V) represents the first lip language recognition result output by the master model for the student training sample, f_m(X_A; θ_A) represents the second lip language recognition result output by the master model for the student training sample, f_m(X_A, X_V; θ_AV) represents the third lip language recognition result output by the master model for the student training sample, f_S(X_V; θ_S) represents the student recognition result output by the student model for the student training sample, W_A represents the audio preference coefficient, and W_V represents the video preference coefficient.
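The patent text does not spell out the exact form of the focus (focal) term between the student distribution and a master distribution, so the sketch below is one plausible reading — a (1 − p)^γ-modulated cross entropy between the two distributions — with γ an assumed hyperparameter (PyTorch):

```python
import torch

def focal_distill(student_probs, teacher_probs, gamma=2.0, eps=1e-8):
    # Focal-style distillation term: low-confidence (hard) positions are up-weighted.
    weight = (1.0 - student_probs).pow(gamma)
    return -(weight * teacher_probs * (student_probs + eps).log()).sum(dim=-1).mean()

def cross_modal_fusion_loss(student_probs, av_probs, a_probs, v_probs, w_a, w_v):
    # L_DF = F(student, audio-visual) + W_A * F(student, audio) + W_V * F(student, video)
    return (focal_distill(student_probs, av_probs)
            + w_a * focal_distill(student_probs, a_probs)
            + w_v * focal_distill(student_probs, v_probs))
```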
Step 906, determining student loss according to the cross entropy loss and the cross mode fusion loss.
From the above derivation, the overall student loss in the student training stage can be expressed as follows:

L_s = L_CE(y, f_s(X_V; θ_s)) + λ_a L_DF;

where λ_a represents a regularization balance factor. The optimized parameters θ_s* of the student model are then obtained by minimizing L_s, i.e. θ_s* = argmin_θs L_s.
The computer device performs gradient back propagation through the student loss in the student training stage so as to update the model parameters of the student model. After the student model updated by the current alternate training is obtained, the next round of alternate training continues, that is, the student model updated by the current alternate training continues to be trained alternately with the master model on the training samples until the iteration stop condition is met, and the lip language recognition model is obtained from the updated student model.
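Putting the student training stage together, one optimization step might look roughly like this (PyTorch; the optimizer, `lambda_a`, the master returning three probability outputs, and the assumption that the preference coefficients are already bound into `fusion_loss_fn` are all illustrative choices):

```python
import torch
import torch.nn.functional as F

def student_training_step(student, master, fusion_loss_fn, optimizer,
                          frames, audio, labels, lambda_a=1.0):
    # Only the student parameters are updated in this phase; the master is frozen.
    student_logits = student(frames)
    with torch.no_grad():
        av_probs, a_probs, v_probs = master(audio, frames)   # three master outputs
    ce = F.cross_entropy(student_logits, labels)
    # fusion_loss_fn wraps the dynamic fusion loss with W_A / W_V already bound.
    l_df = fusion_loss_fn(student_logits.softmax(dim=-1), av_probs, a_probs, v_probs)
    loss = ce + lambda_a * l_df
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```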
In one embodiment, encoding the audio output vector into the audio preference coefficient includes: performing full connection processing on the video output vector through a first fully connected layer in the cross-modal fusion network to obtain a video fully connected vector; performing full connection processing on the audio output vector through a second fully connected layer in the cross-modal fusion network to obtain an audio fully connected vector; and concatenating the video fully connected vector and the audio fully connected vector and then performing full connection processing through a third fully connected layer in the cross-modal fusion network to obtain the audio preference coefficient.
A cross-modal fusion network is a network for fusing knowledge of different modalities. The cross-modal fusion network is part of the master model: it is updated in the master training stage and is not updated in the student training stage. In this embodiment, the cross-modal fusion network includes three fully connected layers, namely a first fully connected layer for processing the video information, a second fully connected layer for performing full connection processing on the audio information, and a third fully connected layer for fusing the video information and the audio information; their network parameters may be denoted as θ_FV, θ_FA, and θ_FAV respectively.
Specifically, the computer device may obtain the audio preference coefficient and the video preference coefficient using the following formulas:

H'_A = FC(H_A; θ_FA);
H'_V = FC(H_V; θ_FV);
W = φ(FC(H'_A ⊕ H'_V; θ_FAV));
W_A = W; W_V = 1 − W;

wherein H_V represents the video output vector obtained through the video teaching aid network, H_A represents the audio output vector obtained through the audio teaching aid network, FC(x; θ) represents a fully connected layer with network parameters θ, ⊕ represents the concatenation (tandem) operation, and φ represents the sigmoid function.
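A small sketch of this three-layer cross-modal fusion network (PyTorch; hidden sizes are assumptions):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Three fully connected layers: one for the video tutor vector, one for the audio
    # tutor vector, and one applied to their concatenation; a sigmoid yields W,
    # with W_A = W and W_V = 1 - W.
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.fc_video = nn.Linear(dim, hidden)    # theta_FV
        self.fc_audio = nn.Linear(dim, hidden)    # theta_FA
        self.fc_joint = nn.Linear(2 * hidden, 1)  # theta_FAV

    def forward(self, h_video, h_audio):
        h_v = self.fc_video(h_video)
        h_a = self.fc_audio(h_audio)
        w = torch.sigmoid(self.fc_joint(torch.cat([h_a, h_v], dim=-1)))
        return w, 1.0 - w                          # (W_A, W_V)

w_a, w_v = CrossModalFusion()(torch.randn(2, 256), torch.randn(2, 256))
print(w_a.shape, w_v.shape)
```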
It will be appreciated that during the student training stage the cross-modal fusion network, as part of the master model, is not updated; it is updated during the master training stage through the student feedback loss, that is, the network parameters of all three fully connected layers are updated in the master training stage.
FIG. 11 is a schematic diagram of the model framework for training the student model in the student training stage of the alternate training, in one embodiment. Referring to fig. 11, after the student model updated by the previous alternate training and the master model updated by the current alternate training are obtained, the video frame sequence of the student training sample is input into the student model, the video frame sequence and the audio signal of the student training sample are input into the master model, a cross entropy loss is constructed using the student recognition result of the student model, a cross-modal fusion loss is constructed using the student recognition result of the student model and the output results of the master model, and the student model updated by the current alternate training is obtained after the student model is updated according to the cross entropy loss and the cross-modal fusion loss.
The update process of the student model in the student training stage has been described above. As stated earlier, the manner of constructing the temporary student loss in the master training stage of the alternate training in step 204 is consistent with the manner of constructing the student loss for the student model in the student training stage; the construction of the temporary student loss is therefore only briefly summarized here, and the details can be found in the foregoing description of the update process of the student model in the student training stage, which is not repeated.
In one embodiment, as shown in fig. 12, step 204 of determining the temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on the temporary training sample obtained from the training samples includes:
Step 1202, performing lip language recognition on a video frame sequence in a temporary training sample through a student model to obtain a temporary student recognition result, and constructing cross entropy loss according to the temporary student recognition result and tag data of the temporary training sample.
The student model is temporarily updated in the master training stage in the same way as it is optimized in the student training stage, except that the temporary student model obtained by this temporary update is not stored and is only used by the master model to determine the learning state of the current student model.

In the student training stage, only the model parameters of the student model are updated, and the training targets comprise a cross entropy loss and a cross-modal fusion loss, where the cross entropy loss is used to improve the classification accuracy of the student model and the cross-modal fusion loss is used to match the outputs of the student model and the master model so that the student model learns cross-modal knowledge from the master model. The update performed in the master training stage on the student model updated in the previous alternation, which yields the temporary student model, follows the same processing steps.
Specifically, the computer device may input the video frame sequence of the temporary training sample into the student model updated by previous alternate training, extract the video feature corresponding to the video frame sequence through the feature extraction layer of the student model, obtain the video output vector according to the video feature through the feature mapping layer of the student model, and obtain the temporary student identification result according to the video output vector through the output layer of the student model.
And 1204, constructing a cross-modal fusion loss according to a first lip language identification result, a second lip language identification result and a third lip language identification result which are obtained by performing lip language identification on the temporary training sample according to the temporary student identification result and the master model.
Specifically, the computer device may input the video frame sequence of the temporary training sample into a video processing network of the master model, extract video features corresponding to the video frame sequence through a feature extraction layer of the video processing network, obtain a video output vector according to the video features through a feature mapping layer of the video processing network, and obtain a first lip language recognition result according to the video output vector through an output layer of the video processing network. The computer equipment inputs the audio signal in the temporary training sample into an audio processing network of the master model, extracts audio characteristics corresponding to the audio signal through a characteristic extraction layer of the audio processing network, obtains an audio output vector according to the audio characteristics through a characteristic mapping layer of the audio processing network, and obtains a second lip language recognition result according to the audio output vector through an output layer of the audio processing network. And obtaining an audio-visual combined output vector based on a video output vector obtained by the video processing network according to the video frame sequence and an audio output vector obtained by the audio processing network according to the audio signal through an audio-visual processing network in the master model, and obtaining a third lip language recognition result based on the audio-visual combined output vector.
Then, the computer device can construct the cross-modal fusion loss according to the temporary student recognition result output by the student model and the first lip language recognition result, the second lip language recognition result, and the third lip language recognition result output by the master model.
In one embodiment, step 1204 includes: obtaining a video output vector corresponding to the video frame sequence in the temporary training sample through the pre-trained video teaching aid network and then encoding the video output vector into a video preference coefficient; obtaining an audio output vector corresponding to the audio signal in the temporary training sample through the pre-trained audio teaching aid network and then encoding the audio output vector into an audio preference coefficient; determining a first focus loss according to the temporary student recognition result and the first lip language recognition result, determining a second focus loss according to the temporary student recognition result and the second lip language recognition result, and determining a third focus loss according to the temporary student recognition result and the third lip language recognition result; and weighting the first focus loss according to the video preference coefficient, weighting the second focus loss according to the audio preference coefficient, and fusing them with the third focus loss to obtain the cross-modal fusion loss.

In one embodiment, encoding the audio output vector into the audio preference coefficient includes: performing full connection processing on the video output vector through the first fully connected layer in the cross-modal fusion network to obtain a video fully connected vector; performing full connection processing on the audio output vector through the second fully connected layer in the cross-modal fusion network to obtain an audio fully connected vector; and concatenating the video fully connected vector and the audio fully connected vector and then performing full connection processing through the third fully connected layer in the cross-modal fusion network to obtain the audio preference coefficient.
Step 1206, determining temporary student loss based on the cross entropy loss and the cross modal fusion loss.
From the above derivation, the temporary student loss in the master training stage can be expressed as follows:

L_s = L_CE(y, f_s(X_V; θ_s)) + λ_a L_DF;

wherein y represents the tag data of the temporary training sample, f_s(X_V; θ_s) represents the temporary student recognition result obtained by the student model updated in the previous alternate training performing lip language recognition on the temporary training sample, L_CE represents the cross entropy loss, and L_DF represents the cross-modal fusion loss.
As previously derived, in the master training stage, after obtaining the temporary student loss, the computer device may use the following formula to obtain the temporary student model:

θ_ts = θ_s − α∇_θs L_s;

wherein L_s represents the temporary student loss determined using the temporary training sample, θ_s represents the model parameters of the student model updated by the previous alternate training, θ_ts represents the model parameters of the temporary student model, and α represents the learning rate.
In the master training stage, after obtaining the temporary student model, the computer device may construct the student feedback loss using the following formula:

L_ts = L_CE(y_1, f_ts(X_V; θ_ts));
As derived above, the computer device performs gradient back propagation through the student feedback loss in the master training stage so as to update the parameters of the fully connected layers in the cross-modal fusion network; the network parameters of those fully connected layers are therefore trained in the master training stage.
Fig. 13 is a schematic diagram of the network structure for alternately training the master model and the student model in one specific embodiment. Referring to fig. 13, the network includes four modules: a master model (master), a student model (student), a pre-trained audio teaching aid network (tutorA), and a pre-trained video teaching aid network (tutorV), where the subscripts A and V denote the audio and video modalities respectively. The master model is a model based on the combination of the video stream and the audio stream, the student model and the video teaching aid network are both video-stream-based models, and the audio teaching aid network is an audio-stream-based model. The master model takes the audio signal X_A and the video frame sequence X_V as input and provides three types of knowledge: f_m(X_A; θ_A) generated from the audio stream, f_m(X_V; θ_V) generated from the video stream, and f_m(X_A, X_V; θ_AV) generated from the audio-visual combination. The student model takes the video frame sequence X_V as input and outputs the probability f_s(X_V; θ_s); the video teaching aid network takes the video frame sequence X_V as input and outputs the probability f_tV(X_V; θ_tV); the audio teaching aid network takes the audio signal X_A as input and outputs the probability f_tA(X_A; θ_tA).
In the student training stage of the alternate training, only the model parameters θ_s of the student model are updated, and the training target comprises two items: the cross entropy loss of the student training sample and the dynamic fusion loss. The video frame sequence of the student training sample is input into the student model to obtain the student recognition result f_s(X_V; θ_s), and the cross entropy loss is constructed from the student recognition result f_s(X_V; θ_s) and the tag data y of the student training sample. The video frame sequence and the audio signal of the student training sample are input into the master model to obtain f_m(X_A; θ_A), f_m(X_V; θ_V), and f_m(X_A, X_V; θ_AV). The video frame sequence of the student training sample is input into the video teaching aid network to obtain the video output vector H_V, and the audio signal of the student training sample is input into the audio teaching aid network to obtain the audio output vector H_A. The dynamic fusion loss is constructed from f_s(X_V; θ_s), f_m(X_A; θ_A), f_m(X_V; θ_V), f_m(X_A, X_V; θ_AV), H_V, and H_A.
In the master training stage of the alternate training, only the model parameters of the master model, including the cross-modal fusion network, are updated, and the training target comprises two items: the student feedback loss and the master recognition loss. First, a temporary training sample is input into the student model, the master model, the video teaching aid network, and the audio teaching aid network following the same steps as in training the student model in the student training stage, so as to obtain the temporary student loss, and the student model is updated again according to the temporary student loss to obtain the temporary student model (temporary student). Next, using a verification sample, the video frame sequence of the verification sample is input into the temporary student model to obtain the temporary student recognition result f_ts(X_V; θ_ts), and the student feedback loss is constructed from the temporary student recognition result f_ts(X_V; θ_ts) and the tag data y_1 of the verification sample. Using a master training sample, the video frame sequence and the audio signal of the master training sample are input into the master model, and the master recognition loss is constructed from the obtained f_m(X_A; θ_A), f_m(X_V; θ_V), and f_m(X_A, X_V; θ_AV) and the tag data y_2 of the master training sample.
The student model and the master model continue to be updated in turn according to the above flow until the training stop condition is met, and the lip language recognition model is obtained from the optimized student model.
In the related art, when training samples are selected from a training set, they are generally sampled at random and then input into the model to be optimized; the training samples are not ordered in this way, which affects the effectiveness of the training process to a certain extent. Therefore, the embodiment of the application adopts a curriculum learning strategy that lets the model learn lip language recognition knowledge from simple samples first and gradually increases the sample difficulty, which helps the model converge better.
In one embodiment, the method further comprises: determining a learning difficulty coefficient corresponding to each of the training samples; and, in the process of training the student model and the master model, sequentially selecting the student training samples and the master training samples required by the alternate training from the training samples in ascending order of the learning difficulty coefficient.
Specifically, after the computer device obtains the training set, a corresponding learning difficulty coefficient is determined for each training sample in the training set. The smaller the learning difficulty coefficient is, the easier it is for the model to classify the training sample and the lower the learning difficulty of the training sample is; conversely, the larger the learning difficulty coefficient is, the harder it is for the model to classify the training sample and the higher the learning difficulty of the training sample is. In the master training stage and the student training stage of the alternate training, when training samples are obtained from the training set, they are selected sequentially in ascending order of the learning difficulty coefficient and then input into the model.
In one embodiment, determining a learning difficulty coefficient corresponding to each of the training samples includes: processing the video frame sequences in each training sample through a pre-training video teaching-aid network to obtain video confidence coefficients of lip language prediction categories of each training sample; processing the audio signals in each training sample through a pre-trained audio teaching aid network to obtain the audio confidence coefficient of the lip language prediction category of each training sample; and fusing the video confidence coefficient and the audio confidence coefficient to obtain the category confidence coefficient of each training sample, and determining the learning difficulty coefficient corresponding to each training sample according to the category confidence coefficient.
The confidence coefficient is inversely proportional to the learning difficulty coefficient, and the higher the category confidence coefficient is, the more easily the model predicts the training sample accurately, and the lower the learning difficulty coefficient of the training sample is, otherwise, the lower the category confidence coefficient is, the higher the learning difficulty coefficient of the training sample is.
In one embodiment, the computer device obtains the learning difficulty coefficient of a training sample using a scoring function that fuses the tutor-network confidences of its audio and video modalities and sorts the training samples accordingly. In the scoring function, x_A^n represents the n-th segment in the audio signal, x_V^m represents the m-th video frame in the video frame sequence, C(·) represents the confidence level, and sort(·) represents the sorting operation. The higher the fused confidence is, the easier the training sample is for the model to learn, and the lower its learning difficulty coefficient is. Optionally, when multiple training samples have the same learning difficulty coefficient, their confidences in the video modality can be used as a tie-breaker, and training samples with higher confidence in the video modality are selected preferentially.
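The scoring function itself does not survive in this text, but the ordering it induces can be sketched as follows (assumed: the two tutor-network confidences of the labelled class are multiplied to give the fused confidence, and the video-modality confidence breaks ties):

```python
def order_by_difficulty(samples, video_tutor_conf, audio_tutor_conf):
    # samples: list of sample indices; *_conf: dicts mapping index -> confidence of the
    # labelled class under the corresponding pre-trained teaching aid network.
    def key(i):
        fused = video_tutor_conf[i] * audio_tutor_conf[i]   # assumed fusion rule
        return (-fused, -video_tutor_conf[i])               # easy (high confidence) first
    return sorted(samples, key=key)
```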
In one embodiment, the method further comprises: according to the current iteration times, determining the number of target samples required by the current alternate training, wherein the number of target samples gradually increases along with the iteration times; the training samples with the target sample number are obtained for current alternate training.
For example, in the first alternate training, the computer device acquires 10 mini-batches of training samples in the master training stage and 10 mini-batches in the student training stage, optimizing the corresponding model 10 times, with 30 training samples per mini-batch; in the next alternate training, the computer device still acquires 10 mini-batches in each stage and optimizes the corresponding model 10 times, but each mini-batch contains 40 training samples.

For another example, in the first alternate training, the computer device acquires 10 mini-batches of training samples in the master training stage and in the student training stage and optimizes the corresponding model 10 times, with the mini-batch size increasing each time: the first mini-batch contains 10 samples, the second 15, the third 20, and so on, up to 55 samples in the tenth mini-batch.
In one embodiment, the computer device uses a pacing function to determine how the number of training samples grows during the training process. In the pacing function, G_i represents the percentage of the training samples made available in the i-th iteration, G_0 is the initial percentage, P is an exponential factor (P may take 1.75), and ζ represents the number of iterations of the alternate training.
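The pacing function itself is likewise lost in this text; the sketch below shows one exponential-style schedule consistent with the surrounding description (initial fraction G_0, exponent P ≈ 1.75, clipped at 100%), offered as an assumption rather than the patent's exact formula:

```python
def pacing_fraction(iteration, g0=0.2, p=1.75, total_iters=100):
    # Fraction of the (difficulty-ordered) training set made available at this iteration.
    g = g0 + (1.0 - g0) * (iteration / total_iters) ** (1.0 / p)
    return min(1.0, g)

schedule = [round(pacing_fraction(i), 3) for i in (0, 10, 50, 100)]
print(schedule)   # grows from g0 towards 1.0
```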
Based on the scoring function and the pacing function, the difficulty of the training samples and the growth of their number can be determined more reasonably; this strategy reduces the learning ambiguity at the beginning of training and helps the learner (the student model) converge better.
In one embodiment, the method further comprises: acquiring a video frame sequence to be identified; inputting a video frame sequence to be identified into a trained lip language identification model; and processing the video frame sequence to be identified through a video processing network in the lip language identification model, and outputting speaking content corresponding to a speaker in the video frame sequence to be identified.
Specifically, at the end of training, the computer device may obtain a lip recognition model from the student model. The computer device may directly use the lip recognition model. The computer equipment can also obtain the model parameters of the lip language identification model, set the model structure of the student model when needed and import the model parameters to obtain the lip language identification model.
The obtained lip language recognition model is based on a video processing network, and a computer device such as a terminal or a server can input the video frame sequence to be processed into the trained lip language recognition model and output the speaking content corresponding to the speaker in the video frame sequence to be recognized. The video frame sequence to be processed may come from a silent video or from a video with audio; for example, in a noisy environment, when the speaker in the video cannot be heard, the speaking content can still be recognized through the lip language recognition model.
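A minimal inference sketch with the trained, video-only lip language recognition model (hypothetical helper names; mouth-region cropping and other preprocessing are omitted):

```python
import torch

@torch.no_grad()
def recognize_speech_from_video(lip_model, video_frames, vocabulary):
    # video_frames: tensor of shape (1, 1, T, H, W) holding the frame sequence.
    lip_model.eval()
    probs = lip_model(video_frames).softmax(dim=-1)
    return vocabulary[probs.argmax(dim=-1).item()]   # word-level prediction
```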
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.
In a specific embodiment, as shown in fig. 14, the processing method of the lip language identification model includes the following steps:
Step 1402, obtaining training samples and obtaining a student model and a master model updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
The processing steps in the master training stage comprise:
Step 1404, obtaining a temporary training sample from the training samples;
Step 1406, inputting the temporary training sample into a student model based on a video stream, obtaining a temporary student identification result, and constructing cross entropy loss according to the temporary student identification result and tag data of the temporary training sample.
Step 1408, inputting the temporary training sample into the master model based on the video stream and the audio stream, and constructing a cross-modal fusion loss according to the temporary student recognition result and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model performing lip language recognition on the temporary training sample.
Step 1410, determining a temporary student loss according to the cross entropy loss and the cross-modal fusion loss, and updating the student model based on the temporary student loss to obtain a temporary student model.
Step 1412, obtaining a verification sample from the training sample, inputting the verification sample into a temporary student model, obtaining a lip language recognition result, and constructing a student feedback loss according to the lip language recognition result and the label data of the verification sample.
Step 1414, obtaining a master training sample from the training samples, inputting the master training sample into the master model, and determining the master recognition loss according to the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model performing lip language recognition on the master training sample, together with the tag data of the master training sample.
Step 1416, updating the master model based on the student feedback loss and the master identification loss.
The processing steps in the student training phase include:
step 1418, obtaining a student training sample from the training sample, inputting the student training sample into a student model based on a video stream, obtaining a student identification result, and constructing cross entropy loss according to the student identification result and label data of the student training sample.
Step 1420, inputting the student training sample into the master model based on the video stream and the audio stream, and constructing a cross-modal fusion loss according to the student recognition result and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model performing lip language recognition on the student training sample.
Step 1422, determining a student loss according to the cross entropy loss and the cross modal fusion loss, and updating the student model based on the student loss.
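As an illustration of how steps 1418 to 1422 combine the two losses, the following is a minimal sketch in which the cross-modal fusion loss is approximated by preference-weighted KL-divergence distillation terms toward the master's three outputs; the embodiment itself uses focus (focal) losses and coefficients produced by the teaching aid networks, so the exact form and the fusion_weight value here are assumptions.

```python
import torch
import torch.nn.functional as F

def student_phase_loss(student_logits, labels,
                       master_video_logits, master_audio_logits, master_av_logits,
                       video_pref, audio_pref, fusion_weight=10.0):
    """Sketch of the student-phase objective: label supervision plus a
    preference-weighted distillation term toward the master's three outputs."""
    ce = F.cross_entropy(student_logits, labels)          # cross entropy from label data

    def distill(teacher_logits):
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        F.softmax(teacher_logits, dim=-1),
                        reduction="batchmean")

    fusion = (video_pref * distill(master_video_logits)   # video-modality knowledge
              + audio_pref * distill(master_audio_logits) # audio-modality knowledge
              + distill(master_av_logits))                # audio-visual joint knowledge
    return ce + fusion_weight * fusion

# toy usage: a batch of 4 word-level predictions over a 500-word vocabulary
logits = torch.randn(4, 500)
loss = student_phase_loss(logits, torch.randint(0, 500, (4,)),
                          torch.randn(4, 500), torch.randn(4, 500), torch.randn(4, 500),
                          video_pref=0.6, audio_pref=0.4)
```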
Fig. 15 is a schematic flow chart of a method for processing a lip language recognition model in one embodiment. Fig. 15 is a method for processing a lip language recognition model based on a training phase of students, specifically including the following steps:
In step 1502, training samples are obtained and a student model and a master model updated in the previous alternate training are obtained, each training sample including a sequence of video frames and a corresponding audio signal.
Step 1504, performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model, obtaining a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample.
Step 1506, constructing a cross-modal fusion loss according to the student recognition result, the first lip language recognition result obtained by the video processing network in the master model performing lip language recognition on the student training sample, the second lip language recognition result obtained by the audio processing network in the master model performing lip language recognition on the student training sample, and the third lip language recognition result obtained by the audio-visual processing network in the master model based on the video frame sequence and the audio signal.
Step 1508, determining student loss from the cross-entropy loss and the cross-modal fusion loss.
Step 1510, updating the student model updated by the previous alternate training according to the student loss to obtain the student model updated by the current alternate training, and performing model training on the master model updated by the previous alternate training based on the student model updated by the current alternate training and the training samples, so as to obtain the master model updated by the current alternate training.
Step 1512, based on the student model and the master model updated by the current alternate training, returning to the step of acquiring the student model and the master model updated by the previous alternate training to continue the alternate training, and obtaining the lip language recognition model from the updated student model when training stops.
Specific embodiments of the above steps have been described in the foregoing, and are not repeated here.
Compared with the traditional way of guiding the student model to learn with a pre-trained teacher model, the above processing method of the lip language recognition model trains not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into alternating student training stages and master training stages.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of the student training sample. In addition, the video processing network in the master model extracts video-modality knowledge from the student training sample, the audio processing network of the master model extracts audio-modality knowledge, and the audio-visual processing network of the master model extracts the combined audio-visual knowledge of the student training sample; the cross-modal fusion loss obtained by fusing the knowledge of these three modalities enables the student model to learn the master model's ability to mine multi-modal information. Training the student model jointly with the cross entropy loss and the cross-modal fusion loss greatly improves its learning effect. After the student model updated by the current alternate training is obtained, it is used together with the training samples to train, in the master training stage, the master model updated by the previous alternate training, and after many iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
The following describes the evaluation effect of the model training method provided by the embodiment of the application.
Regarding the datasets used for training: to evaluate the method provided by the embodiments of the application, three benchmark datasets are used, namely one word-level dataset, LRW, and two sentence-level datasets, LRS2-BBC and LRS3-TED. The LRW dataset is a large word-level dataset with 500 words and 450,000 utterances; each video is 1.16 seconds long and contains 29 frames. The LRS2-BBC dataset comes from BBC broadcasts and is divided into a pre-training set, a fine-tuning training set, and a validation set. The LRS3-TED dataset comes from TED talks and comprises 150,000 utterances and over 4.2 million words.
Preprocessing of training samples: to crop the lip region from the videos, facial landmarks are detected using dlib, and the result is randomly cropped and interpolated to yield 112 × 112 lip-centered images, which are also rotated and scaled.
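For illustration, a rough sketch of such a lip-region cropping step with dlib and OpenCV is given below; the landmark model file, the mouth-landmark indices used for centering, and the interpolation settings are assumptions rather than the exact preprocessing of the embodiment.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# shape_predictor_68_face_landmarks.dat must be downloaded separately
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_region(frame_bgr, size=112):
    """Crop a lip-centered patch from one video frame (illustrative sketch only)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # landmarks 48-67 outline the mouth in the 68-point dlib model
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)
    half = size // 2
    patch = frame_bgr[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(patch, (size, size), interpolation=cv2.INTER_LINEAR)
```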
Details regarding implementation: in the word-level lip language recognition scenario, the vocabulary size is set to 500, consistent with the vocabulary of LRW. For the sentence-level lip language recognition scenario on LRS2-BBC and LRS3-TED, the vocabulary size is set to 40, including 26 letters, 10 numbers, and 4 special marks ([space], [keyboard], [EOS], and punctuation marks).
In addition, during the training process, the student model and the master model are trained alternately using the SGD optimizer with momentum 0.9 and weight decay 1e-4. In the audio stream, the original waveform is taken as input. In the video stream, the input video is sampled at 25 fps.
The whole training process comprises two steps: pre-training and fine-tuning. Specifically, the student model and the master model are pre-trained at word level using a Temporal Convolution (TC) based backend on the pre-training sets of LRW, LRS2-BBC and LRS3-TED, and the pre-trained models are fine-tuned using LRW. In the sentence-level lip language recognition scenario, TM-Seq2Seq replaces TC as the backend of the pre-trained model, training continues on the pre-training set of LRS2-BBC or LRS3-TED, and the corresponding training set is then used to fine-tune the new pre-trained model.
In pre-training, the learning rate α is set to 10^-3. During fine-tuning, α is initialized to 10^-4 and halved each time the validation loss curve flattens, down to a final learning rate of 10^-6. Some of the hyper-parameters in the foregoing formulas are set as follows: λ_S = 10, λ_M = 10, G_0 = 0.25, P = 1.75, ζ = 107.
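A minimal PyTorch sketch of this optimizer and learning-rate schedule is shown below; the placeholder module, the patience value, and the use of ReduceLROnPlateau to implement the "halve on plateau" rule are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# The Linear module is only a placeholder standing in for the student (or master) network.
model = nn.Linear(512, 500)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,        # fine-tuning starts at 1e-4
                            momentum=0.9, weight_decay=1e-4)
# Halve the learning rate whenever the validation loss stops improving,
# but never go below the stated floor of 1e-6.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, min_lr=1e-6)

# after each validation pass:
# scheduler.step(validation_loss)
```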
Regarding the evaluation index: in all experiments, the Word Error Rate (WER) is used as the metric, defined as WER = (S + D + I) / NUM, where S, D and I are the numbers of words that are substituted, deleted and inserted in the prediction compared with the tag data, and NUM is the total number of words in the tag data.
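For reference, WER can be computed with a standard word-level edit distance, as in the short sketch below; this follows the common definition and is not the evaluation code of the embodiment.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / NUM via edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1    # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,                # deletion
                           dp[i][j - 1] + 1,                # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit down"))   # 0.67: one substitution, one insertion
```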
Table 1: WER obtained on LRW, LRS2-BBC and LRS3-TED
Table 2: error Rate on LRW
Method of THESE THERE THING UNDER
Ours (without discontilation) 74% 70% 70% 66%
Ours (with distilation) 70% 59% 68% 60%
Table 3: WER of students learned from different pre-trained teachers or co-trained masters (evaluated on LRS2-BBC; "x" indicates no distillation source)

Method                 Distill from           WER on LRS2-BBC
Audio Teacher          x                      17.2
Student1               Audio Teacher          54.2
Video Teacher          x                      57.5
Student2               Video Teacher          53.4
Audio-Visual Teacher   x                      15.6
Student3               Audio-Visual Teacher   54.1
Audio Master           x                      19.1
Student4               Audio Master           52.1
Video Master           x                      59.1
Student5               Video Master           53.0
Audio-Visual Master    x                      16.9
Student6               Audio-Visual Master    51.5
Comparison with the related art: the method provided by the embodiments of the present application was compared with several methods, including MT, Temporal Conv, WAS, Bi-LSTM, TM-CTC, TM-Seq2Seq, Conv-Seq2Seq, LIBS, and TM-CTC-KD.
For word-level lip language recognition, Table 1 shows a quantitative comparison with related methods on the LRW dataset. It can be seen that the Ours-TC model provided by the embodiments of the present application is significantly better than the baseline temporal convolution without knowledge distillation (Temporal Conv), improving the WER by 6.7%. Furthermore, Ours-TM achieves the best performance compared with the other methods; in particular, the improvement over the second-best method, Conv-Seq2Seq, is 2%.
For sentence-level lip language recognition, the experimental results are listed in the last two columns of Table 1. It can be observed that the TM provided by the embodiments of the present application performs best on LRS2-BBC and LRS3-TED compared with the other methods. More importantly, the method provided by the embodiments of the present application improves LRS2-BBC and LRS3-TED by 0.6% and 0.9% respectively while using less training data than TM-Seq2Seq, which employs the same backend as the TM provided by the embodiments of the present application and is additionally trained on the non-public dataset MV-LRS. In addition, compared with Conv-Seq2Seq, which uses a structure more advanced than the student model provided by the embodiments of the present application, the TM provided by the embodiments of the present application still achieves better performance, improving the WER on LRS2-BBC by 2.5% and on LRS3-TED by 1.1%.
Examples of misclassifications: the inventors further investigated the four LRW words with the highest error rates and list in Table 2 a comparison of our TC without KD and our TC with KD. It can be observed that when multiple phonemes map to one viseme, for example the TH and DH phonemes mapping to the same viseme as /t/, the accuracy of the method provided by the embodiments of the application improves on average by approximately 6%.
In summary, the research results show that: (i) the master model distillation method provided by the embodiments of the application can effectively improve the performance of a task-specific network; (ii) although the model provided by the embodiments of the present application mainly demonstrates its advantage over standard distillation methods, better performance can be obtained when the task-specific network structure is replaced with a more advanced one.
Regarding ablation experiments: the effectiveness of the proposed modules, including the master model, the cross-modal fusion network and the curriculum learning strategy, was investigated, using a single-modality lip language recognition network as the baseline.
Validity of the master: to investigate the performance of the master, the inventors studied six pairs of teacher or master designs in different modalities and tested the performance of each on LRS2-BBC. The results are summarized in Table 3. The reported audio-visual master result comes from its audio-visual branch, and each pre-trained teacher is identical in architecture to its corresponding master. Furthermore, the curriculum learning strategy is not used here.
The inventors observed and analyzed the following. (I) Without KD, a single model's performance in the different modalities always ranks, in descending order, {audio-visual modality (AV), audio modality (A), video modality (V)}, whether the model is trainable (i.e. a master model) or not trainable (i.e. a teacher model). This verifies the importance of learning from cross-modal data rather than uni-modal data. (II) When knowledge is distilled from the teacher models and from the master models, the students' performance in the different modalities, ordered from best to worst, is {V, AV, A} and {AV, A, V} respectively. The first ordering means that the audio-visual modality can provide additional information compared with the audio modality, helping to mitigate the ambiguity caused by the cross-modal gap, but the benefit is limited when a simple fusion strategy (cascade concatenation) is used. The second ordering shows the effectiveness of the master model, which can reduce the cross-modal difference to some extent because it is dynamically adjusted based on task-specific feedback from the student model. (III) Regardless of the modality used, a student model learned from a master model always performs better than one learned from a teacher model. These facts indicate that the co-trained master model is more effective than a pre-trained teacher model because of its adaptability to the student model, despite some sacrifice in its own performance.
In one embodiment, as shown in fig. 16, a processing apparatus 1600 of a lip language recognition model is provided, which may use a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes: a sample acquisition module 1602, a temporary student model acquisition module 1604, a master model training module 1606, and an iteration module 1608, wherein:
The sample obtaining module 1602 is configured to obtain training samples and obtain a student model and a master model updated by previous alternate training, where each training sample includes a video frame sequence and a corresponding audio signal;
The temporary student model acquisition module 1604 is configured to determine temporary student loss according to results obtained by performing lip language recognition on temporary training samples acquired from training samples according to a student model and a master model, and update the student model based on the temporary student loss to acquire a temporary student model;
The master model training module 1606 is configured to determine a student feedback loss according to a result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the tag data of the verification sample, and to determine a master recognition loss according to a result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the tag data of the master training sample; obtain a master model updated by the current alternate training according to the student feedback loss and the master recognition loss, and perform model training on the student model updated by the previous alternate training based on the master model updated by the current alternate training and the training samples, so as to obtain a student model updated by the current alternate training;
and the iteration module 1608 is used for returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training to continue the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
In one embodiment, the processing device 1600 of the lip language recognition model further includes a student recognition module for inputting a sequence of video frames in the training sample into the student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to video features through a feature mapping layer of the student model; and obtaining a lip language identification result according to the video output vector through an output layer of the student model.
In one embodiment, the processing device 1600 of the lip language recognition model further includes a master recognition module for inputting training samples into the master model; processing a video frame sequence in a training sample through a video processing network in a master model to obtain a first lip language identification result; processing the audio signal in the training sample through an audio processing network in the master model to obtain a second lip language identification result; and obtaining an audio-visual combined output vector based on a video output vector obtained by the video processing network according to the video frame sequence and an audio output vector obtained by the audio processing network according to the audio signal through an audio-visual processing network in the master model, and obtaining a third lip language recognition result based on the audio-visual combined output vector.
In one embodiment, the master identification module is further configured to input a sequence of video frames in the training samples into a video processing network of the master model; extracting video features corresponding to the video frame sequences through a feature extraction layer of the video processing network, obtaining video output vectors according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language recognition result according to the video output vectors through an output layer of the video processing network.
In one embodiment, the master identification module is further configured to input the audio signal in the training sample into an audio processing network of the master model; extracting audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtaining audio output vectors according to the audio features through a feature mapping layer of the audio processing network, and obtaining a second lip language recognition result according to the audio output vectors through an output layer of the audio processing network.
In one embodiment, when the student model is used for word-level lip language recognition, the master recognition module is further for inputting the video output vector and the audio output vector into an audiovisual processing network of the master model; and cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
In one embodiment, when the student model is used for sentence-level lip language recognition, the master recognition module is further for determining feature vectors of the previously output characters; inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence, and the audio output vector obtained by the audio processing network according to the audio signal into the audio-visual processing network of the master model; the multi-head attention coding layer of the audio-visual processing network is used for obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector; and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
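A minimal sketch of such a sentence-level audio-visual branch is given below; the embedding size, the number of attention heads, and the use of PyTorch's nn.MultiheadAttention are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Sketch: the feature vector of the previously output characters attends
    separately to the video and audio output vectors, and the two encodings are
    concatenated (cascaded) before classification."""
    def __init__(self, dim=512, heads=8, vocab=40):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, char_feat, video_vec, audio_vec):
        # char_feat: (B, Tc, dim) features of previously output characters
        # video_vec / audio_vec: (B, Tv, dim) / (B, Ta, dim) modality output vectors
        video_enc, _ = self.video_attn(char_feat, video_vec, video_vec)
        audio_enc, _ = self.audio_attn(char_feat, audio_vec, audio_vec)
        joint = torch.cat([video_enc, audio_enc], dim=-1)   # audio-visual combined output vector
        return self.out(joint)                              # third lip language recognition logits

fusion = AudioVisualFusion()
logits = fusion(torch.randn(2, 12, 512), torch.randn(2, 29, 512), torch.randn(2, 100, 512))
```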
In one embodiment, the temporary student model obtaining module 1604 is further configured to perform lip language recognition on the video frame sequence in the temporary training sample through the student model, obtain a temporary student recognition result, and construct cross entropy loss according to the temporary student recognition result and tag data of the temporary training sample; constructing a cross-modal fusion loss according to a temporary student identification result, a first lip identification result, a second lip identification result and a third lip identification result which are obtained by performing lip identification on a temporary training sample by a master model; and determining temporary student loss according to the cross entropy loss and the cross-modal fusion loss.
In one embodiment, the temporary student model acquisition module 1604 is further configured to obtain, through a pre-trained video teaching aid network, a video output vector corresponding to the video frame sequence in the temporary training sample and then encode the video output vector into a video preference coefficient; obtain, through a pre-trained audio teaching aid network, an audio output vector corresponding to the audio signal in the temporary training sample and then encode the audio output vector into an audio preference coefficient; determine a first focus loss according to the temporary student recognition result and the first lip language recognition result, a second focus loss according to the temporary student recognition result and the second lip language recognition result, and a third focus loss according to the temporary student recognition result and the third lip language recognition result; and weight the first focus loss according to the video preference coefficient, weight the second focus loss according to the audio preference coefficient, and fuse it with the third focus loss to obtain the cross-modal fusion loss.
In one embodiment, the temporary student model acquisition module 1604 is further configured to perform full connection processing on the video output vector through a first full connection layer in the cross-modal fusion network to obtain a video full connection vector; performing full connection processing on the audio output vector through a second full connection layer in the cross-modal fusion network to obtain an audio full connection vector; and connecting the video full-connection vector and the audio full-connection vector in series through a third full-connection layer in the cross-modal fusion network, and then performing full-connection processing to obtain the audio preference coefficient.
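A minimal sketch of such a cross-modal fusion (preference) network is shown below; the hidden size and the sigmoid squashing of the coefficient are assumptions rather than the exact design of the embodiment.

```python
import torch
import torch.nn as nn

class PreferenceNet(nn.Module):
    """Sketch: two fully connected layers project the video and audio output
    vectors, the results are concatenated, and a third fully connected layer
    yields a preference coefficient."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.fc_video = nn.Linear(dim, hidden)   # first fully connected layer
        self.fc_audio = nn.Linear(dim, hidden)   # second fully connected layer
        self.fc_out = nn.Linear(2 * hidden, 1)   # third fully connected layer

    def forward(self, video_vec, audio_vec):
        v = torch.relu(self.fc_video(video_vec))
        a = torch.relu(self.fc_audio(audio_vec))
        coeff = torch.sigmoid(self.fc_out(torch.cat([v, a], dim=-1)))
        return coeff                             # audio (or video) preference coefficient

pref = PreferenceNet()(torch.randn(4, 512), torch.randn(4, 512))   # shape (4, 1)
```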
In one embodiment, master model training module 1606 is also used to input a sequence of video frames in the validation sample into the student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to video features through a feature mapping layer of the student model; obtaining a lip language identification result according to the video output vector through an output layer of the student model; and constructing cross entropy loss according to the lip language identification result and the label data of the verification sample, and using the cross entropy loss as student feedback loss.
In one embodiment, the master model training module 1606 is further configured to input the master training sample into the master model to obtain a corresponding first lip language recognition result, second lip language recognition result and third lip language recognition result; determine a first cross entropy loss according to the label data of the master training sample and the first lip language recognition result, a second cross entropy loss according to the label data of the master training sample and the second lip language recognition result, and a third cross entropy loss according to the label data of the master training sample and the third lip language recognition result; and fuse the first cross entropy loss, the second cross entropy loss and the third cross entropy loss to obtain the master recognition loss.
In one embodiment, the processing device 1600 of the lip language recognition model further includes a student training module for obtaining a student training sample from the training samples; determining a student loss according to the result obtained by the student model updated by the previous alternate training performing lip language recognition on the student training sample and the result obtained by the master model updated by the current alternate training performing lip language recognition on the student training sample; and updating the student model updated by the previous alternate training according to the student loss to obtain the student model updated by the current alternate training.
In one embodiment, the student training module is further configured to perform lip language recognition on the video frame sequence in the student training sample through the student model updated by the previous alternate training to obtain a student recognition result, and construct a cross entropy loss according to the student recognition result and the label data of the student training sample; construct a cross-modal fusion loss according to the student recognition result and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model updated by the current alternate training performing lip language recognition on the student training sample; and determine the student loss according to the cross entropy loss and the cross-modal fusion loss.
In one embodiment, the processing device 1600 of the lip language recognition model further includes a training sample selection module, configured to determine a learning difficulty coefficient corresponding to each of the training samples, and, in the training process of the student model and the master model, to sequentially select, from the training samples, the student training samples and the master training samples required by the alternate training in order of learning difficulty coefficient from small to large.
In one embodiment, the training sample selection module is further configured to process, through a pre-trained video teaching-aid network, a video frame sequence in each training sample, to obtain a video confidence level of a lip prediction category of each training sample; processing the audio signals in each training sample through a pre-trained audio teaching aid network to obtain the audio confidence coefficient of the lip language prediction category of each training sample; and fusing the video confidence coefficient and the audio confidence coefficient to obtain the category confidence coefficient of each training sample, and determining the learning difficulty coefficient corresponding to each training sample according to the category confidence coefficient.
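As a rough illustration, the difficulty coefficient could be derived from the two confidences as in the sketch below; the equal-weight fusion and the 1 − confidence mapping are assumptions, not the embodiment's exact rule.

```python
def learning_difficulty(video_confidence, audio_confidence, alpha=0.5):
    """Fuse the two teaching aid networks' confidences for the ground-truth class
    and map higher category confidence to lower learning difficulty."""
    fused = alpha * video_confidence + (1.0 - alpha) * audio_confidence
    return 1.0 - fused   # low category confidence -> high learning difficulty

samples = {"s1": (0.9, 0.8), "s2": (0.4, 0.5), "s3": (0.7, 0.2)}
# order samples from easy to hard, as required for curriculum-style selection
curriculum = sorted(samples, key=lambda k: learning_difficulty(*samples[k]))
```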
In one embodiment, the processing device 1600 of the lip language recognition model further includes a training sample number determining module, configured to determine, according to the current iteration number, a target sample number required for the alternate training at the current time, where the target sample number gradually increases with the iteration number; the training samples with the target sample number are obtained for current alternate training.
In one embodiment, the processing device 1600 of the lip language identification model further includes an identification module, configured to obtain a video frame sequence to be identified; inputting a video frame sequence to be identified into a trained lip language identification model; and processing the video frame sequence to be identified through a video processing network in the lip language identification model, and outputting speaking content corresponding to a speaker in the video frame sequence to be identified.
Compared with the traditional way of guiding the student model to learn with a pre-trained teacher model, the processing device 1600 of the lip language recognition model trains not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into alternating student training stages and master training stages.
Specifically, in the master training stage, the student model updated by the previous alternate training is updated again using the temporary training sample to obtain a temporary student model, which serves as a continuously updated auxiliary model. The temporary student model feeds back the current learning state to the master model through the verification sample; that is, the student feedback loss guides the master model to adaptively adjust its teaching knowledge according to feedback from the current lip language recognition task. In addition, the master model is also supervised by the master training sample, and its teaching content is adjusted through the master recognition loss determined by the master training sample. That is, the supervision information in the master model's training includes two parts: the student feedback loss, which reflects the current learning state of the student model, and the master recognition loss, which reflects the current teaching ability of the master model. Adjusting the master model according to these two losses improves the accuracy of its teaching knowledge while allowing the teaching content to be adjusted flexibly and dynamically, thereby improving the overall knowledge distillation effect. Therefore, after the master model updated by the current alternate training is obtained, it is used together with the training samples to train, in the student training stage, the student model updated by the previous alternate training, and after many iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
In one embodiment, as shown in fig. 17, a processing apparatus 1700 of a lip language recognition model is provided, which may employ a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes: a sample acquisition module 1702, a tag loss construction module 1704, a cross-modal fusion loss construction module 1706, a student model update module 1708, and an iteration module 1710, wherein:
A processing apparatus for a lip language recognition model, the apparatus comprising:
The sample obtaining module 1702 is configured to obtain training samples and obtain a student model and a master model updated by previous alternate training, where each training sample includes a video frame sequence and a corresponding audio signal;
The label loss construction module 1704 is configured to perform lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model, obtain a student recognition result, and construct cross entropy loss according to the student recognition result and label data of the student training sample;
The cross-modal fusion loss construction module 1706 is configured to construct a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by the video processing network in the master model performing lip language recognition on the student training sample, a second lip language recognition result obtained by the audio processing network in the master model performing lip language recognition on the student training sample, and a third lip language recognition result obtained by the audio-visual processing network in the master model based on the video frame sequence and the audio signal;
The student model updating module 1708 is configured to determine a student loss according to the cross entropy loss and the cross-modal fusion loss; after the student model updated by the previous alternate training is updated according to the student loss, the student model updated by the current alternate training is obtained, and the master model updated by the previous alternate training is subjected to model training based on the student model updated by the current alternate training and the training sample, so that the master model updated by the current alternate training is obtained;
And the iteration module 1710 is configured to, based on the updated student model and the master model that are alternately trained at the time, return to the step of obtaining the updated student model and the master model that are alternately trained at the previous time, continue the alternate training, and obtain the lip language recognition model according to the updated student model when the training is stopped.
In the above processing device 1700 of the lip language recognition model, compared with the traditional way of guiding the student model to learn with a pre-trained teacher model, not only the student model but also the model that guides the student model's learning, called the master model, is trained, so that the whole distillation process is divided into alternating student training stages and master training stages.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of the student training sample. In addition, the video processing network in the master model extracts video-modality knowledge from the student training sample, the audio processing network of the master model extracts audio-modality knowledge, and the audio-visual processing network of the master model extracts the combined audio-visual knowledge of the student training sample; the cross-modal fusion loss obtained by fusing the knowledge of these three modalities enables the student model to learn the master model's ability to mine multi-modal information. Training the student model jointly with the cross entropy loss and the cross-modal fusion loss greatly improves its learning effect. After the student model updated by the current alternate training is obtained, it is used together with the training samples to train, in the master training stage, the master model updated by the previous alternate training, and after many iterations the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
For specific limitations on the processing apparatus of the lip recognition model, reference may be made to the above limitation on the processing method of the lip recognition model, and no further description is given here. The above-mentioned all modules in the processing device of the lip language identification model can be implemented in whole or in part by software, hardware and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in FIG. 18. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The computer program is executed by a processor to implement a method of processing a lip language recognition model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 18 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (15)

1. A method for processing a lip language recognition model, the method comprising:
Acquiring training samples and acquiring a student model and a master model which are updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal; the master model and the student model are trained alternately, which specifically comprises the following steps: in a master training stage, model parameters of the student model are fixed and not updated, and the master model is optimized through label data of the training samples and according to temporary feedback of the student model updated by the previous alternate training; in a student training stage, model parameters of the master model are fixed and not updated, and the student model learns, from the master model updated by the previous alternate training, the capability of extracting cross-modal knowledge from the training samples, and is optimized through the label data of the training samples;
Determining temporary student loss according to results obtained by respectively carrying out lip language recognition on temporary training samples obtained from the training samples according to the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
Determining a student feedback loss according to a result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and tag data of the verification sample, and determining a master recognition loss according to a result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and tag data of the master training sample;
obtaining a current alternate training updated master model according to the student feedback loss and the master recognition loss, and performing model training on the student model updated by the previous alternate training based on the current alternate training updated master model and the training sample to obtain a current alternate training updated student model;
And returning to the step of acquiring the student model and the master model updated by the previous alternate training based on the student model and the master model updated by the current alternate training, continuing the alternate training, and acquiring the lip language recognition model according to the student model updated when the training is stopped.
2. The method of claim 1, wherein the step of the student model performing lip language recognition on the training sample comprises:
Inputting a video frame sequence in the training sample into the student model;
extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model;
obtaining a video output vector according to the video features through a feature mapping layer of the student model;
and obtaining a lip language identification result according to the video output vector through the output layer of the student model.
3. The method of claim 1, wherein the step of the master model performing lip language recognition on the training sample comprises:
inputting the training sample into the master model;
processing a video frame sequence in the training sample through a video processing network in the master model to obtain a first lip language identification result;
processing the audio signals in the training sample through an audio processing network in the master model to obtain a second lip language identification result;
And obtaining an audio-visual combined output vector based on a video output vector obtained by the video processing network according to the video frame sequence and an audio output vector obtained by the audio processing network according to the audio signal through an audio-visual processing network in the master model, and obtaining a third lip language recognition result based on the audio-visual combined output vector.
4. A method according to claim 3, wherein said processing, by the video processing network in the master model, the sequence of video frames in the training samples to obtain a first lip recognition result comprises:
inputting a sequence of video frames in the training samples into a video processing network of the master model;
extracting video features corresponding to the video frame sequences through a feature extraction layer of the video processing network, obtaining video output vectors according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language recognition result according to the video output vectors through an output layer of the video processing network.
5. A method according to claim 3, wherein said processing the audio signal in the training sample through the audio processing network in the master model to obtain a second lip language recognition result comprises:
Inputting the audio signal in the training sample into an audio processing network of the master model;
Extracting audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtaining audio output vectors according to the audio features through a feature mapping layer of the audio processing network, and obtaining second lip language recognition results according to the audio output vectors through an output layer of the audio processing network.
6. A method according to claim 3, wherein when the student model is used for word-level lip recognition, the obtaining, by the audiovisual processing network in the master model, an audiovisual combined output vector based on a video output vector obtained by the video processing network from the sequence of video frames and an audio output vector obtained by the audio processing network from the audio signal, and obtaining, based on the audiovisual combined output vector, a third lip recognition result, comprises:
inputting the video output vector and the audio output vector into an audiovisual processing network of a master model;
And cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
7. A method according to claim 3, wherein when the student model is used for sentence-level lip language recognition, the obtaining, by the audiovisual processing network in the master model, an audiovisual combined output vector based on a video output vector obtained by the video processing network from the sequence of video frames and an audio output vector obtained by the audio processing network from the audio signal, and obtaining, based on the audiovisual combined output vector, a third lip language recognition result, comprises:
Determining a feature vector of a preceding output character;
inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal into an audio-visual processing network of a master model;
Obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector through a multi-head attention coding layer of the audio-visual processing network;
and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language recognition result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
8. The method according to claim 1, wherein determining temporary student loss from results obtained by performing lip language recognition on temporary training samples obtained from the training samples according to the student model and the master model, respectively, comprises:
Performing lip language recognition on the video frame sequence in the temporary training sample through the student model to obtain a temporary student recognition result, and constructing cross entropy loss according to the temporary student recognition result and tag data of the temporary training sample;
constructing a cross-modal fusion loss according to the temporary student identification result, a first lip identification result, a second lip identification result and a third lip identification result which are obtained by performing lip identification on the temporary training sample by the master model;
And determining temporary student loss according to the cross entropy loss and the cross-modal fusion loss.
9. The method of claim 8, wherein constructing a cross-modal fusion loss from the temporary student identification result, the first lip identification result, the second lip identification result, and the third lip identification result obtained by performing lip identification on the temporary training sample by the master model comprises:
through a pre-trained video teaching aid network, after obtaining a video output vector corresponding to a video frame sequence in the temporary training sample, encoding the video output vector into a video preference coefficient;
through a pre-trained audio teaching aid network, after an audio output vector corresponding to the audio signal in the temporary training sample is obtained, encoding the audio output vector into an audio preference coefficient;
Determining first focus loss according to the temporary student identification result and the first lip language identification result, determining second focus loss according to the temporary student identification result and the second lip language identification result, and determining third focus loss according to the temporary student identification result and the third lip language identification result;
And weighting the first focus loss according to the video preference coefficient, weighting the second focus loss according to the audio preference coefficient, and then fusing the second focus loss and the third focus loss to obtain the cross-modal fusion loss.
10. The method of claim 9, wherein said encoding the audio output vector into an audio preference coefficient comprises:
Performing full connection processing on the video output vector through a first full connection layer in the cross-modal fusion network to obtain a video full connection vector;
Performing full connection processing on the audio output vector through a second full connection layer in the cross-modal fusion network to obtain an audio full connection vector;
And connecting the video full-connection vector and the audio full-connection vector in series through a third full-connection layer in the cross-modal fusion network, and then performing full-connection processing to obtain an audio preference coefficient.
11. The method according to claim 1, wherein determining the student feedback loss from the result of lip language recognition of the verification sample obtained from the training sample and the tag data of the verification sample according to the temporary student model comprises:
inputting a sequence of video frames in the verification sample into the student model;
extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model;
obtaining a video output vector according to the video features through a feature mapping layer of the student model;
obtaining a lip language identification result according to the video output vector through an output layer of the student model;
And constructing cross entropy loss according to the lip language identification result and the label data of the verification sample, and using the cross entropy loss as student feedback loss.
12. The method of claim 1, wherein determining the master recognition loss according to the result obtained by the master model performing lip language recognition on the master training sample obtained from the training samples and the tag data of the master training sample comprises:
Inputting the master training sample into the master model to obtain a corresponding first lip language recognition result, second lip language recognition result and third lip language recognition result;
Determining a first cross entropy loss according to the tag data of the master training sample and the first lip language recognition result, determining a second cross entropy loss according to the tag data of the master training sample and the second lip language recognition result, determining a third cross entropy loss according to the tag data of the master training sample and the third lip language recognition result, and fusing the first cross entropy loss, the second cross entropy loss and the third cross entropy loss to obtain the master recognition loss.
13. The method of claim 1, wherein the model training the student model updated for the previous alternate training based on the current alternate training updated master model and the training sample to obtain the student model updated for the current alternate training comprises:
obtaining a student training sample from the training sample;
determining a student loss according to the result obtained by the student model updated by the previous alternate training performing lip language recognition on the student training sample and the result obtained by the master model updated by the current alternate training performing lip language recognition on the student training sample;
And updating the student model updated by the previous alternate training according to the student loss, and obtaining the student model updated by the current alternate training.
14. The method of claim 13, wherein determining the student loss according to the result obtained by the student model updated in the previous alternate training performing lip language recognition on the student training sample and the result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample comprises:
performing lip language recognition on a video frame sequence in the student training sample through the student model updated in the previous alternate training to obtain a student recognition result, and constructing a cross entropy loss according to the student recognition result and label data of the student training sample;
constructing a cross-modal fusion loss according to the student recognition result and a first lip language recognition result, a second lip language recognition result and a third lip language recognition result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample;
and determining the student loss according to the cross entropy loss and the cross-modal fusion loss.
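A hedged sketch of the student loss in claims 13 and 14: the supervised term is a cross entropy on the student's own prediction, and a KL divergence to each of the master model's three outputs stands in for the cross-modal fusion term, whose exact form is not fixed by the claims (the weighting factor alpha is an assumption):

import torch.nn.functional as F

def student_loss(student_logits, labels, master_logits_list, alpha=1.0):
    ce = F.cross_entropy(student_logits, labels)          # cross entropy with the label data
    log_p = F.log_softmax(student_logits, dim=-1)
    fusion = sum(F.kl_div(log_p, F.softmax(t, dim=-1), reduction="batchmean")
                 for t in master_logits_list)             # cross-modal fusion term (assumed KL form)
    return ce + alpha * fusion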
15. The method according to claim 1, wherein the method further comprises:
determining a learning difficulty coefficient corresponding to each training sample in the training samples;
and in the training process of the student model and the master model, sequentially selecting, from the training samples, the student training samples and the master training samples required by the alternate training in ascending order of the learning difficulty coefficient.
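Claim 15 amounts to a curriculum-style ordering of the training samples; a minimal sketch, assuming a precomputed learning difficulty coefficient per sample:

def order_by_difficulty(samples, difficulties):
    # return the samples in ascending order of their learning difficulty coefficient
    order = sorted(range(len(samples)), key=lambda i: difficulties[i])
    return [samples[i] for i in order]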
CN202110703815.6A 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium Active CN113822125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703815.6A CN113822125B (en) 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703815.6A CN113822125B (en) 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113822125A CN113822125A (en) 2021-12-21
CN113822125B true CN113822125B (en) 2024-04-30

Family

ID=78924039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703815.6A Active CN113822125B (en) 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113822125B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115128591B (en) * 2022-06-08 2023-06-20 中国地质环境监测院(自然资源部地质灾害技术指导中心) Debris flow monitoring radar parameter verification method
CN115601575B (en) * 2022-10-25 2023-10-31 扬州市职业大学(扬州开放大学) Method and system for assisting expression of common expressions of aphasia and aphasia writers

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202005017816U1 (en) * 2005-11-10 2006-01-05 Menzel, Bernhard Impression post to be used in preparation of tooth implant, designed in conical shape with recessed areas
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
CN111639744A (en) * 2020-04-15 2020-09-08 北京迈格威科技有限公司 Student model training method and device and electronic equipment

Also Published As

Publication number Publication date
CN113822125A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
Oord et al. Representation learning with contrastive predictive coding
Han et al. Memory-augmented dense predictive coding for video representation learning
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Cisse et al. Houdini: Fooling deep structured visual and speech recognition models with adversarial examples
US9965705B2 (en) Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
US11862145B2 (en) Deep hierarchical fusion for machine intelligence applications
US11663823B2 (en) Dual-modality relation networks for audio-visual event localization
Richard et al. A bag-of-words equivalent recurrent neural network for action recognition
US20230077849A1 (en) Content recognition method and apparatus, computer device, and storage medium
CN111930992A (en) Neural network training method and device and electronic equipment
CN111133453B (en) Artificial neural network
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
Han et al. Cross-modality co-attention networks for visual question answering
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN117056474A (en) Session response method and device, electronic equipment and storage medium
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
CN113822018B (en) Entity relation joint extraction method
Li et al. Dynamic information enhancement for video classification
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Guo et al. Capturing temporal structures for video captioning by spatio-temporal contexts and channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant