CN113822125A - Processing method and device of lip language recognition model, computer equipment and storage medium

Publication number
CN113822125A
Authority
CN
China
Prior art keywords
model
student
training
audio
video
Prior art date
Legal status
Granted
Application number
CN202110703815.6A
Other languages
Chinese (zh)
Other versions
CN113822125B (en)
Inventor
何盛烽
任苏成
孙子荀
邓大付
王巨宏
刘婷婷
Current Assignee
South China University of Technology SCUT
Tencent Technology Shenzhen Co Ltd
Original Assignee
South China University of Technology SCUT
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT and Tencent Technology Shenzhen Co Ltd
Priority to CN202110703815.6A
Publication of CN113822125A
Application granted
Publication of CN113822125B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The application relates to a processing method and apparatus for a lip language recognition model, a computer device, and a storage medium. The method relates to artificial intelligence computer vision technology. The whole distillation process is divided into a student training stage and a master training stage that are trained alternately. In the master training stage, a temporary training sample is used to update again the student model updated in the previous alternate training; the resulting temporary student model feeds back the current learning state to the master model through a verification sample, guiding the master model to adaptively adjust its teaching knowledge according to the current feedback. In addition, the master model is also supervised by a master training sample, and the teaching content is adjusted through a master recognition loss determined from the master training sample. The student model is then trained in the student training stage, and after repeated iterations a lip language recognition model is obtained from the student model. According to this scheme, the teaching content can be flexibly adjusted while the accuracy of the master model's teaching knowledge is improved, thereby improving the knowledge distillation effect.

Description

Processing method and device of lip language recognition model, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a lip language recognition model, a computer device, and a storage medium.
Background
Lip language recognition aims to predict speech content from silent lip videos or face videos. For this visual task, a common approach is to let a student model learn the lip language recognition capability from a trained teacher model by means of knowledge distillation.
Knowledge distillation can transfer knowledge from the teacher model to the student model. However, the teacher model is generally a pre-trained model that is not trained according to the lip language recognition capability the student model currently possesses. Because the needs of the student model are ignored, the teacher model often lacks flexibility in adjusting its teaching knowledge and cannot dynamically adjust the teaching content according to the development of the student model, which affects the knowledge distillation effect.
Disclosure of Invention
In view of the above, it is necessary to provide a method and an apparatus for processing a lip language recognition model, a computer device, and a storage medium, which can improve the effect of guiding a student model to learn lip language recognition.
A method for processing a lip language recognition model, the method comprising:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
determining temporary student loss according to results obtained by performing lip language recognition on temporary training samples obtained from the training samples respectively by the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
determining student feedback loss according to a result obtained by performing lip language recognition on a verification sample obtained from the training sample by the temporary student model and label data of the verification sample, and determining master recognition loss according to a result obtained by performing lip language recognition on a master training sample obtained from the training sample by the master model and label data of the master training sample;
obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain an updated student model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
In one embodiment, the determining the learning difficulty coefficient corresponding to each of the training samples includes:
processing the video frame sequence in each training sample through a pre-trained video teaching aid network to obtain the video confidence of the lip language prediction category of each training sample;
processing the audio signals in the training samples through a pre-trained audio teaching aid network to obtain the audio confidence of the lip language prediction category of each training sample;
and fusing the video confidence coefficient and the audio confidence coefficient to obtain the category confidence coefficient of each training sample, and determining the learning difficulty coefficient corresponding to each training sample according to the category confidence coefficient.
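As an illustration of this confidence-fusion step, the sketch below (Python/PyTorch) averages the two modality confidences of the ground-truth class and maps a low fused confidence to a high difficulty; the averaging and the 1 − confidence mapping are assumptions made for the example, since the exact fusion and mapping functions are defined elsewhere in the embodiments.

```python
import torch

def learning_difficulty(video_probs: torch.Tensor,
                        audio_probs: torch.Tensor,
                        labels: torch.Tensor) -> torch.Tensor:
    """Estimate a per-sample learning difficulty coefficient.

    video_probs, audio_probs: (N, K) class probabilities from the
    pre-trained video / audio teaching-aid networks.
    labels: (N,) ground-truth class indices.
    """
    # Confidence of the ground-truth class under each modality.
    video_conf = video_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    audio_conf = audio_probs.gather(1, labels.unsqueeze(1)).squeeze(1)

    # Fuse the two confidences (assumed here: simple average).
    class_conf = 0.5 * (video_conf + audio_conf)

    # Assumed mapping: low fused confidence -> high difficulty.
    return 1.0 - class_conf
```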
In one embodiment, the method further comprises:
determining the number of target samples required by the current alternate training according to the current iteration times, wherein the number of the target samples is gradually increased along with the iteration times;
and acquiring the training samples with the target sample number to perform the current alternate training.
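A minimal sketch of such a schedule, assuming a simple linear growth of the sample count with the iteration index; the concrete growth curve and constants are not specified here and are purely illustrative.

```python
def target_sample_count(iteration: int,
                        total_samples: int,
                        warmup_iters: int = 100,
                        min_fraction: float = 0.2) -> int:
    """Curriculum-style schedule: the number of training samples used in the
    current alternate training grows with the iteration index until the full
    set is reached. The linear form and the constants are assumptions."""
    fraction = min(1.0, min_fraction + (1.0 - min_fraction) * iteration / warmup_iters)
    return int(fraction * total_samples)
```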
In one embodiment, the method further comprises:
acquiring a video frame sequence to be identified;
inputting the video frame sequence to be recognized into the trained lip language recognition model;
and outputting the speaking content corresponding to the speaker in the video frame sequence to be recognized after processing the video frame sequence to be recognized through a video processing network in the lip language recognition model.
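A usage sketch of this inference path, assuming a word-level model whose video processing network outputs scores over a K-word vocabulary; the function name and tensor shapes are illustrative, not part of the patent text.

```python
import torch

@torch.no_grad()
def recognize_lip_language(model: torch.nn.Module,
                           video_frames: torch.Tensor,
                           vocabulary: list) -> str:
    """Run the trained lip language recognition model on a video frame
    sequence (assumed shape (1, T, H, W) of lip crops) and map the predicted
    class index to a word. Word-level recognition is assumed."""
    model.eval()
    logits = model(video_frames)          # (1, K) scores over the vocabulary
    word_id = logits.argmax(dim=-1).item()
    return vocabulary[word_id]
```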
A device for processing a lip language recognition model, the device comprising:
a sample acquisition module, configured to acquire training samples and acquire a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
a temporary student model obtaining module, configured to determine a temporary student loss according to a result obtained by performing lip language recognition on a temporary training sample obtained from the training sample according to the student model and the master model, and update the student model based on the temporary student loss to obtain a temporary student model;
the master model training module is used for determining student feedback loss according to a result obtained by performing lip language recognition on the verification sample obtained from the training sample by the temporary student model and the label data of the verification sample, and determining master recognition loss according to a result obtained by performing lip language recognition on the master training sample obtained from the training sample by the master model and the label data of the master training sample; obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain an updated student model of the current alternate training;
and the iteration module is used for returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining the lip language recognition model according to the student model updated when the training is stopped.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
determining temporary student loss according to results obtained by performing lip language recognition on temporary training samples obtained from the training samples respectively by the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
determining student feedback loss according to a result obtained by performing lip language recognition on a verification sample obtained from the training sample by the temporary student model and label data of the verification sample, and determining master recognition loss according to a result obtained by performing lip language recognition on a master training sample obtained from the training sample by the master model and label data of the master training sample;
obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain an updated student model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
determining temporary student loss according to results obtained by performing lip language recognition on temporary training samples obtained from the training samples respectively by the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
determining student feedback loss according to a result obtained by performing lip language recognition on a verification sample obtained from the training sample by the temporary student model and label data of the verification sample, and determining master recognition loss according to a result obtained by performing lip language recognition on a master training sample obtained from the training sample by the master model and label data of the master training sample;
obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain an updated student model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
A computer program comprising computer instructions stored in a computer-readable storage medium, the computer instructions being read from the computer-readable storage medium by a processor of a computer device, the processor executing the computer instructions to cause the computer device to perform the steps of the processing method of the above-mentioned lip language recognition model.
Compared with the traditional approach of guiding the learning of the student model with a pre-trained teacher model, the processing method, apparatus, computer device and storage medium of the lip language recognition model train not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into a student training stage and a master training stage that are trained alternately.
Specifically, in the master training stage, the student model updated in the previous alternate training is updated again by using the temporary training sample to obtain a temporary student model, and the temporary student model is continuously updated as an auxiliary model. The temporary student model feeds back the current learning state to the master model through the verification sample; that is, through the student feedback loss, the master model is guided to adaptively adjust its teaching knowledge according to the feedback of the current lip language recognition task. In addition, the master model is also supervised by a master training sample, and the teaching content is adjusted through the master recognition loss determined from the master training sample. That is to say, the supervision information in the training process of the master model comprises two parts: one part is the student feedback loss reflecting the current learning state of the student model, and the other part is the master recognition loss reflecting the current teaching ability of the master model. Adjusting the master model updated in the previous alternate training according to these two losses makes it possible to flexibly and dynamically adjust the teaching content while improving the accuracy of the master model's teaching knowledge, thereby improving the whole knowledge distillation effect. Therefore, after the master model updated in the current alternate training is obtained, the master model updated in the current alternate training and the training samples can be used to perform model training on the student model updated in the previous alternate training in the student training stage, and after repeated iterations, the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
A method for processing a lip language recognition model, the method comprising:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
constructing a trans-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by a video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by an audio processing network in the master model, and a third lip language recognition result obtained by an audio-visual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross modal fusion loss;
updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
A device for processing a lip language recognition model, the device comprising:
a sample acquisition module, configured to acquire training samples and acquire a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
the label loss construction module is used for carrying out lip language identification on the video frame sequence in the student training sample obtained from the training sample according to the student model to obtain a student identification result, and constructing cross entropy loss according to the student identification result and label data of the student training sample;
a trans-modal fusion loss construction module, configured to construct a trans-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by using a video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by using an audio processing network in the master model, and a third lip language recognition result obtained by an audio-visual processing network in the master model based on the video frame sequence and the audio signal;
the student model updating module is used for determining student loss according to the cross entropy loss and the cross-modal fusion loss; updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training;
and the iteration module is used for returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining the lip language recognition model according to the student model updated when the training is stopped.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
constructing a trans-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by a video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by an audio processing network in the master model, and a third lip language recognition result obtained by an audio-visual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross modal fusion loss;
updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring training samples, and acquiring a student model and a master model updated in the previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
performing lip language recognition on a video frame sequence in a student training sample obtained from the training sample according to the student model to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and label data of the student training sample;
constructing a trans-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by a video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by an audio processing network in the master model, and a third lip language recognition result obtained by an audio-visual processing network in the master model based on the video frame sequence and the audio signal;
determining student loss according to the cross entropy loss and the cross modal fusion loss;
updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
A computer program comprising computer instructions stored in a computer-readable storage medium, the computer instructions being read from the computer-readable storage medium by a processor of a computer device, the processor executing the computer instructions to cause the computer device to perform the steps of the processing method of the above-mentioned lip language recognition model.
Compared with the traditional approach of guiding the learning of the student model with a pre-trained teacher model, the processing method, apparatus, computer device and storage medium of the lip language recognition model train not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into a student training stage and a master training stage that are trained alternately.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of a student training sample. In addition, the video processing network in the master model extracts video-modality knowledge from the student training sample, the audio processing network of the master model extracts audio-modality knowledge from the student training sample, and the audio-visual processing network of the master model extracts combined audio-visual knowledge of the student training sample. The cross-modal fusion loss obtained by fusing the knowledge of these three different modalities enables the student model to learn from the master model the ability to mine multi-modal information. Guiding the training of the student model jointly by the cross entropy loss and the cross-modal fusion loss can greatly improve the learning effect of the student model. After the student model updated in the current alternate training is obtained, the updated student model and the training samples can be used to perform model training on the master model updated in the previous alternate training in the master training stage, and after repeated iterations, the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
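A minimal sketch of this student-stage objective, assuming the three master outputs are fused by averaging their softened distributions and compared to the student output with a KL term; the exact fusion function, temperature and weighting are assumptions made for illustration only.

```python
import torch.nn.functional as F

def student_loss(student_logits, video_logits, audio_logits, av_logits,
                 labels, temperature: float = 1.0, alpha: float = 0.5):
    """Student-stage objective: cross entropy against the labels plus a
    cross-modal fusion loss distilled from the master model's video, audio
    and audiovisual outputs. Averaging the three master distributions and
    using a KL term are illustrative assumptions."""
    ce = F.cross_entropy(student_logits, labels)

    # Fuse the three modality-specific teacher distributions (assumed: mean).
    fused = (F.softmax(video_logits / temperature, dim=-1)
             + F.softmax(audio_logits / temperature, dim=-1)
             + F.softmax(av_logits / temperature, dim=-1)) / 3.0

    fusion_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                           fused, reduction="batchmean") * temperature ** 2

    return ce + alpha * fusion_loss
```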
Drawings
FIG. 1 is a diagram showing an application environment of a processing method of a lip language recognition model in one embodiment;
FIG. 2 is a flowchart illustrating a processing method of a lip language identification model according to an embodiment;
FIG. 3 is a diagram of a model framework for training a master model during a master training phase in one embodiment;
FIG. 4 is a diagram illustrating a network architecture of video streams in one embodiment;
FIG. 5 is a diagram illustrating a network structure of audio streams in one embodiment;
FIG. 6 is a diagram illustrating a network architecture for a combination of video and audio streams in a sentence-level lip language recognition scenario, according to an embodiment;
FIG. 7 is a schematic flow chart illustrating lip language recognition of a training sample by a teacher model in one embodiment;
FIG. 8 is a schematic flow chart of obtaining a student model updated by current alternate training in one embodiment;
FIG. 9 is a schematic flow chart illustrating the determination of student loss in one embodiment;
FIG. 10 is a schematic flow diagram illustrating the construction of a cross-modality fusion penalty in one embodiment;
FIG. 11 is a model framework diagram illustrating the training of a student model during a student training phase in one embodiment;
FIG. 12 is a schematic flow chart illustrating the determination of temporary student loss in one embodiment;
FIG. 13 is a diagram of a network architecture for training a teacher model and a student model alternately in an exemplary embodiment;
FIG. 14 is a flowchart illustrating a method for processing the lip language identification model in accordance with an exemplary embodiment;
FIG. 15 is a flowchart illustrating a processing method of a lip language identification model according to another embodiment;
FIG. 16 is a block diagram showing a configuration of a processing device of the lip language identification model in one embodiment;
FIG. 17 is a block diagram showing a configuration of a processing means of a lip language identification model in another embodiment;
FIG. 18 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The processing method of the lip language recognition model provided by the application realizes the training of the lip language recognition model and also realizes the lip language recognition by using a computer vision technology and a machine learning technology in an Artificial Intelligence (AI) technology.
Computer Vision (CV) technology is a science that studies how to make a machine "see"; it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as identifying, tracking and measuring targets, and to further process images so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition. It can be understood that the lip language recognition performed on the video frame sequence to be processed in the present application belongs to the video semantic understanding technology within computer vision.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or realize human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. The artificial neural network is an important machine learning technique with broad application prospects in fields such as system identification, pattern recognition and intelligent control. It is to be appreciated that the present application trains and uses the lip language recognition model by means of machine learning techniques. The video frame sequence including the face or lips in this application can be stored on a blockchain network to prevent theft.
The processing method of the lip language recognition model provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may obtain training samples and obtain a student model and a master model updated in the previous alternate training, where each training sample includes a sequence of video frames and a corresponding audio signal; determine a temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on temporary training samples obtained from the training samples, and update the student model based on the temporary student loss to obtain a temporary student model; determine a student feedback loss according to the result obtained by the temporary student model performing lip language recognition on a verification sample obtained from the training samples and the label data of the verification sample, and determine a master recognition loss according to the result obtained by the master model performing lip language recognition on a master training sample obtained from the training samples and the label data of the master training sample; obtain the master model updated in the current alternate training according to the student feedback loss and the master recognition loss, and perform model training on the student model updated in the previous alternate training based on the master model updated in the current alternate training and the training samples to obtain the student model updated in the current alternate training; and return, based on the student model and the master model updated in the current alternate training, to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training, and obtain the lip language recognition model according to the student model updated when the training is stopped.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for processing a lip language recognition model is provided, which is described by taking the method as an example applied to a computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
Step 202, training samples are obtained, and a student model and a master model updated in the previous alternate training are obtained, wherein each training sample comprises a video frame sequence and a corresponding audio signal.
Lip language recognition refers to the process of recognizing the content of a speaker's speech from a silent lip video or face video. In the related art, the ability to perform lip language recognition on silent video is generally learned, by means of knowledge distillation, from a teacher model pre-trained on audio signals. In that case, the student model needs to learn knowledge of another modality from the pre-trained teacher model: the knowledge crosses modalities from audio to video, and the potential modality difference between the data of the two modalities can prevent the student model from learning accurate video knowledge, which affects the lip language recognition effect of the student model. For this reason, in the embodiment of the present application, each training sample includes a sequence of video frames and a corresponding audio signal, so that the master model can understand knowledge of the video modality, knowledge of the audio modality, and the combined audiovisual knowledge obtained by combining the video and audio modalities, so as to compensate for the inherent modality differences in cross-modal knowledge, and the student model can thus learn cross-modal knowledge from the master model.
In the embodiment of the present application, each training sample includes a sequence of video frames and an audio signal. The audio signal is denoted X_A and the video frame sequence is denoted X_V. The speech content of the audio signal corresponds to the lip language content of the sequence of video frames; for example, a certain training sample corresponds to the word "i". The audio signal may be an original waveform in the time domain, and the video frame sequence may be obtained by sampling the original video signal at a preset sampling rate, which may be, for example, 25 fps. The computer device may also align the audio signal with the sequence of video frames, for example, each audio signal having a length of 1.16 seconds and the corresponding sequence of video frames having a length of 29 frames.
Each training sample further comprises label data corresponding to the training sample, and the label data represents the lip language content corresponding to each training sample. The lip language recognition can be divided into two application scenes, one is word-level lip language recognition, the other is sentence-level lip language recognition, and when the sentence-level lip language recognition is carried out, each word is predicted in sequence and then connected to obtain a predicted sentence.
In the word-level lip language recognition scenario, each word U ∈ R^K can be represented by a one-hot vector of length K, where K represents the vocabulary size, which may be, for example, 500. The computer device may construct training samples for word-level lip language recognition using a word-level data set.
In the sentence-level lip language recognition scenario, the tag data of each character Z_q ∈ {R^K | q = 1, 2, ..., Q} in the sentence can be represented by a one-hot vector, where Q represents the length of the sentence and Z_q represents the q-th character in the sentence. For example, the number of characters may be set to 40, including 26 letters, 10 numbers and 4 special marks (space bar, keyboard, EOS and punctuation mark), and the label data corresponding to each sentence is then a Q × 40 vector matrix. For example, "we" corresponds to "wo men", and its label data is a vector matrix formed by the one-hot vectors of 5 letters and 1 space. The computer device may construct training samples for sentence-level lip language recognition using a sentence-level data set.
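As a concrete illustration of such label data, the sketch below builds a Q × 40 one-hot matrix from a sentence string; the exact character inventory and ordering (in particular the four special marks) are assumptions made for the example.

```python
import numpy as np

# Assumed 40-character inventory: 26 letters, 10 digits, 4 special marks.
CHARSET = list("abcdefghijklmnopqrstuvwxyz") + list("0123456789") + [" ", "<pad>", "<eos>", "."]
CHAR_TO_ID = {c: i for i, c in enumerate(CHARSET)}

def sentence_to_onehot(sentence: str) -> np.ndarray:
    """Encode a sentence as a Q x 40 matrix of one-hot character vectors."""
    onehot = np.zeros((len(sentence), len(CHARSET)), dtype=np.float32)
    for q, ch in enumerate(sentence):
        onehot[q, CHAR_TO_ID[ch]] = 1.0
    return onehot

labels = sentence_to_onehot("wo men")   # 5 letters + 1 space -> a 6 x 40 matrix
```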
In one embodiment, a computer device obtains an original video, determines a lip region by detecting a face region in the original video, and clips the original video centered on the lip region to obtain a sequence of video frames. In addition, the computer equipment can also carry out random rotation and scaling treatment on the cut lip region, so that richer training samples are obtained.
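A sketch of this preprocessing, assuming a lip bounding box has already been obtained from a face/landmark detector and using OpenCV for the crop and the random rotation/scaling augmentation; crop size and augmentation ranges are illustrative.

```python
import random
import cv2
import numpy as np

def crop_lip_region(frame: np.ndarray, lip_box: tuple, size: int = 96) -> np.ndarray:
    """Crop a frame around the detected lip bounding box (x, y, w, h) and
    resize it to a fixed resolution; the detection step is assumed to have
    happened upstream."""
    x, y, w, h = lip_box
    cx, cy = x + w // 2, y + h // 2
    half = max(w, h) // 2
    crop = frame[max(0, cy - half):cy + half, max(0, cx - half):cx + half]
    return cv2.resize(crop, (size, size))

def augment_lip_crop(crop: np.ndarray) -> np.ndarray:
    """Random rotation and scaling of the cropped lip region (ranges assumed)."""
    angle = random.uniform(-10, 10)
    scale = random.uniform(0.9, 1.1)
    h, w = crop.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    return cv2.warpAffine(crop, matrix, (w, h))
```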
The purpose of knowledge distillation is to transfer knowledge from a Teacher model (Teacher) to a Student model (Student). In the related technology of lip language recognition, the student model mostly extracts knowledge from a pre-trained teacher model to learn lip language recognition; however, because the teacher model is pre-trained, its teaching content cannot be flexibly and dynamically adjusted according to the current learning state of the student model. For this reason, the embodiment of the present application does not use a pre-trained teacher model, but designs a trainable network capable of dynamically adjusting the teaching content, which is called the Master model (Master). In the training process, the master model and the student model are trained alternately. In the master training stage, the model parameters of the student model are fixed and not updated; the master model is optimized under the supervision of the label data of the training samples and also receives feedback from the temporary student model derived from the student model updated in the previous alternate training. In the student training stage, the model parameters of the master model are fixed and not updated; the student model learns from the master model updated in the previous alternate training the ability to extract cross-modal knowledge from the training samples, and is optimized under the supervision of the label data of the training samples.
Specifically, when the current alternate training is performed, the computer device obtains the student model and the master model updated in the previous alternate training, and continues the current alternate training on the basis of these models. For example, in the current alternate training, in the master training stage, the computer device acquires 10 mini-batches of training samples with 30 samples per batch, and the master model updated in the current alternate training is obtained when the 10 iterations finish. Similarly, in the student training stage, the computer device acquires another 10 batches of training samples to iterate 10 times on the student model, and the student model updated in the current alternate training is obtained at the end of the 10th iteration. And so on, the alternate training continues. It should be noted that, because the training is performed alternately, the training order of the master model and the student model is not limited.
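A skeleton of this alternating schedule in Python, where the master-stage and student-stage updates are passed in as callables; the batch counts mirror the example above, and the step callables stand for the loss computations described later in this document.

```python
from typing import Callable, Iterator

def alternate_training(student, master,
                       batches: Iterator,
                       train_master_step: Callable,
                       train_student_step: Callable,
                       num_rounds: int,
                       batches_per_phase: int = 10):
    """Alternating schedule: in each round the master model is updated over a
    number of mini-batches while the student is frozen, then the student model
    is updated while the master is frozen."""
    for _ in range(num_rounds):
        # Master training stage: student parameters are fixed.
        for _ in range(batches_per_phase):
            train_master_step(master, student, next(batches))   # e.g. 30 samples per batch

        # Student training stage: master parameters are fixed.
        for _ in range(batches_per_phase):
            train_student_step(student, master, next(batches))

    return student, master
```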
It can be understood that "previous" and "current" are relative concepts: after the current round of training is performed by using the student model and the master model updated in the previous alternate training to obtain the student model and the master model updated in the current alternate training, the models updated in the current alternate training become the new "student model and master model updated in the previous alternate training", and the next alternate training becomes the new current alternate training.
And 204, respectively carrying out lip language recognition on the temporary training samples obtained from the training samples according to the student model and the master model to obtain results, determining temporary student loss, and updating the student model based on the temporary student loss to obtain the temporary student model.
In the related technology, the teacher model is usually pre-trained, rather than trained according to the lip language recognition capability the student model currently possesses; the learning requirement of the student model is ignored, and the teacher model often lacks flexibility in adjusting its teaching knowledge. To this end, in the master training stage, the computer device temporarily updates the student model updated in the previous alternate training by using one or more temporary training samples to obtain a Temporary Student model (Temporary Student), and the lip language recognition capability of the temporary student model can be used to feed the current learning state of the student model back to the master model.
Specifically, the computer device may obtain a temporary training sample from the training sample, predict the temporary training sample through the previous alternate training of the updated student model and the updated master model, respectively, to obtain respective prediction results, when the student model is updated to obtain the temporary student model, the master model is not updated, and the label data of the temporary training sample and the prediction result of the master model are used as the basis for updating the student model. It can be understood that the obtained temporary student model is updated according to the student model updated in the previous alternate training, so that the temporary student model is continuously updated in the master training stage of each alternate training.
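A sketch of this temporary update, assuming the temporary student is a deep copy of the current student updated by a single gradient step on a loss that combines the label cross entropy with a distillation term against the fixed master prediction; the single step, the optimizer choice and the equal weighting are all assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def build_temporary_student(student, master, temp_batch, lr: float = 1e-3):
    """Clone the current student and update the clone once with the temporary
    student loss; the master model is not updated here."""
    temp_student = copy.deepcopy(student)
    optimizer = torch.optim.SGD(temp_student.parameters(), lr=lr)

    video, audio, labels = temp_batch
    with torch.no_grad():
        master_probs = F.softmax(master(video, audio), dim=-1)  # master stays fixed

    student_logits = temp_student(video)
    # Label data of the temporary sample + master prediction supervise the update.
    loss = F.cross_entropy(student_logits, labels) + \
           F.kl_div(F.log_softmax(student_logits, dim=-1), master_probs,
                    reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return temp_student
```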
And step 206, determining the feedback loss of the student according to the result obtained by performing lip language recognition on the verification sample obtained from the training sample by the temporary student model and the label data of the verification sample, and determining the recognition loss of the master according to the result obtained by performing lip language recognition on the master training sample obtained from the training sample by the master model and the label data of the master training sample.
The verification sample is a sample used for verifying the current lip language recognition capability of the student model, and the learning state of the current student model can be determined from the student feedback loss constructed from the lip language recognition result of the temporary student model on the verification sample and the label data of the verification sample. In this way, when the master model is optimized based on this student feedback loss, it receives feedback from the student model, so that during optimization the master model can flexibly adjust the teaching content and improve its ability to transfer knowledge to the student model.
Specifically, the computer device can obtain a verification sample from the training sample, perform lip language identification on the verification sample through the temporary student model to obtain a prediction result, and construct cross entropy loss according to the prediction result and label data of the verification sample to serve as student feedback loss. In addition, in order to improve the lip language recognition performance of the student model, the master model needs to extract more comprehensive teaching knowledge, and the student model can learn more comprehensive knowledge from the master model. For this purpose, the computer device further acquires master training samples from the training samples, and constructs master recognition loss of the master model by performing lip language recognition on the master training samples through the master model updated by the previous alternate training and the label data of the master training samples.
That is to say, the supervision information in the master model training process includes two parts: one part is the student feedback loss that reflects the current learning state of the student model, and the other part is the master recognition loss that reflects the current teaching ability of the master model. Adjusting the master model updated in the previous alternate training according to these two losses makes it possible to flexibly and dynamically adjust the teaching content while improving the accuracy of the master model's teaching knowledge, thereby improving the effect of the whole knowledge distillation.
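A sketch of how these two supervision terms could be combined for one master update (equal weighting assumed). Note that in the full scheme the student feedback loss reaches the master model through the differentiable temporary student update; the sketch below only illustrates the composition of the two losses.

```python
import torch.nn.functional as F

def master_training_step(master, temp_student, val_batch, master_batch,
                         master_optimizer, feedback_weight: float = 1.0):
    """One master-stage update combining the two supervision signals; the
    1:1 weighting is an illustrative assumption."""
    val_video, _, val_labels = val_batch
    m_video, m_audio, m_labels = master_batch

    # Student feedback loss: the temporary student's result on the
    # verification sample reflects the current learning state.
    feedback_loss = F.cross_entropy(temp_student(val_video), val_labels)

    # Master recognition loss: the master's own prediction on the master
    # training sample, supervised by its label data.
    recognition_loss = F.cross_entropy(master(m_video, m_audio), m_labels)

    loss = feedback_weight * feedback_loss + recognition_loss
    master_optimizer.zero_grad()
    loss.backward()
    master_optimizer.step()
    return loss.detach()
```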
In some embodiments, the validation sample used to validate the learning effect of the current student model may be the same training sample as the master training sample used to improve master model refined teaching knowledge. In some embodiments, since the verification sample is used for verifying the lip language recognition capability of the current student model, the verification sample may be a training sample in a verification set, and the master training sample is a training sample in the training set, that is, the verification sample and the master training sample use different training samples.
And step 208, obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the master model updated by the current alternate training and the training samples to obtain the student model updated by the current alternate training.
Specifically, in the master training stage, the computer device performs gradient back propagation through the student feedback loss and the master identification loss to update the model parameters of the master model, and after the master model updated in the current alternate training is obtained, continues in the student training stage, performs model training on the student model updated in the previous alternate training based on the master model updated in the current alternate training and the training samples, and obtains the student model updated in the current alternate training.
And step 210, returning to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training based on the student model and the master model updated in the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
Specifically, the computer device performs the alternating training of the master model and the student model according to the previous steps; this is called one iteration of the alternate training. Following these steps, the computer device can iterate multiple times, returning to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training, until an iteration stop condition is met, and then obtains the lip language recognition model from the updated student model.
FIG. 3 is a schematic diagram of a model framework for training the master model in the master training stage of the alternate training in one embodiment. Referring to fig. 3, after the student model and the master model updated in the previous alternate training are obtained, the video frame sequence of the temporary training sample is input into the student model, and both the video frame sequence and the audio signal of the temporary training sample are input into the master model; a temporary student loss is constructed by using the output results of the student model and the master model, and the temporary student model is obtained after the student model is updated according to the temporary student loss. Then, the video frame sequence in the verification sample is input into the temporary student model, and the student feedback loss is constructed according to the output result of the temporary student model; both the video frame sequence and the audio signal in the master training sample are input into the master model, and the master recognition loss is constructed according to the output result of the master model. The model parameters of the master model are updated based on the student feedback loss and the master recognition loss.
Compared with the traditional mode of guiding the learning of the student model by using the pre-training teacher model, the processing method of the lip language recognition model not only trains the student model, but also trains the model for guiding the learning of the student model, and the model is called as an master model, so that the whole distillation process is divided into a student training stage and a master training stage of alternate training. Specifically, in the master training stage, the student model updated in the previous alternate training is updated again by using the temporary training sample to obtain a temporary student model, and the temporary student model is continuously updated as an auxiliary model. The temporary student model feeds back the current learning state to the master model through the verification sample, namely the master model is guided to adaptively adjust teaching knowledge according to the feedback of the current lip language recognition task through the student feedback loss; in addition, the master model is also supervised by a master training sample, and teaching contents are adjusted through master recognition loss determined by the master training sample. After the updated master model of the current alternate training is obtained, the master model and the training sample updated by the current alternate training can be used for carrying out model training on the student model updated by the previous alternate training in the student training stage, and after repeated iteration is carried out for multiple times, the recognition performance of the lip language recognition model obtained according to the student model is greatly improved.
In one embodiment, the student model needs to learn, through model training, the capability of lip language recognition on silent video, so the student model is a video stream-based model whose input is the video frame sequence of a training sample and whose output is a lip language recognition result. In order to improve the lip language recognition performance of the student model, the master model needs to extract more comprehensive teaching knowledge, so that the student model can learn more comprehensive knowledge from the master model. To this end, the master model includes an audio stream, a video stream, and an audiovisual combined stream (a combination of the video stream and the audio stream).
Where the audio stream is an audio processing network that generates a prediction result based on an audio signal, the video stream is a video processing network that generates a prediction result based on a video signal, and the audiovisual combined stream is intended to combine the audio signal with the video signal to generate a prediction result. The audio stream and the video stream respectively comprise a feature extraction layer at the front end, a feature mapping layer at the rear end and an output layer for classification. The combination of the video stream and the audio stream includes, in addition to the audio stream and the video stream, a vector cascade layer and an output layer for classification, wherein the vector cascade layer is configured to obtain an audiovisual combination output vector according to output vectors generated by a back end of the audio stream and a back end of the video stream.
In one embodiment, the feature extraction layer at the front end of the audio stream may use ResNet-18, and, since the audio signal lies in a one-dimensional space, the computer device may replace all the two-dimensional convolution kernels at the front end of the audio stream with one-dimensional convolutions and set the convolution kernel size of the first convolution layer according to the sampling rate of the audio signal. The feature mapping layer at the back end of the audio stream may use temporal convolution or Transformer sequence-to-sequence (TM-Seq2Seq) in a word-level lip language recognition scenario, and may use TM-Seq2Seq in a sentence-level lip language recognition scenario.
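An illustrative first-layer setup for such an audio front end, assuming a 16 kHz waveform and a first kernel spanning roughly 5 ms of samples; both values are assumptions, since the text only states that the kernel size follows the sampling rate.

```python
import torch.nn as nn

class AudioFrontEnd(nn.Module):
    """Front end of the audio stream: 1-D convolutions over the raw waveform.
    Kernel size derived from the sampling rate (assumed ~5 ms at 16 kHz)."""
    def __init__(self, sample_rate: int = 16000):
        super().__init__()
        first_kernel = sample_rate // 200          # ~5 ms worth of samples
        self.conv1 = nn.Conv1d(1, 64, kernel_size=first_kernel,
                               stride=first_kernel // 2, padding=first_kernel // 2)
        self.bn1 = nn.BatchNorm1d(64)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, waveform):                   # (N, 1, num_samples)
        return self.relu(self.bn1(self.conv1(waveform)))
```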
In one embodiment, the feature extraction layer of the video stream front end may use ResNet-18, and, since the video signal is an image signal that also includes a time dimension, the computer device may replace the first convolution layer of the video stream front end with a three-dimensional convolution. The feature mapping layer at the back end of the video stream may use temporal convolution or TM-Seq2Seq (including multi-head attention and feed-forward networks) in the word-level lip language recognition scenario, and may use TM-Seq2Seq in the sentence-level lip language recognition scenario.
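As an illustrative sketch only (not the claimed implementation), the front-end adaptations described above can be written in PyTorch; the channel sizes, the 16 kHz sampling rate and the kernel widths below are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical audio front end: all convolutions are one-dimensional because the
# raw audio signal lies in a 1-D space; the first kernel width is tied to an
# assumed 16 kHz sampling rate so that one kernel spans a few milliseconds.
audio_frontend = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=80, stride=4),    # roughly 5 ms receptive field at 16 kHz (assumed)
    nn.BatchNorm1d(64),
    nn.ReLU(inplace=True),
    nn.Conv1d(64, 128, kernel_size=3, padding=1),  # stand-in for the remaining ResNet-18 stages
    nn.AdaptiveAvgPool1d(1),
)

# Hypothetical video front end: the first layer is a 3-D convolution so that the
# time dimension of the frame sequence is modelled, followed by further stages
# (represented here by a single placeholder block).
video_frontend = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
)

audio = torch.randn(2, 1, 16000)           # a batch of 1-second audio clips (assumed)
frames = torch.randn(2, 3, 29, 112, 112)   # a batch of 29-frame mouth crops (assumed)
print(audio_frontend(audio).shape)   # torch.Size([2, 128, 1])
print(video_frontend(frames).shape)  # torch.Size([2, 64, 29, 56, 56])
```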
In one embodiment, a combination of a video stream and an audio stream is used to obtain a prediction of a merging feature derived from the audio stream and the video stream. A vector cascade layer in the combination of the video stream and the audio stream directly connects output vectors respectively generated at the rear ends of the audio stream and the video stream into a new vector in a word-level lip language recognition scene; in a sentence-level lip language identification scene, through the attention of context information to audio output vectors and video output vectors, the video coding vectors and the audio coding vectors are respectively obtained and then are connected into new audio-visual combined output vectors.
Fig. 4 is a schematic diagram of a network structure of a video stream in one embodiment. Referring to fig. 4, an input is a video frame sequence, video features are obtained through a feature extraction layer at the front end, and then a video output vector is obtained by using a back end based on TM-Seq2 Seq.
Fig. 5 is a schematic diagram of a network structure of the audio stream in one embodiment. Referring to fig. 5, the input is an audio signal, audio features are obtained through the feature extraction layer (one-dimensional convolution) at the front end, and an audio output vector is then obtained by the TM-Seq2Seq-based back end.
Fig. 6 is a schematic diagram of a network structure of a combination of a video stream and an audio stream in a sentence-level lip language recognition scene in an embodiment. Referring to fig. 6, the network structure includes, in addition to the audio stream and the video stream shown in fig. 4 and 5, an audiovisual processing network, which includes a multi-head attention coding layer and a cascade layer for obtaining an audiovisual combination output vector according to the attention of the context to the currently output character, and an output layer for obtaining a lip language recognition result according to the audiovisual combination output vector.
In one embodiment, the step of performing lip language recognition on the training sample by the student model comprises: inputting a video frame sequence in a training sample into a student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to the video characteristics through a characteristic mapping layer of the student model; and obtaining a lip language recognition result according to the video output vector through an output layer of the student model.
It was mentioned before that the student model is a model based on a video stream, i.e. a model based on a video processing network. Referring to the network structure of fig. 4, the video stream includes a feature extraction layer, a feature mapping layer, and an output layer for classification. When the computer device needs to perform lip language recognition on the training sample through the student model, the video frame sequence in the training sample is input into the student model to obtain a corresponding lip language recognition result. In the word-level lip language recognition scenario, the lip language recognition result output by the video-stream-based student model is a K-dimensional vector, where K represents the vocabulary size and each element of the K-dimensional vector represents the probability that the lip language content of the video frame sequence is the corresponding word in the vocabulary. In the sentence-level lip language recognition scenario, the lip language recognition result output by the video-stream-based student model is a matrix formed by the probability vectors of the characters in the sentence.
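A minimal sketch of such a video-stream student model is given below, assuming a simplified front end in place of ResNet-18, a temporal-convolution feature mapping layer and a softmax output layer over an assumed 500-word vocabulary; it only illustrates the feature extraction layer, feature mapping layer and output layer pipeline described above:

```python
import torch
import torch.nn as nn

class StudentVideoStream(nn.Module):
    """Minimal sketch of the video-stream student model (all sizes are assumptions)."""
    def __init__(self, vocab_size: int, feat_dim: int = 256):
        super().__init__()
        # Feature extraction layer: per-frame features (a real system would use ResNet-18).
        self.frontend = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep the time axis, pool away space
        )
        # Feature mapping layer: temporal convolution over the frame axis.
        self.backend = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        # Output layer: probabilities over the K-word vocabulary.
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, T, H, W)
        x = self.frontend(frames)                  # (batch, C, T, 1, 1)
        x = x.flatten(3).squeeze(-1)               # (batch, C, T)
        x = self.backend(x).squeeze(-1)            # (batch, C), the "video output vector"
        return self.classifier(x).softmax(dim=-1)  # K-dimensional probability vector

student = StudentVideoStream(vocab_size=500)
probs = student(torch.randn(2, 3, 29, 112, 112))
print(probs.shape)  # torch.Size([2, 500])
```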
In one embodiment, as shown in fig. 7, the step of lip language recognition of the training sample by the master model includes:
step 702, inputting the training sample into the master model.
As mentioned above, the master model is a model based on a combination of a video stream and an audio stream, and as shown in fig. 6, the combination of the video stream and the audio stream includes a vector cascade layer and an output layer for classification in addition to the audio stream and the video stream described above. In this embodiment, the master model includes a video processing network based on a video stream, an audio processing network based on an audio stream, and an audio-visual processing network. When the computer equipment needs to perform lip language recognition on the training sample through the master model, the video frame sequence and the audio signal in the training sample are input into the master model.
Step 704, processing the video frame sequence in the training sample through the video processing network in the master model to obtain a first lip language recognition result.
The video processing network in the master model is a network structure based on a video stream. The computer device inputs the video frame sequence in the training sample into the video processing network to obtain a first lip language recognition result. The first lip language recognition result is a recognition result obtained based on the video information of the training sample.
In one embodiment, processing a sequence of video frames in a training sample through a video processing network in a master model to obtain a first lip language recognition result includes: inputting a video frame sequence in a training sample into a video processing network of a master model; the method comprises the steps of extracting video features corresponding to a video frame sequence through a feature extraction layer of a video processing network, obtaining a video output vector according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language recognition result according to the video output vector through an output layer of the video processing network.
In particular, the video processing network is based on a model of a video stream, which includes a feature extraction layer, a feature mapping layer, and an output layer for classification, referring to the network structure of fig. 4. The computer equipment inputs the video frame sequence in the training sample into the video processing network, and the first lip language recognition result is obtained through the processing of the feature extraction layer, the feature mapping layer and the output layer of the video processing network in sequence.
Step 706, processing the audio signal in the training sample through the audio processing network in the master model to obtain a second lip language recognition result.
The audio processing network in the master model is a network structure based on an audio stream. The computer device inputs the audio signal in the training sample into the audio processing network to obtain a second lip language recognition result. The second lip language recognition result is a recognition result obtained based on the audio information of the training sample.
In one embodiment, step 706 includes: inputting the audio signals in the training samples into an audio processing network of a master model; and extracting audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtaining an audio output vector according to the audio features through a feature mapping layer of the audio processing network, and obtaining a second lip language recognition result according to the audio output vector through an output layer of the audio processing network.
Specifically, the audio processing network is based on a model of an audio stream, and referring to the network structure of fig. 5, the audio stream includes a feature extraction layer, a feature mapping layer, and an output layer for classification. And the computer equipment inputs the audio signals in the training samples into the audio processing network, and the second lip language recognition result is obtained through the processing of the feature extraction layer, the feature mapping layer and the output layer of the audio processing network in sequence.
Step 708, obtaining a combined audio-visual output vector based on the audio output vector obtained by the audio processing network according to the audio signal and the video output vector obtained by the video processing network according to the video frame sequence through the audio-visual processing network in the master model, and obtaining a third lip language recognition result based on the combined audio-visual output vector.
Wherein, the audio-visual processing network in the master model is used for obtaining a derived audio-visual combined output vector based on the video output vector and the audio output vector. And the computer equipment obtains an audio-visual combined output vector according to the video output vector output by the video processing network and the audio output vector output by the audio processing network, and obtains a third lip language recognition result according to the audio-visual combined output vector. The audio-visual combined output vector is derived from the video output vector and the audio output vector, and can reflect the characteristics of potential cross-modal knowledge between the video modality and the audio modality.
In one embodiment, when the student model is used for word-level lip language recognition, step 708 comprises: inputting the video output vector and the audio output vector into an audio-visual processing network of a master model; and cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
Specifically, the audio-visual processing network comprises a cascade layer and an output layer, in a word-level lip language recognition scene, the computer device cascades a video output vector and an audio output vector through the cascade layer to obtain an audio-visual combined output vector, and then obtains a third lip language recognition result through the output layer for classification.
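The word-level audiovisual head can be sketched as follows, assuming 256-dimensional video and audio output vectors and a 500-word vocabulary; the cascade layer is a simple concatenation followed by a classification layer:

```python
import torch
import torch.nn as nn

class WordLevelAVHead(nn.Module):
    """Sketch of the word-level audiovisual head: concatenate, then classify (sizes assumed)."""
    def __init__(self, video_dim=256, audio_dim=256, vocab_size=500):
        super().__init__()
        self.classifier = nn.Linear(video_dim + audio_dim, vocab_size)

    def forward(self, video_vec, audio_vec):
        av_vec = torch.cat([video_vec, audio_vec], dim=-1)  # vector cascade (concatenation) layer
        return self.classifier(av_vec).softmax(dim=-1)      # third lip language recognition result

head = WordLevelAVHead()
print(head(torch.randn(2, 256), torch.randn(2, 256)).shape)  # torch.Size([2, 500])
```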
In one embodiment, when the student model is used for sentence-level lip language recognition, step 708 comprises: determining a feature vector of a previously output character; inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal into the audio-visual processing network of the master model; obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector through a multi-head attention coding layer of the audio-visual processing network; and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
Referring to the network structure of fig. 6, the computer device inputs the sequence of video frames in the training sample into the video processing network, and obtains a video output vector through the processing of the feature extraction layer and the feature mapping layer of the video processing network, and the computer device inputs the audio signal in the training sample into the audio processing network, and obtains an audio output vector through the processing of the feature extraction layer and the feature mapping layer of the audio processing network in sequence.
In order to utilize the influence of previous output characters on current output characters, in a multi-head attention coding layer of an audio-visual processing network, a video output vector and an audio output vector are continuously coded by utilizing the characteristic vector of the previous characters to obtain a video coding vector and an audio coding vector, the video coding vector and the audio coding vector are cascaded through a cascade layer of the audio-visual processing network to obtain an audio-visual combined output vector, and a third lip language recognition result is obtained according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
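A hedged sketch of the sentence-level audiovisual processing network is shown below, using standard multi-head attention in which the feature vector of the previously output character attends to the video and audio output vectors; all dimensions, the use of separate attention blocks per modality, and a recent PyTorch version with batch_first attention are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SentenceLevelAVHead(nn.Module):
    """Sketch of the sentence-level audiovisual head (dimensions and layout assumed)."""
    def __init__(self, dim=256, vocab_size=500, heads=4):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, vocab_size)

    def forward(self, prev_char_vec, video_seq, audio_seq):
        # prev_char_vec: (batch, 1, dim) feature vector of the previously output character
        # video_seq / audio_seq: (batch, T, dim) output vectors of the video / audio back ends
        video_code, _ = self.video_attn(prev_char_vec, video_seq, video_seq)  # video coding vector
        audio_code, _ = self.audio_attn(prev_char_vec, audio_seq, audio_seq)  # audio coding vector
        av = torch.cat([video_code, audio_code], dim=-1)                      # cascade layer
        return self.classifier(av).softmax(dim=-1)  # third recognition result for the current character

head = SentenceLevelAVHead()
out = head(torch.randn(2, 1, 256), torch.randn(2, 29, 256), torch.randn(2, 40, 256))
print(out.shape)  # torch.Size([2, 1, 500])
```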
Regarding step 204, the specific implementation of determining the temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on the temporary training sample obtained from the training samples, i.e. the way the temporary student loss is constructed, is consistent with the way the student loss is constructed for the student model in the student training stage of the alternate training, which will be described in detail later.
With respect to student feedback loss in step 206, cross-entropy loss may be used.
In one embodiment, determining the student feedback loss according to the result obtained by the temporary student model performing lip language recognition on the verification sample obtained from the training samples and the label data of the verification sample comprises: inputting the video frame sequence in the verification sample into the temporary student model; extracting video features corresponding to the video frame sequence through the feature extraction layer of the temporary student model; obtaining a video output vector according to the video features through the feature mapping layer of the temporary student model; obtaining a lip language recognition result according to the video output vector through the output layer of the temporary student model; and constructing a cross entropy loss according to the lip language recognition result and the label data of the verification sample, and taking the cross entropy loss as the student feedback loss.
In one embodiment, the computer device obtains the student model updated in the previous alternate training, updates it again with the temporary student loss determined by the temporary training sample to obtain the temporary student model, and obtains the temporary student model by using the following formula:

$$\theta_{ts} = \theta_s - \alpha \nabla_{\theta_s} L_s(\theta_s)$$

where $L_s$ represents the temporary student loss determined using the temporary training sample, $\theta_s$ represents the model parameters of the student model updated in the previous alternate training, $\theta_{ts}$ represents the model parameters of the temporary student model, and $\alpha$ represents the learning rate.
After the computer device inputs the verification sample into the temporary student model, the student feedback loss can be constructed with the following formula:

$$L_{ts} = L_{CE}(y_1, f_{ts}(X_V; \theta_{ts}))$$

where $y' = f_{ts}(X_V; \theta_{ts})$ represents the result obtained by the temporary student model $f_{ts}$ performing lip language recognition on the video frame sequence in the verification sample, and $y_1$ represents the label data of the verification sample.
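The temporary update and the student feedback loss can be sketched as follows (PyTorch 2.x style, using torch.func.functional_call); the single SGD-like inner step, the learning rate value and the assumption that the student network returns unnormalized logits are illustrative choices, not the claimed implementation:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call   # available in PyTorch >= 2.0

def student_feedback_loss(student, temp_loss, val_frames, val_labels, lr=0.01):
    """Sketch of the master-stage inner step: build the temporary student and
    evaluate it on the verification sample.

    temp_loss  : temporary student loss L_s computed on the temporary training sample
    val_frames : video frame sequence of the verification sample
    val_labels : label data y1 of the verification sample
    Assumes the student network returns unnormalized logits.
    """
    names, params = zip(*student.named_parameters())
    # theta_ts = theta_s - alpha * grad_{theta_s} L_s ; the temporary parameters are never stored
    grads = torch.autograd.grad(temp_loss, params, create_graph=True)
    temp_params = {n: p - lr * g for n, p, g in zip(names, params, grads)}
    logits = functional_call(student, temp_params, (val_frames,))
    return F.cross_entropy(logits, val_labels)   # L_ts = CE(y1, f_ts(X_V; theta_ts))
```

Because the temporary parameters are built with create_graph=True, gradients of the returned loss can flow back through the temporary update to whatever produced temp_loss, which is what lets the feedback loss guide the master model.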
Cross-entropy loss may also be used for the master recognition loss in step 206. In one embodiment, determining the master recognition loss according to the result obtained by the master model performing lip language recognition on the master training sample obtained from the training samples and the label data of the master training sample comprises: inputting the master training sample into the master model to obtain a corresponding first lip language recognition result, second lip language recognition result and third lip language recognition result; determining a first cross entropy loss according to the label data of the master training sample and the first lip language recognition result, determining a second cross entropy loss according to the label data of the master training sample and the second lip language recognition result, determining a third cross entropy loss according to the label data of the master training sample and the third lip language recognition result, and fusing the first, second and third cross entropy losses to obtain the master recognition loss.
For the specific implementation of inputting the master training sample into the master model to obtain the corresponding first, second and third lip language recognition results, reference may be made to the processing flow of lip language recognition on the training sample by the master model described above with respect to fig. 7, and to the foregoing detailed description of the master model based on the combination of the video stream and the audio stream.
Specifically, the computer device inputs the video frame sequence in the master training sample into the video processing network of the master model; video features corresponding to the video frame sequence are extracted through the feature extraction layer of the video processing network, a video output vector is obtained according to the video features through the feature mapping layer of the video processing network, and a first lip language recognition result is obtained according to the video output vector through the output layer of the video processing network. The computer device inputs the audio signal in the master training sample into the audio processing network of the master model, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio processing network, obtains an audio output vector according to the audio features through the feature mapping layer of the audio processing network, and obtains a second lip language recognition result according to the audio output vector through the output layer of the audio processing network. The audio-visual processing network in the master model obtains an audio-visual combined output vector based on the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal, and obtains a third lip language recognition result based on the audio-visual combined output vector.
In one embodiment, the computer device may construct the master recognition loss using the following formula:

$$L_m = \lambda_m \big( L_{CE}(y_2, f_m(X_A, X_V; \theta_{AV})) + L_{CE}(y_2, f_m(X_A; \theta_A)) + L_{CE}(y_2, f_m(X_V; \theta_V)) \big)$$

where $\lambda_m$ denotes a balance factor, $f_m(X_A, X_V; \theta_{AV})$ denotes the third lip language recognition result corresponding to the master training sample, $f_m(X_A; \theta_A)$ denotes the second lip language recognition result corresponding to the master training sample, $f_m(X_V; \theta_V)$ denotes the first lip language recognition result corresponding to the master training sample, and $y_2$ denotes the label data of the master training sample.
Then, the total loss used to optimize the master model in the master training stage can be represented by the following formula:

$$L_{master} = L_{ts} + \lambda_m \big( L_{CE}(y_2, f_m(X_A, X_V; \theta_{AV})) + L_{CE}(y_2, f_m(X_A; \theta_A)) + L_{CE}(y_2, f_m(X_V; \theta_V)) \big)$$
Gradient back propagation is performed through the student feedback loss and the master recognition loss to update the model parameters of the master model, and the master model updated in the current alternate training is obtained.
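A hedged sketch of the master-stage objective is shown below; the packaging of the three master outputs, the balance factor value and the optimizer handling are assumptions for illustration:

```python
import torch.nn.functional as F

def master_phase_loss(master_outputs, y2, feedback_loss, lambda_m=1.0):
    """Sketch of the master-stage objective L_master = L_ts + master recognition loss.

    master_outputs : (video_logits, audio_logits, av_logits) produced by the master
                     model on the master training sample, i.e. f_m(X_V), f_m(X_A),
                     f_m(X_A, X_V)
    y2             : label data of the master training sample
    feedback_loss  : the student feedback loss L_ts from the verification sample
    """
    video_logits, audio_logits, av_logits = master_outputs
    master_recog_loss = lambda_m * (
        F.cross_entropy(av_logits, y2)
        + F.cross_entropy(audio_logits, y2)
        + F.cross_entropy(video_logits, y2)
    )
    return feedback_loss + master_recog_loss

# Typical use (assumed): loss = master_phase_loss(outs, y2, l_ts); loss.backward(); opt_master.step()
```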
Next, the optimization process of the student model in the student training phase of the alternate training will be described.
In the student training stage, only model parameters of a student model are updated, a training target comprises cross entropy loss and cross-modal fusion loss, the cross entropy loss is used for improving the classification accuracy of the student model, and the cross-modal fusion loss is used for matching the output between a student and an master model, so that the student model learns the cross-modal knowledge from the master model.
In one embodiment, as shown in fig. 8, the model training of the student model updated in the previous alternate training based on the teacher model and the training sample updated in the current alternate training in step 208 to obtain the student model updated in the current alternate training includes:
step 802, obtaining student training samples from training samples;
and step 804, determining the loss of the student according to the result obtained by performing lip language recognition on the student training sample by the student model updated in the previous alternate training and the result obtained by performing lip language recognition on the student training sample by the master model updated in the current alternate training.
And 806, updating the updated student model of the previous alternate training according to the student loss, and then obtaining the updated student model of the current alternate training.
In one embodiment, as shown in FIG. 9, step 804 includes:
and 902, performing lip language recognition on the video frame sequence in the student training sample through the student model updated by the previous alternate training to obtain a student recognition result, and constructing cross entropy loss according to the student recognition result and the label data of the student training sample.
Specifically, the computer device can input the video frame sequence of the student training sample into the student model which is updated in the previous alternate training, extract the video features corresponding to the video frame sequence through the feature extraction layer of the student model, obtain the video output vector according to the video features through the feature mapping layer of the student model, and obtain the student identification result according to the video output vector through the output layer of the student model.
In one embodiment, in the word-level lip language recognition scenario, after the student training sample is input into the student model to obtain the student recognition result, the corresponding cross entropy loss can be expressed by the following formula:

$$L_{CE}(y, y') = -\sum_{k=1}^{K} y_k \log y'_k$$

$$y = [y_1, y_2, y_3, \ldots, y_K]; \quad y' = [y'_1, y'_2, y'_3, \ldots, y'_K]$$

where $y$ represents the label data of the student training sample, $K$ represents the vocabulary size, and $y'$ represents the student recognition result of the student model on the student training sample, which can be recorded as $f_s(X_V; \theta_s)$; $L_{CE}$ represents the cross entropy loss.
In the sentence-level lip language recognition scenario, the computer device may obtain the loss generated by each character in the sentence by using the above formula, and obtain the cross entropy loss of the sentence according to the loss generated by all the characters.
Step 904, constructing a cross-modal fusion loss according to the student recognition result and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample.
In this embodiment, knowledge extraction from the speech modality to the video modality is necessary for lip language recognition, because distinct phoneme features and video features can avoid ambiguity. The master model outputs different types of cross-modal knowledge, namely audio knowledge, video knowledge and audiovisual knowledge, so as to further refine the teaching content and improve the guidance effect on the student model.
Specifically, the computer device may input the video frame sequence of the student training sample into the video processing network of the master model; video features corresponding to the video frame sequence are extracted through the feature extraction layer of the video processing network, a video output vector is obtained according to the video features through the feature mapping layer of the video processing network, and a first lip language recognition result is obtained according to the video output vector through the output layer of the video processing network. The computer device inputs the audio signal in the student training sample into the audio processing network of the master model, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio processing network, obtains an audio output vector according to the audio features through the feature mapping layer of the audio processing network, and obtains a second lip language recognition result according to the audio output vector through the output layer of the audio processing network. The audio-visual processing network in the master model obtains an audio-visual combined output vector based on the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal, and obtains a third lip language recognition result based on the audio-visual combined output vector.
Then, the computer device can construct the cross-modal fusion loss according to the student recognition result output by the student model, the first lip language recognition result, the second lip language recognition result and the third lip language recognition result output by the master model.
Further, because there is an inherent modal difference between video modality data and audio modality data, how to fuse the cross-modal knowledge when updating the student model becomes a further problem to be solved. The embodiment of the application introduces two pre-trained teaching assistant networks, namely a video teaching assistant network (tutorV) and an audio teaching assistant network (tutorA); the video information and audio information output by the pre-trained teaching assistant networks are used as extra cross-modal guidance and are encoded into weighting coefficients, which are used as the degrees of preference of the student model for video information and audio information, so that the student model can balance its learning preference for video features and audio features during training.
In one embodiment, as shown in FIG. 10, step 904 comprises:
step 1002, after obtaining a video output vector corresponding to a video frame sequence in a student training sample through a pre-trained video teaching aid network, encoding the video output vector into a video preference coefficient.
The video teaching aid network is a network based on video streaming, the audio teaching aid network is a network based on audio streaming, and parameters of the video teaching aid network and the audio teaching aid network are not updated in the process of alternately training the master model and the student model. The video teaching assistant network is used for extracting video information of a video frame sequence in the training sample, and the audio teaching assistant network is used for extracting audio information of an audio signal in the training sample. The information provided by both of them can be used to balance the knowledge of the different modalities.
Specifically, the computer device inputs the video frame sequence in the student training sample into the pre-trained video teaching assistant network, extracts video features corresponding to the video frame sequence through the feature extraction layer of the video teaching assistant network, obtains a video output vector according to the video features through the feature mapping layer of the video teaching assistant network, records the video output vector obtained by the video teaching assistant network as $H_V$, and encodes the video output vector into a video preference coefficient, which can be denoted as $W_V$.
Step 1004, after obtaining an audio output vector corresponding to the audio signal in the student training sample through the pre-trained audio teaching assistant network, encoding the audio output vector into an audio preference coefficient.
Similarly, the computer device inputs the audio signal in the student training sample into the pre-trained audio teaching assistant network, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio teaching assistant network, obtains an audio output vector according to the audio features through the feature mapping layer of the audio teaching assistant network, records the audio output vector obtained by the audio teaching assistant network as $H_A$, and encodes the audio output vector into an audio preference coefficient, which can be denoted as $W_A$.
Step 1006, determining a first focal loss according to the student recognition result and the first lip language recognition result, determining a second focal loss according to the student recognition result and the second lip language recognition result, and determining a third focal loss according to the student recognition result and the third lip language recognition result.
In this embodiment, in order to enable the student model to dynamically learn the cross-modal knowledge extracted by the master model and to balance the learning effect of the student, a focal loss (Focal Loss) is used to alleviate the imbalance between easy and hard training samples.
Step 1008, weighting the first focal loss according to the video preference coefficient, weighting the second focal loss according to the audio preference coefficient, and fusing the weighted first focal loss and second focal loss with the third focal loss to obtain the cross-modal fusion loss.
In one embodiment, the computer device may adopt the following formula as the cross-modal fusion loss:

$$L_{DF} = L_F(f_S(X_V; \theta_S), f_m(X_A, X_V; \theta_{AV})) + W_A L_F(f_S(X_V; \theta_S), f_m(X_A; \theta_A)) + W_V L_F(f_S(X_V; \theta_S), f_m(X_V; \theta_V))$$

where $L_F$ denotes the focal loss, $f_m(X_V; \theta_V)$ represents the first lip language recognition result output by the master model for the student training sample, $f_m(X_A; \theta_A)$ represents the second lip language recognition result output by the master model for the student training sample, $f_m(X_A, X_V; \theta_{AV})$ represents the third lip language recognition result output by the master model for the student training sample, $f_S(X_V; \theta_S)$ represents the student recognition result output by the student model for the student training sample, $W_A$ represents the audio preference coefficient, and $W_V$ represents the video preference coefficient.
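Since the exact form of the focal loss $L_F$ between the student distribution and a master distribution is not reproduced here, the following sketch assumes a focal-style weighting of the cross entropy between the two probability distributions; the exponent gamma and the scalar preference coefficients are illustrative assumptions:

```python
import torch

def focal_distill(student_probs, master_probs, gamma=2.0, eps=1e-8):
    """Assumed form of the focal loss L_F between two probability distributions:
    cross entropy towards the master distribution, down-weighted where the student
    already agrees with the master (focal-style modulation)."""
    ce = -(master_probs * torch.log(student_probs + eps)).sum(dim=-1)
    agreement = (student_probs * master_probs).sum(dim=-1)
    return ((1.0 - agreement).pow(gamma) * ce).mean()

def cross_modal_fusion_loss(student_probs, video_probs, audio_probs, av_probs, w_a, w_v):
    """Sketch of L_DF: the audiovisual term is unweighted, while the audio and video
    terms are weighted by the preference coefficients W_A and W_V (scalars assumed)."""
    return (focal_distill(student_probs, av_probs)
            + w_a * focal_distill(student_probs, audio_probs)
            + w_v * focal_distill(student_probs, video_probs))
```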
And step 906, determining the student loss according to the cross entropy loss and the cross modal fusion loss.
From the above derivation, the total student loss in the student training stage can adopt the following formula:

$$L_s = L_{CE}(y, f_s(X_V; \theta_s)) + \lambda_a L_{DF}$$

where $\lambda_a$ represents the regularization balance factor; the optimized parameters $\theta_s^*$ are then calculated as

$$\theta_s^* = \arg\min_{\theta_s} L_s$$
In the student training stage, the computer device performs gradient back propagation through the student loss to update the model parameters of the student model. After the student model updated in the current alternate training is obtained, the next alternate training is continued, that is, the master model and the student model continue to be trained alternately on the basis of the updated student model and the training samples, until the iteration stop condition is met, and the lip language recognition model is obtained according to the updated student model.
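A minimal sketch of one student-stage update under these formulas might look as follows; the optimizer and the value of the balance factor are assumptions:

```python
def student_phase_step(student_optimizer, ce_loss, fusion_loss, lambda_a=1.0):
    """Sketch of one student-stage update: L_s = L_CE + lambda_a * L_DF (lambda_a assumed).
    Only the student model's parameters are registered in `student_optimizer`."""
    loss = ce_loss + lambda_a * fusion_loss
    student_optimizer.zero_grad()
    loss.backward()            # gradient back propagation through the student loss
    student_optimizer.step()
    return loss.detach()
```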
In one embodiment, encoding the audio output vector into audio good coefficients comprises: performing full-connection processing on the video output vector through a first full-connection layer in the cross-modal fusion network to obtain a video full-connection vector; performing full-connection processing on the audio output vector through a second full-connection layer in the cross-modal fusion network to obtain an audio full-connection vector; and connecting the video full-connection vector and the audio full-connection vector in series through a third full-connection layer in the cross-modal fusion network, and then performing full-connection processing to obtain an audio preference coefficient.
The cross-modal fusion network is a network for fusing knowledge of different modalities. The cross-modal fusion network is used as a part of the master model: it is updated in the master training stage and is not updated in the student training stage. In this embodiment, the cross-modal fusion network includes three fully connected layers: a first fully connected layer for performing full-connection processing on the video information, whose network parameters may be recorded as $\theta_{FV}$; a second fully connected layer for performing full-connection processing on the audio information, whose network parameters may be recorded as $\theta_{FA}$; and a third fully connected layer for merging the video information and the audio information, whose network parameters may be recorded as $\theta_{FAV}$.
Specifically, the computer device may obtain the audio preference coefficient and the video preference coefficient by using the following formulas:

$$H'_A = FC(H_A; \theta_{FA})$$
$$H'_V = FC(H_V; \theta_{FV})$$
$$W = \phi\big(FC(H'_A \oplus H'_V; \theta_{FAV})\big)$$
$$W_A = W; \quad W_V = 1 - W$$

where $H_V$ represents the video output vector obtained through the video teaching assistant network, $H_A$ represents the audio output vector obtained through the audio teaching assistant network, $FC(\cdot; \theta)$ represents a fully connected layer with network parameters $\theta$, $\oplus$ denotes the concatenation operation, and $\phi$ denotes the sigmoid function.
It can be understood that, in the student training stage, the cross-modality fusion network is not updated as part of the master model, and the cross-modality fusion network is updated in the master training stage through student feedback loss, that is, the network parameters of the three fully-connected layers are all updated in the master training stage.
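A hedged sketch of the cross-modal fusion network with its three fully connected layers is given below; the hidden size and input dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CrossModalFusionNet(nn.Module):
    """Sketch of the three fully connected layers producing W_A and W_V (sizes assumed)."""
    def __init__(self, audio_dim=256, video_dim=256, hidden=128):
        super().__init__()
        self.fc_audio = nn.Linear(audio_dim, hidden)   # theta_FA
        self.fc_video = nn.Linear(video_dim, hidden)   # theta_FV
        self.fc_av = nn.Linear(2 * hidden, 1)          # theta_FAV

    def forward(self, h_audio, h_video):
        ha = self.fc_audio(h_audio)                    # H'_A
        hv = self.fc_video(h_video)                    # H'_V
        w = torch.sigmoid(self.fc_av(torch.cat([ha, hv], dim=-1)))  # W
        return w, 1.0 - w                              # W_A, W_V

fusion = CrossModalFusionNet()
w_a, w_v = fusion(torch.randn(2, 256), torch.randn(2, 256))
print(w_a.shape, w_v.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```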
Fig. 11 is a schematic diagram of a model framework for training a student model in the student training phase of the alternate training in one embodiment. Referring to fig. 11, after obtaining a student model updated by the previous alternate training and an instructor model updated by the current alternate training, inputting a video frame sequence of the student training samples into the student model, inputting audio signals of the student training samples into the instructor model, constructing cross entropy loss by using a student recognition result of the student model, constructing cross modal fusion loss by using a student recognition result of the student model and an output result of the instructor model, and updating the student model according to the cross entropy loss and the cross modal fusion loss to obtain the student model updated by the current alternate training.
The update process for the student model during the student training stage has been described above. As mentioned earlier, the way of constructing the temporary student loss during the master training stage of the alternate training in step 204 is consistent with the way of constructing the student loss for the student model during the student training stage; the way of constructing the temporary student loss is therefore only briefly described here, and for details reference may be made to the update process for the student model during the student training stage, which is not repeated here.
In one embodiment, as shown in fig. 12, step 204 (determining a temporary student loss according to the results obtained by the student model and the master model respectively performing lip language recognition on the temporary training sample obtained from the training samples) includes:
step 1202, performing lip language recognition on the video frame sequence in the temporary training sample through the student model to obtain a temporary student recognition result, and constructing cross entropy loss according to the temporary student recognition result and the label data of the temporary training sample.
The student model is temporarily updated in the teacher training stage, the process is the same as that of the student model optimization in the student training stage, and the temporary student model obtained through temporary updating cannot be stored and is only used for the teacher model to determine the learning state of the current student model.
In the student training stage, only model parameters of a student model are updated, a training target comprises cross entropy loss and cross-modal fusion loss, the cross entropy loss is used for improving the classification accuracy of the student model, and the cross-modal fusion loss is used for matching the output between a student and an master model, so that the student model learns the cross-modal knowledge from the master model. Here, the same processing steps are also used for updating the student model which is alternately updated last time again in the master training stage to obtain the temporary student model.
Specifically, the computer device can input the video frame sequence of the temporary training sample into the student model which is updated in the previous alternate training, extract the video features corresponding to the video frame sequence through the feature extraction layer of the student model, obtain the video output vector according to the video features through the feature mapping layer of the student model, and obtain the temporary student identification result according to the video output vector through the output layer of the student model.
And 1204, constructing a cross-modal fusion loss according to the temporary student recognition result and a first lip language recognition result, a second lip language recognition result and a third lip language recognition result obtained by lip language recognition of the temporary training sample by the master model.
Specifically, the computer device may input the video frame sequence of the temporary training sample into the video processing network of the master model, extract video features corresponding to the video frame sequence through the feature extraction layer of the video processing network, obtain a video output vector according to the video features through the feature mapping layer of the video processing network, and obtain a first lip language recognition result according to the video output vector through the output layer of the video processing network. The computer device inputs the audio signal in the temporary training sample into the audio processing network of the master model, extracts audio features corresponding to the audio signal through the feature extraction layer of the audio processing network, obtains an audio output vector according to the audio features through the feature mapping layer of the audio processing network, and obtains a second lip language recognition result according to the audio output vector through the output layer of the audio processing network. The audio-visual processing network in the master model obtains an audio-visual combined output vector based on the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal, and obtains a third lip language recognition result based on the audio-visual combined output vector.
Then, the computer device can construct the cross-modal fusion loss according to the temporary student recognition result output by the student model, the first lip language recognition result, the second lip language recognition result and the third lip language recognition result output by the master model.
In one embodiment, step 1204 comprises: after obtaining a video output vector corresponding to the video frame sequence in the temporary training sample through the pre-trained video teaching assistant network, encoding the video output vector into a video preference coefficient; after obtaining an audio output vector corresponding to the audio signal in the temporary training sample through the pre-trained audio teaching assistant network, encoding the audio output vector into an audio preference coefficient; determining a first focal loss according to the temporary student recognition result and the first lip language recognition result, determining a second focal loss according to the temporary student recognition result and the second lip language recognition result, and determining a third focal loss according to the temporary student recognition result and the third lip language recognition result; and weighting the first focal loss according to the video preference coefficient, weighting the second focal loss according to the audio preference coefficient, and fusing the weighted first focal loss and second focal loss with the third focal loss to obtain the cross-modal fusion loss.
In one embodiment, encoding the audio output vector into the audio preference coefficient comprises: performing full-connection processing on the video output vector through the first fully connected layer in the cross-modal fusion network to obtain a video fully connected vector; performing full-connection processing on the audio output vector through the second fully connected layer in the cross-modal fusion network to obtain an audio fully connected vector; and concatenating the video fully connected vector and the audio fully connected vector and performing full-connection processing through the third fully connected layer in the cross-modal fusion network to obtain the audio preference coefficient.
And step 1206, determining the temporary student loss according to the cross entropy loss and the cross modal fusion loss.
From the above derivation, the temporary student loss in the master training stage can be represented by the following formula:

$$L_s = L_{CE}(y, f_s(X_V; \theta_s)) + \lambda_a L_{DF}$$

where $y$ represents the label data of the temporary training sample, $f_s(X_V; \theta_s)$ represents the temporary student recognition result obtained by the student model updated in the previous alternate training performing lip language recognition on the temporary training sample, $L_{CE}$ represents the cross entropy loss, and $L_{DF}$ is the cross-modal fusion loss.
As derived before, in the master training stage, after obtaining the temporary student loss, the computer device may obtain the temporary student model using the following formula:

$$\theta_{ts} = \theta_s - \alpha \nabla_{\theta_s} L_s(\theta_s)$$

where $L_s$ represents the temporary student loss determined using the temporary training sample, $\theta_s$ represents the model parameters of the student model updated in the previous alternate training, $\theta_{ts}$ represents the model parameters of the temporary student model, and $\alpha$ represents the learning rate.
In the master training phase, after obtaining the temporary student model, the computer device may construct a student feedback loss using the following formula:
$$L_{ts} = L_{CE}(y_1, f_{ts}(X_V; \theta_{ts}))$$
from the derivation process, the computer device performs gradient back propagation in the master training stage through the above-mentioned student feedback loss to update the parameters of the fully-connected layer in the cross-modal fusion network, so that the network parameters of the fully-connected layer are trained in the master training stage.
Fig. 13 is a schematic diagram of a network structure for alternately training the master model and the student model in a specific embodiment. Referring to fig. 13, the network includes four modules: the master model (master), the student model (student), and the pre-trained audio and video teaching assistant networks (tutorA, tutorV), where the subscripts A and V denote the audio modality and the video modality, respectively. The master model is a model based on the combination of a video stream and an audio stream, the student model and the video teaching assistant network are both models based on a video stream, and the audio teaching assistant network is a model based on an audio stream. The master model takes an audio signal $X_A$ and a video frame sequence $X_V$ as input and provides three types of knowledge: $f_m(X_A; \theta_A)$ generated from the audio stream, $f_m(X_V; \theta_V)$ generated from the video stream, and $f_m(X_A, X_V; \theta_{AV})$ generated from the audiovisual combination. The student model takes the video frame sequence $X_V$ as input and outputs the probability $f_s(X_V; \theta_s)$; the video teaching assistant network takes the video frame sequence $X_V$ as input and outputs the probability $f_{tV}(X_V; \theta_{tV})$; the audio teaching assistant network takes the audio signal $X_A$ as input and outputs the probability $f_{tA}(X_A; \theta_{tA})$.
In the student training stage of the alternate training, only the model parameters $\theta_s$ of the student model are updated, and the training objective includes two terms: the cross entropy loss and the dynamic fusion loss of the student training samples. The video frame sequence of the student training sample is input into the student model to obtain the student recognition result $f_s(X_V; \theta_s)$, and the cross entropy loss is constructed from the student recognition result $f_s(X_V; \theta_s)$ and the label data $y$ of the student training sample. The video frame sequence and the audio signal of the student training sample are input into the master model to obtain $f_m(X_A; \theta_A)$, $f_m(X_V; \theta_V)$ and $f_m(X_A, X_V; \theta_{AV})$. The video frame sequence of the student training sample is input into the video teaching assistant network to obtain the video output vector $H_V$, and the audio signal of the student training sample is input into the audio teaching assistant network to obtain the audio output vector $H_A$. The dynamic fusion loss is constructed from $f_s(X_V; \theta_s)$, $f_m(X_A; \theta_A)$, $f_m(X_V; \theta_V)$, $f_m(X_A, X_V; \theta_{AV})$, $H_V$ and $H_A$.
In the master training stage of the alternate training, only the model parameters of the master model and the cross-modal fusion network are updated, and the training objective includes two terms: the student feedback loss and the master recognition loss. First, a temporary training sample is used: it is input into the student model, the master model, the video teaching assistant network and the audio teaching assistant network following the same steps as training the student model in the student training stage to obtain the temporary student loss, and the student model is updated again according to the temporary student loss to obtain a temporary student model (temporary student). Then, a verification sample is used: the video frame sequence of the verification sample is input into the temporary student model to obtain the recognition result $f_{ts}(X_V; \theta_{ts})$ of the temporary student model, and the student feedback loss is constructed from $f_{ts}(X_V; \theta_{ts})$ and the label data $y_1$ of the verification sample. Finally, a master training sample is used: the video frame sequence and the audio signal of the master training sample are input into the master model to obtain $f_m(X_A; \theta_A)$, $f_m(X_V; \theta_V)$ and $f_m(X_A, X_V; \theta_{AV})$, and the master recognition loss is constructed with the label data $y_2$ of the master training sample.
And continuously updating the student model and the master model again according to the above flow until the training stopping condition is met, and obtaining the lip language recognition model according to the optimized student model.
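The overall alternating schedule can be summarized with the following sketch; the helper functions (student_loss, master_recognition_loss, the data sampling methods) and the single update per stage are assumptions standing in for the detailed steps above:

```python
def alternate_training(student, master, fusion_net, tutor_a, tutor_v,
                       opt_student, opt_master, data, max_iters):
    """High-level sketch of the alternating schedule; student_loss,
    master_recognition_loss, student_feedback_loss and the data-sampling methods
    are assumed helpers wrapping the losses described above."""
    for it in range(max_iters):
        # Master training stage: update the master model and the cross-modal fusion network.
        temp_batch, val_batch, master_batch = data.sample_master_stage(it)
        temp_loss = student_loss(student, master, fusion_net, tutor_a, tutor_v, temp_batch)
        l_ts = student_feedback_loss(student, temp_loss, *val_batch)     # via the temporary student
        l_master = l_ts + master_recognition_loss(master, master_batch)  # L_master = L_ts + L_m
        opt_master.zero_grad(); l_master.backward(); opt_master.step()

        # Student training stage: update only the student model.
        student_batch = data.sample_student_stage(it)
        l_s = student_loss(student, master, fusion_net, tutor_a, tutor_v, student_batch)
        opt_student.zero_grad(); l_s.backward(); opt_student.step()
    return student   # the lip language recognition model is obtained from the optimized student
```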
In the related art, when training samples are selected from a training set, they are generally input into the model to be optimized after random sampling; in this way the training samples are not ordered, which affects the effectiveness of the training process to a certain extent. Therefore, based on a curriculum learning strategy, the embodiment of the application lets the model learn lip language recognition knowledge from simple samples first and gradually increases the sample difficulty, so as to facilitate better convergence of the model.
In one embodiment, the method further comprises: determining a learning difficulty coefficient corresponding to each training sample in the training samples; in the process of training the student model and the master model, according to the sequence of the learning difficulty coefficients from small to large, the student training samples and master training samples required by alternate training are sequentially selected from the training samples.
Specifically, after the computer device obtains the training set, a corresponding learning difficulty coefficient is determined for each training sample in the training set. The smaller the learning difficulty coefficient, the more easily the model classifies the training sample correctly and the lower the learning difficulty of the training sample; conversely, the larger the learning difficulty coefficient, the harder it is for the model to classify the training sample correctly and the higher the learning difficulty of the training sample. In the master training stage and the student training stage of the alternate training, when training samples are obtained from the training set, they are selected and input into the model sequentially in order of learning difficulty coefficient from small to large.
In one embodiment, determining the learning difficulty coefficient corresponding to each training sample in the training samples includes: processing the video frame sequence in each training sample through a pre-trained video assistant network to obtain the video confidence of the lip language prediction category of each training sample; processing the audio signals in each training sample through a pre-trained audio teaching aid network to obtain the audio confidence of the lip language prediction category of each training sample; and fusing the video confidence coefficient and the audio confidence coefficient to obtain the category confidence coefficient of each training sample, and determining the learning difficulty coefficient corresponding to each training sample according to the category confidence coefficient.
The confidence coefficient is inversely proportional to the learning difficulty coefficient, and the higher the class confidence coefficient is, the more easily the model predicts the training sample accurately, so the lower the learning difficulty coefficient of the training sample is, otherwise, the lower the class confidence coefficient is, the higher the learning difficulty coefficient of the training sample is.
In one embodiment, the computer device obtains the learning difficulty coefficient of a training sample with a scoring function that sorts the training samples according to the fused confidence of the two teaching assistant networks, $C(f_{tA}(X_A; \theta_{tA}))$ and $C(f_{tV}(X_V; \theta_{tV}))$, where $X_A^n$ represents the $n$th segment of the audio signal, $X_V^m$ represents the $m$th video frame in the video frame sequence, $C(\cdot)$ represents the confidence, and $\mathrm{sort}(\cdot)$ represents the sorting operation. The higher the confidence, the more easily the training sample is learned by the model, and the lower its learning difficulty coefficient. Optionally, when multiple training samples have the same learning difficulty coefficient, training samples with a higher confidence $C(f_{tV}(X_V; \theta_{tV}))$ in the video modality are preferentially selected.
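A hedged sketch of ordering the training samples by the teaching assistant confidences might look as follows; the use of the maximum class probability as the confidence and the tie-breaking tuple are illustrative assumptions:

```python
import torch

def rank_by_difficulty(samples, tutor_a, tutor_v):
    """Sketch of curriculum ordering: rank samples by the combined confidence of the
    pre-trained teaching assistant networks, most confident (easiest) first.
    `samples` is assumed to be a list of (audio, frames, label) triples; the maximum
    class probability is used as the confidence, and ties are broken by the video
    confidence."""
    keys = []
    with torch.no_grad():
        for audio, frames, _ in samples:
            conf_a = tutor_a(audio.unsqueeze(0)).softmax(-1).max().item()   # audio confidence
            conf_v = tutor_v(frames.unsqueeze(0)).softmax(-1).max().item()  # video confidence
            keys.append((conf_a + conf_v, conf_v))
    order = sorted(range(len(samples)), key=lambda i: keys[i], reverse=True)
    return [samples[i] for i in order]
```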
In one embodiment, the method further comprises: determining the number of target samples required by the current alternate training according to the current iteration times, wherein the number of the target samples is gradually increased along with the iteration times; and acquiring training samples of the target sample number to perform the current alternate training.
For example, in the first alternate training process, 10 batches of small-batch training samples are respectively obtained by the computer device to optimize the master model 10 times in the master training stage and the student training stage, the number of each batch of small-batch training samples is 30, in the next alternate training process, 10 batches of small-batch training samples are still respectively obtained by the computer device to optimize the master model 10 times in the master training stage and the student training stage, and the number of each batch of small-batch training samples is 40.
For another example, in the first alternate training process, in the master training stage and the student training stage, the computer device obtains 10 small batches of training samples respectively and optimizes the model 10 times, where the number of training samples in each small batch increases sequentially: the first batch contains 10 training samples, the second batch 15, the third batch 20, and so on, up to 55 training samples in the 10th batch.
In one embodiment, the computer device determines the increment of the number of training samples during training using a pacing function, in which the input percentage $G_i$ of the number of training samples at the $i$th iteration grows exponentially from an initial percentage $G_0$ with an exponential factor $P$ ($P$ may be taken as 1.75), where $\xi$ represents the number of iterations in the alternate training.
Based on the scoring function and the pacing function, the difficulty of the training samples and the increment of the number of training samples can be determined more reasonably; this strategy reduces learning ambiguity at the start of training and enables the model to converge better.
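An illustrative pacing schedule consistent with this description is sketched below; the exact formula of the original pacing function is not reproduced, so the growth rule here is an assumption:

```python
def pacing_percentage(iteration, g0=0.1, p=1.75, xi=10):
    """Assumed pacing schedule: the percentage of the difficulty-sorted training set
    used at a given iteration starts at g0 and grows by the factor p every xi
    iterations, capped at 100%."""
    return min(1.0, g0 * (p ** (iteration // xi)))

# e.g. with g0 = 0.1: 10% of the easiest samples at first, then 17.5%, 30.6%, ...
# until the whole training set is used.
```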
In one embodiment, the method further comprises: acquiring a video frame sequence to be identified; inputting a video frame sequence to be recognized into a trained lip language recognition model; and outputting the speaking content corresponding to the speaker in the video frame sequence to be recognized after processing the video frame sequence to be recognized through a video processing network in the lip language recognition model.
Specifically, at the end of training, the computer device may obtain a lip language recognition model from the student model. The computer device may use the lip language recognition model directly. The computer equipment can also obtain model parameters of the lip language recognition model, set a model structure of the student model when needed and import the model parameters to obtain the lip language recognition model.
The obtained lip language recognition model is a model based on a video processing network. A computer device such as a terminal or a server can input the video frame sequence to be recognized into the trained lip language recognition model and output the speaking content corresponding to the speaker in the video frame sequence to be recognized. The video frame sequence to be recognized can be obtained from a silent video or from a video with sound; for example, in a noisy environment, when the speaking content of a speaker in the video cannot be heard clearly, the speaking content of the speaker can be recognized through the lip language recognition model.
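A minimal inference sketch, reusing the hypothetical StudentVideoStream structure from the earlier sketch, might look as follows; the file name, vocabulary and id-to-word mapping are placeholders:

```python
import torch

def recognise_lip_movements(model_path, frames, vocabulary):
    """Sketch of inference with the trained lip language recognition model.
    `model_path` and `vocabulary` are placeholders; StudentVideoStream is the
    hypothetical video-stream structure sketched earlier."""
    model = StudentVideoStream(vocab_size=len(vocabulary))
    model.load_state_dict(torch.load(model_path, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        probs = model(frames.unsqueeze(0))          # (1, K) word probabilities
    return vocabulary[probs.argmax(dim=-1).item()]  # the recognized speaking content

# word = recognise_lip_movements("lip_model.pt", frames_tensor, vocabulary)  # assumed usage
```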
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a part of the steps in the above-mentioned flowcharts may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the steps or the stages is not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a part of the steps or the stages in other steps.
In a specific embodiment, as shown in fig. 14, the processing method of the lip language recognition model includes the following steps:
step 1402, obtaining training samples and obtaining a student model and an instructor model updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
the processing steps in the master training stage comprise:
step 1404, obtaining temporary training samples from the training samples;
and step 1406, inputting the temporary training samples into the student model based on the video stream to obtain temporary student identification results, and constructing cross entropy loss according to the temporary student identification results and the label data of the temporary training samples.
Step 1408, inputting the temporary training samples into the master model, which is based on the video stream and the audio stream, and constructing a cross-modal fusion loss according to the temporary student recognition result and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model performing lip language recognition on the temporary training samples.
And step 1410, determining temporary student loss according to the cross entropy loss and the cross-modal fusion loss, and updating the student model based on the temporary student loss to obtain a temporary student model.
And 1412, acquiring a verification sample from the training sample, inputting the verification sample into the temporary student model to obtain a lip language recognition result, and constructing the feedback loss of the student according to the lip language recognition result and the label data of the verification sample.
And step 1414, acquiring master training samples from the training samples, inputting the master training samples into the master model, and determining the master recognition loss according to the label data of the master training samples and the first lip language recognition result, second lip language recognition result and third lip language recognition result obtained by the master model performing lip language recognition on the master training samples.
In step 1416, the master model is updated based on the student feedback loss and the master identification loss.
The processing steps in the student training phase comprise:
and 1418, acquiring student training samples from the training samples, inputting the student training samples into a student model based on the video stream to obtain student identification results, and constructing cross entropy loss according to the student identification results and the label data of the student training samples.
And step 1420, inputting the student training samples into an instructor model based on the video stream and the audio stream, and constructing cross-modal fusion loss according to the student recognition result and a first lip language recognition result, a second lip language recognition result and a third lip language recognition result obtained by lip language recognition of the student training samples by the instructor model.
And step 1422, determining the student loss according to the cross entropy loss and the cross-modal fusion loss, and updating the student model based on the student loss.
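For illustration only, the sketch below condenses the master training stage of the steps above (steps 1404 to 1416) into one function, assuming PyTorch models and a loss-helper object with the names used in this description. It shows the data flow only; in particular, a full implementation would keep the temporary student update differentiable with respect to the master (a second-order, meta-learning style step), which this sketch omits.

```python
import copy
import torch

def master_training_stage(student, master, losses, samples, master_opt, student_lr=1e-3):
    """One master-stage update (cf. steps 1404-1416), data flow only.

    `losses` is an assumed helper exposing cross_entropy, cross_modal_fusion
    and master_recognition, matching the losses described above.
    """
    temp_batch, val_batch, master_batch = samples  # temporary / verification / master samples

    # Steps 1404-1410: temporary student loss and a one-step temporary student.
    video, audio, labels = temp_batch
    student_out = student(video)                   # temporary student recognition result
    master_out = master(video, audio)              # first / second / third recognition results
    temp_loss = (losses.cross_entropy(student_out, labels)
                 + losses.cross_modal_fusion(student_out, master_out))
    temp_student = copy.deepcopy(student)
    grads = torch.autograd.grad(temp_loss, list(student.parameters()))
    with torch.no_grad():                          # assumed plain SGD step
        for p, g in zip(temp_student.parameters(), grads):
            p -= student_lr * g

    # Step 1412: student feedback loss on the verification sample.
    v_video, _, v_labels = val_batch
    feedback_loss = losses.cross_entropy(temp_student(v_video), v_labels)

    # Steps 1414-1416: master recognition loss plus feedback loss update the master.
    # NOTE: the real method differentiates the feedback loss through the temporary
    # update back to the master; here only the involved samples and losses are shown.
    m_video, m_audio, m_labels = master_batch
    master_loss = losses.master_recognition(master(m_video, m_audio), m_labels)
    master_opt.zero_grad()
    (feedback_loss + master_loss).backward()
    master_opt.step()
```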
Fig. 15 is a schematic flowchart of a processing method of the lip language recognition model in one embodiment. The method of fig. 15 focuses mainly on the student training stage and specifically includes the following steps:
step 1502, training samples are obtained, and a student model and an instructor model updated in the previous alternate training are obtained, wherein each training sample comprises a video frame sequence and a corresponding audio signal.
And 1504, performing lip language recognition on the video frame sequence in the student training samples obtained from the training samples according to the student models to obtain student recognition results, and constructing cross entropy loss according to the student recognition results and the label data of the student training samples.
And 1506, constructing a cross-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by the video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by the audio processing network in the master model, and a third lip language recognition result obtained by the audio-visual processing network in the master model based on the video frame sequence and the audio signal.
And 1508, determining the student loss according to the cross entropy loss and the cross modal fusion loss.
And 1510, updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training.
And 1512, returning to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training based on the student model and the master model updated in the current alternate training, and obtaining the lip language recognition model according to the student model updated when the training is stopped.
The specific embodiments of the above steps have been described above, and are not described herein again.
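As a rough illustration of how these steps fit into the outer alternating loop, consider the sketch below; the data-provider and loss-helper interfaces are assumptions, and `master_training_stage` refers to the sketch given after the fig. 14 steps above.

```python
def alternate_training(student, master, losses, data, student_opt, master_opt, rounds):
    """Outer alternating loop (cf. steps 1502-1512), with assumed helper interfaces."""
    for r in range(rounds):
        # Student training stage: cross entropy + cross-modal fusion loss.
        video, audio, labels = data.student_batch(r)
        student_out = student(video)
        master_out = master(video, audio)   # a full implementation would detach these targets
        student_loss = (losses.cross_entropy(student_out, labels)
                        + losses.cross_modal_fusion(student_out, master_out))
        student_opt.zero_grad()
        student_loss.backward()
        student_opt.step()

        # Master training stage, as sketched after the fig. 14 steps above.
        master_training_stage(student, master, losses, data.master_batches(r), master_opt)

    return student   # the updated student yields the lip language recognition model
```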
Compared with the traditional approach of guiding the learning of a student model with a pre-trained teacher model, the above processing method of the lip language recognition model trains not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into a student training stage and a master training stage that are trained alternately.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of the student training samples. In addition, the video processing network in the master model extracts knowledge of the video modality from the student training samples, the audio processing network of the master model extracts knowledge of the audio modality, and the audiovisual processing network of the master model extracts combined audiovisual knowledge; the cross-modal fusion loss obtained by fusing the knowledge of these three modalities enables the student model to learn from the master model the ability to mine multi-modal information. Guiding the training of the student model jointly by the cross entropy loss and the cross-modal fusion loss can greatly improve the learning effect of the student model. After the student model updated in the current alternate training is obtained, it can be used together with the training samples to perform model training, in the master training stage, on the master model updated in the previous alternate training; after many iterations, the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
The evaluation effect of the model training method provided by the embodiment of the present application is described below.
With respect to the datasets used for training: to evaluate the method provided by the embodiments of the present application, three benchmark datasets were used, namely one word-level dataset, LRW [3], and two sentence-level datasets, LRS2-BBC and LRS3-TED. The LRW dataset is a large word-level dataset with 500 words and more than 450,000 utterances; each video is 1.16 seconds long (29 frames). The LRS2-BBC dataset comes from BBC broadcasts and is divided into a pre-training set, a fine-tuning training set, and a validation set. The LRS3-TED dataset comes from TED talks and includes about 150,000 utterances and over 4.2 million words.
Preprocessing of training samples: to crop the lip region from the video, facial landmarks are detected using dlib; the frames are then randomly cropped and interpolated to yield 112 × 112 images centered on the lips, and the face region is also rotated and scaled.
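As an illustration of this preprocessing step, the sketch below uses dlib's 68-point facial landmark model and OpenCV; the predictor file name and the simple center crop are assumptions, and the random cropping, rotation and scaling described above are omitted.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model file is assumed to be available locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_region(frame, size=112):
    """Return a size x size crop centred on the mouth, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model outline the mouth.
    mouth = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)
    half = size // 2
    crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_LINEAR)
```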
Implementation details are as follows: in the word-level lip recognition scenario, the size of the vocabulary is set to 500, which is consistent with the vocabulary in the LRW. For sentence-level lip recognition scenarios, namely LRS2-BBC and LRS3-TED, the size of the vocabulary is set to 40, including 26 letters, 10 numbers, and 4 special marks ([ space ], [ keyboard ], [ EOS ], and punctuation marks).
In addition, during training, the student model and the master model are trained alternately using the SGD optimizer with momentum 0.9 and weight decay 1e-4. For the audio stream, the raw waveform is taken as input; for the video stream, the input video is sampled at 25 fps.
The whole training process comprises two steps, pre-training and fine-tuning. Specifically, the student model and the master model are pre-trained at the word level with a temporal convolution (TC) based back-end, using the pre-training sets of LRW, LRS2-BBC and LRS3-TED, and the pre-trained models are fine-tuned using LRW. In the sentence-level lip language recognition scenario, TM-Seq2Seq replaces TC as the back-end of the pre-trained model, training continues with the pre-training set of LRS2-BBC or LRS3-TED, and the new pre-trained model is then fine-tuned with the corresponding training set.
For pre-training, the learning rate α is set to 10^-3. During fine-tuning, α is initialized to 10^-4 and is halved whenever the validation loss curve plateaus, until the learning rate finally drops to 10^-6. Some of the hyperparameters in the preceding equations are set as follows: λ_s = 10, λ_m = 10, G_0 = 0.25, P = 1.75, and ξ = 10^7.
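A minimal sketch of this optimizer and learning-rate schedule, assuming PyTorch, is given below; the placeholder parameters and the patience value are assumptions, since the text only specifies the momentum, weight decay, initial learning rates and the halving-on-plateau behavior.

```python
import torch

# Placeholder parameters standing in for the student and master models.
student_params = [torch.nn.Parameter(torch.randn(4, 4))]
master_params = [torch.nn.Parameter(torch.randn(4, 4))]

# SGD with momentum 0.9 and weight decay 1e-4, as stated above;
# the fine-tuning learning rate starts at 1e-4.
student_opt = torch.optim.SGD(student_params, lr=1e-4, momentum=0.9, weight_decay=1e-4)
master_opt = torch.optim.SGD(master_params, lr=1e-4, momentum=0.9, weight_decay=1e-4)

# Halve the learning rate whenever the validation loss plateaus, down to 1e-6.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    student_opt, mode="min", factor=0.5, patience=2, min_lr=1e-6)

# Called once per validation pass with the loss on the verification samples:
# scheduler.step(val_loss)
```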
Regarding the evaluation index: in all experiments, the word error rate (WER) is used as the metric, defined as WER = (S + D + I)/NUM, where S, D and I are the numbers of words that are substituted, deleted and inserted, respectively, in the prediction compared with the label data, and NUM is the total number of words in the label data.
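For illustration, the WER can be computed with a standard word-level edit distance, as in the sketch below (a generic implementation, not taken from the patent).

```python
def word_error_rate(predicted, reference):
    """WER = (S + D + I) / NUM, computed from the word-level edit distance."""
    pred, ref = predicted.split(), reference.split()
    # Dynamic-programming edit distance between the two word sequences.
    dist = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(pred) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(pred) + 1):
            cost = 0 if ref[i - 1] == pred[j - 1] else 1    # substitution
            dist[i][j] = min(dist[i - 1][j] + 1,             # deletion
                             dist[i][j - 1] + 1,             # insertion
                             dist[i - 1][j - 1] + cost)
    return dist[len(ref)][len(pred)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # one deletion -> 0.25
```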
Table 1: WER obtained on LRW, LRS2-BBC and LRS3-TED, respectively
(The body of Table 1 is given in the original as images.)
Table 2: error rate on LRW
Method | THESE | THERE | THING | UNDER
Ours (without distillation) | 74% | 70% | 70% | 66%
Ours (with distillation) | 70% | 59% | 68% | 60%
Table 3: WER for students learning from different pre-trained teachers or co-trained teachers
Method | Distilled from (x = none) | WER on LRS2-BBC
Audio Teacher | x | 17.2
Student1 | Audio Teacher | 54.2
Video Teacher | x | 57.5
Student2 | Video Teacher | 53.4
Audio-Visual Teacher | x | 15.6
Student3 | Audio-Visual Teacher | 54.1
Audio Master | x | 19.1
Student4 | Audio Master | 52.1
Video Master | x | 59.1
Student5 | Video Master | 53.0
Audio-Visual Master | x | 16.9
Student6 | Audio-Visual Master | 51.5
Comparison with related art: the methods provided in the examples of this application were compared to several methods, including MT, Temporal Conv, WAS, Bi LSTM, TM-CTC, TM-Seq2Seq, Conv-Seq2Seq, LIBS, and TM-CTC-KD.
For word-level lip language recognition: Table 1 shows a quantitative comparison of the related methods on the LRW dataset for word-level lip language recognition. It can be seen that Ours-TC, provided by the embodiments of the present application, is significantly better than the baseline temporal convolution (Temporal Conv) without knowledge distillation, improving the WER by 6.7%. Furthermore, Ours-TM achieves the best performance among the compared methods; in particular, it improves by 2% over the second-best method, Conv-Seq2Seq.
For sentence-level lip language recognition, the experimental results are listed in the last two columns of Table 1. It can be observed that Ours-TM performs best on LRS2-BBC and LRS3-TED compared with the other methods. More importantly, compared with TM-Seq2Seq, the method provided by the embodiments of the present application improves the results on LRS2-BBC and LRS3-TED by 0.6% and 0.9%, respectively, while using less training data; TM-Seq2Seq employs the same back-end as Ours-TM and is additionally trained on the non-public dataset MV-LRS. Moreover, although Conv-Seq2Seq uses a more advanced structure than the student model provided by the embodiments of the present application, Ours-TM still achieves better performance, improving the WER on LRS2-BBC by 2.5% and on LRS3-TED by 1.1% compared with Conv-Seq2Seq.
For examples of misclassification: the inventors further investigated the four LRW words with the highest error rates and list the comparison between Ours-TC without KD and Ours-TC in Table 2. It can be observed that when multiple phonemes map to the same viseme, for example when the TH and DH phonemes both correspond to the viseme /t/, the accuracy of the method provided by the embodiments of the present application improves by nearly 6% on average.
In conclusion, the research results show that: (i) the distillation method provided by the embodiments of the present application can effectively improve the performance of the task-specific network; and (ii) although the embodiments of the present application focus primarily on the advantages over standard distillation methods, better performance can be obtained when the task-specific network structure is replaced with a more advanced one.
For the ablation experiments: the effectiveness of the proposed modules is investigated, including the master network, the cross-modal fusion network and the curriculum learning strategy, using a single-modality lip language recognition network as the baseline.
Effectiveness of the master. To investigate the effectiveness of the master, the inventors studied six pairs of models with different teacher or master designs and tested their respective performance on LRS2-BBC. The results are summarized in Table 3. The reported performance of the audio-visual master comes from its audiovisual branch, and the architecture of each pre-trained teacher is exactly the same as that of its corresponding master. Furthermore, the curriculum learning strategy is not used here.
The inventors make the following observations and analyses. (1) For a single model without KD, whether the model is trainable during distillation (i.e., a master model) or frozen (i.e., a teacher model), its performance across modalities always decreases in the order {audio-visual modality (AV), audio modality (A), video modality (V)}. This verifies the importance of learning from cross-modal data rather than single-modality data. (2) When knowledge is distilled from the teacher models and from the master models, the student models' performance, from best to worst, is {V, AV, A} and {AV, A, V}, respectively. The first ordering shows that the audiovisual modality can provide additional information compared with the audio modality, helping to mitigate the ambiguity across the modality gap, but that a simple fusion strategy (concatenation) is limited. The second ordering shows the effectiveness of the master model, which can narrow the cross-modal differences to some extent, because the master model is dynamically adjusted based on the student model's task-specific feedback. (3) In every modality, student models that learned from a master model always perform better than student models that learned from a teacher model. These facts indicate that, despite sacrificing some performance of its own, the co-trained master model is more effective than a pre-trained teacher model due to its adaptability to the student model.
In one embodiment, as shown in fig. 16, there is provided a processing apparatus 1600 for a lip language recognition model, which may be a part of a computer device by using a software module or a hardware module, or a combination of the two modules, and specifically includes: a sample acquisition module 1602, a temporary student model acquisition module 1604, an master model training module 1606, and an iteration module 1608, wherein:
a sample obtaining module 1602, configured to obtain training samples and obtain a student model and an instructor model updated in a previous alternate training, where each training sample includes a sequence of video frames and a corresponding audio signal;
a temporary student model obtaining module 1604, configured to determine a temporary student loss according to results obtained by performing lip language recognition on temporary training samples obtained from the training samples by the student model and the master model, and update the student model based on the temporary student loss to obtain a temporary student model;
the master model training module 1606 is configured to determine a student feedback loss according to a result obtained by performing lip language recognition on the verification sample obtained from the training sample by using the provisional student model and tag data of the verification sample, and determine a master recognition loss according to a result obtained by performing lip language recognition on the master training sample obtained from the training sample by using the master model and tag data of the master training sample; obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training sample to obtain an updated student model of the current alternate training;
an iteration module 1608, configured to, based on the student model and the master model updated in the current alternate training, return to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training, and obtain the lip language recognition model according to the updated student model when the training is stopped.
In one embodiment, the processing apparatus 1600 of the lip language recognition model further includes a student recognition module, configured to input the sequence of video frames in the training sample into the student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to the video characteristics through a characteristic mapping layer of the student model; and obtaining a lip language recognition result according to the video output vector through an output layer of the student model.
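As an illustration of this module structure, the sketch below builds a schematic student network with the three layers named here; the concrete layer types, dimensions and class count are placeholders assumed for the example, not the structure specified by the patent.

```python
import torch
import torch.nn as nn

class StudentLipReader(nn.Module):
    """Schematic student model: feature extraction -> feature mapping -> output layer."""
    def __init__(self, num_classes=500, feat_dim=256):
        super().__init__()
        # Feature extraction layer over the video frame sequence (T frames).
        self.feature_extraction = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 1, 1)),
            nn.Flatten())
        # Feature mapping layer producing the video output vector.
        self.feature_mapping = nn.Linear(32, feat_dim)
        # Output layer producing the lip language recognition result (class scores).
        self.output = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):                 # frames: (B, 1, T, H, W)
        video_features = self.feature_extraction(frames)
        video_output_vector = self.feature_mapping(video_features)
        return self.output(video_output_vector)

model = StudentLipReader()
scores = model(torch.randn(2, 1, 29, 112, 112))   # two 29-frame clips
print(scores.shape)                               # torch.Size([2, 500])
```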
In one embodiment, the processing apparatus 1600 of the lip language recognition model further includes a master recognition module, configured to input the training sample into the master model; process the video frame sequence in the training sample through a video processing network in the master model to obtain a first lip language recognition result; process the audio signal in the training sample through an audio processing network in the master model to obtain a second lip language recognition result; and obtain, through an audio-visual processing network in the master model, an audio-visual combined output vector based on the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal, and obtain a third lip language recognition result based on the audio-visual combined output vector.
In one embodiment, the master recognition module is further configured to input the video frame sequence in the training sample into the video processing network of the master model; extract video features corresponding to the video frame sequence through a feature extraction layer of the video processing network, obtain a video output vector according to the video features through a feature mapping layer of the video processing network, and obtain the first lip language recognition result according to the video output vector through an output layer of the video processing network.
In one embodiment, the master recognition module is further configured to input the audio signal in the training sample into the audio processing network of the master model; extract audio features corresponding to the audio signal through a feature extraction layer of the audio processing network, obtain an audio output vector according to the audio features through a feature mapping layer of the audio processing network, and obtain the second lip language recognition result according to the audio output vector through an output layer of the audio processing network.
In one embodiment, when the student model is used for word-level lip language recognition, the master recognition module is further configured to input the video output vector and the audio output vector into an audiovisual processing network of the master model; and cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
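A minimal sketch of this word-level cascade (concatenation) fusion is given below, assuming PyTorch; the vector dimensions and class count are placeholders.

```python
import torch
import torch.nn as nn

class AudioVisualHead(nn.Module):
    """Cascade layer + output layer of the audiovisual branch (word level).

    The video and audio output vectors are assumed to come from the video and
    audio processing networks of the master model.
    """
    def __init__(self, video_dim=256, audio_dim=256, num_classes=500):
        super().__init__()
        self.output = nn.Linear(video_dim + audio_dim, num_classes)

    def forward(self, video_vec, audio_vec):
        joint = torch.cat([video_vec, audio_vec], dim=-1)   # audiovisual combined output vector
        return self.output(joint)                           # third lip language recognition result

head = AudioVisualHead()
print(head(torch.randn(2, 256), torch.randn(2, 256)).shape)  # torch.Size([2, 500])
```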
In one embodiment, when the student model is used for sentence-level lip language recognition, the master recognition module is further configured to determine a feature vector of a previously output character; inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal into the audio-visual processing network of the master model; obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector through a multi-head attention coding layer of the audio-visual processing network; and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
In one embodiment, the temporary student model obtaining module 1604 is further configured to perform lip language recognition on the video frame sequence in the temporary training sample through the student model to obtain a temporary student recognition result, and construct a cross entropy loss according to the temporary student recognition result and the label data of the temporary training sample; construct a cross-modal fusion loss according to the temporary student recognition result and the first, second and third lip language recognition results obtained by the master model performing lip language recognition on the temporary training sample; and determine the temporary student loss according to the cross entropy loss and the cross-modal fusion loss.
In one embodiment, the temporary student model obtaining module 1604 is further configured to obtain a video output vector corresponding to the video frame sequence in the temporary training sample through a pre-trained video assistant network and encode the video output vector into a video preference coefficient; obtain an audio output vector corresponding to the audio signal in the temporary training sample through a pre-trained audio assistant network and encode the audio output vector into an audio preference coefficient; determine a first focus loss according to the temporary student recognition result and the first lip language recognition result, a second focus loss according to the temporary student recognition result and the second lip language recognition result, and a third focus loss according to the temporary student recognition result and the third lip language recognition result; and weight the first focus loss according to the video preference coefficient, weight the second focus loss according to the audio preference coefficient, and fuse the weighted first and second focus losses with the third focus loss to obtain the cross-modal fusion loss.
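For illustration, the sketch below shows one way such a weighted fusion could look, assuming PyTorch; KL divergence between soft predictions stands in for the focus losses, whose exact form is not reproduced here, and the preference coefficients are passed in as plain scalars.

```python
import torch
import torch.nn.functional as F

def cross_modal_fusion_loss(student_logits, video_logits, audio_logits, av_logits,
                            video_pref, audio_pref):
    """Weighted fusion of the three per-branch distillation losses (sketch)."""
    def branch_loss(teacher_logits):
        # Stand-in for the focus loss between student and branch predictions.
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        F.softmax(teacher_logits, dim=-1),
                        reduction="batchmean")

    loss_video = branch_loss(video_logits)    # first lip language recognition result
    loss_audio = branch_loss(audio_logits)    # second lip language recognition result
    loss_av = branch_loss(av_logits)          # third lip language recognition result
    # Weight the video/audio terms by their preference coefficients, then fuse.
    return video_pref * loss_video + audio_pref * loss_audio + loss_av

s = torch.randn(4, 500)
print(cross_modal_fusion_loss(s, torch.randn(4, 500), torch.randn(4, 500),
                              torch.randn(4, 500), 0.6, 0.4))
```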
In one embodiment, the temporary student model obtaining module 1604 is further configured to perform full-connection processing on the video output vector through a first full-connection layer in the cross-modal fusion network to obtain a video full-connection vector; perform full-connection processing on the audio output vector through a second full-connection layer in the cross-modal fusion network to obtain an audio full-connection vector; and connect the video full-connection vector and the audio full-connection vector in series through a third full-connection layer in the cross-modal fusion network and then perform full-connection processing to obtain the audio preference coefficient.
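A minimal sketch of this three-layer fusion step is given below, assuming PyTorch; the hidden size and the final sigmoid are assumptions added so the coefficient falls in (0, 1).

```python
import torch
import torch.nn as nn

class PreferenceCoefficientNet(nn.Module):
    """Three full-connection layers producing a preference coefficient (sketch)."""
    def __init__(self, video_dim=256, audio_dim=256, hidden=128):
        super().__init__()
        self.video_fc = nn.Linear(video_dim, hidden)    # first full-connection layer
        self.audio_fc = nn.Linear(audio_dim, hidden)    # second full-connection layer
        self.joint_fc = nn.Linear(2 * hidden, 1)        # third full-connection layer

    def forward(self, video_vec, audio_vec):
        v = self.video_fc(video_vec)
        a = self.audio_fc(audio_vec)
        joint = torch.cat([v, a], dim=-1)               # series connection of both vectors
        return torch.sigmoid(self.joint_fc(joint))      # preference coefficient in (0, 1)

net = PreferenceCoefficientNet()
print(net(torch.randn(2, 256), torch.randn(2, 256)).shape)  # torch.Size([2, 1])
```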
In one embodiment, master model training module 1606 is further configured to input the sequence of video frames in the validation sample into the student model; extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model; obtaining a video output vector according to the video characteristics through a characteristic mapping layer of the student model; obtaining a lip language recognition result according to the video output vector through an output layer of the student model; and constructing cross entropy loss according to the lip language identification result and the label data of the verification sample, and taking the cross entropy loss as the feedback loss of the student.
In an embodiment, the master model training module 1606 is further configured to input the master training sample into the master model, and obtain a corresponding first lip language recognition result, a second lip language recognition result, and a third lip language recognition result; determining first cross entropy loss according to the label data of the master training sample and the first lip language recognition result, determining second cross entropy loss according to the label data of the master training sample and the second lip language recognition result, determining third cross entropy loss according to the label data of the master training sample and the third lip language recognition result, and fusing the first cross entropy loss, the second cross entropy loss and the third cross entropy loss to obtain master recognition loss.
In one embodiment, the processing apparatus 1600 of the lip language recognition model further includes a student training module, configured to obtain a student training sample from the training samples; determining the loss of the student according to the result obtained by performing lip language recognition on the student training sample by the student model updated in the previous alternate training and the result obtained by performing lip language recognition on the student training sample by the master model updated in the current alternate training; and updating the student model updated by the previous alternate training according to the student loss, and then obtaining the student model updated by the current alternate training.
In one embodiment, the student training module is further configured to perform lip language recognition on the video frame sequence in the student training sample through the student model updated in the previous alternate training to obtain a student recognition result, and construct a cross entropy loss according to the student recognition result and the label data of the student training sample; construct a cross-modal fusion loss according to the student recognition result and the first, second and third lip language recognition results obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample; and determine the student loss according to the cross entropy loss and the cross-modal fusion loss.
In an embodiment, the processing apparatus 1600 of the lip language recognition model further includes a training sample selection module, configured to determine a learning difficulty coefficient corresponding to each training sample in the training samples; in the process of training the student model and the master model, according to the sequence of the learning difficulty coefficients from small to large, the student training samples and master training samples required by alternate training are sequentially selected from the training samples.
In one embodiment, the training sample selection module is further configured to process the video frame sequence in each training sample through a pre-trained video assistant network to obtain a video confidence of the lip language prediction category of each training sample; process the audio signal in each training sample through a pre-trained audio assistant network to obtain an audio confidence of the lip language prediction category of each training sample; and fuse the video confidence and the audio confidence to obtain the category confidence of each training sample, and determine the learning difficulty coefficient corresponding to each training sample according to the category confidence.
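As a simple illustration of turning the two confidences into a difficulty coefficient, consider the sketch below; the averaging of the confidences and the mapping "lower confidence, higher difficulty" are assumptions about details the text does not spell out.

```python
import torch

def learning_difficulty(video_confidence, audio_confidence):
    """Difficulty coefficient from the video/audio assistant-network confidences."""
    class_confidence = (video_confidence + audio_confidence) / 2.0   # assumed fusion
    return 1.0 - class_confidence     # harder samples get larger coefficients

# Samples are then sorted by ascending difficulty for curriculum selection.
video_conf = torch.tensor([0.9, 0.4, 0.7])
audio_conf = torch.tensor([0.8, 0.5, 0.6])
difficulty = learning_difficulty(video_conf, audio_conf)
order = torch.argsort(difficulty)     # easy-to-hard ordering of the samples
print(order)                          # tensor([0, 2, 1])
```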
In one embodiment, the processing apparatus 1600 of the lip language recognition model further includes a training sample number determining module, configured to determine, according to the current iteration number, a target sample number required by current alternate training, where the target sample number is gradually increased along with the iteration number; and acquiring training samples of the target sample number to perform the current alternate training.
In one embodiment, the processing apparatus 1600 of the lip language recognition model further includes a recognition module, configured to obtain a sequence of video frames to be recognized; inputting a video frame sequence to be recognized into a trained lip language recognition model; and outputting the speaking content corresponding to the speaker in the video frame sequence to be recognized after processing the video frame sequence to be recognized through a video processing network in the lip language recognition model.
Compared with the traditional approach of guiding the learning of a student model with a pre-trained teacher model, the above processing apparatus 1600 of the lip language recognition model trains not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into a student training stage and a master training stage that are trained alternately.
Specifically, in the master training stage, the student model updated in the previous alternate training is updated again using the temporary training samples to obtain a temporary student model, which serves as an auxiliary model that is continuously updated. The temporary student model feeds back the current learning state to the master model through the verification samples; that is, the student feedback loss guides the master model to adaptively adjust its teaching knowledge according to the feedback from the current lip language recognition task. In addition, the master model is also supervised by the master training samples, and its teaching content is adjusted through the master recognition loss determined from those samples. In other words, the supervision information during master model training consists of two parts: the student feedback loss, which reflects the current learning state of the student model, and the master recognition loss, which reflects the current teaching ability of the master model. Adjusting the master model updated in the previous alternate training according to these two losses improves the accuracy of the master model's teaching knowledge while allowing its teaching content to be adjusted flexibly and dynamically, thereby improving the overall knowledge distillation effect. Therefore, after the master model updated in the current alternate training is obtained, it can be used together with the training samples to train, in the student training stage, the student model updated in the previous alternate training; after many iterations, the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
In one embodiment, as shown in fig. 17, there is provided a lip language recognition model processing apparatus 1700, which may be a part of a computer device using a software module or a hardware module, or a combination of the two modules, and specifically includes: a sample acquisition module 1702, a tag loss construction module 1704, a cross-modal fusion loss construction module 1706, a student model update module 1708, and an iteration module 1710, wherein:
a processing device of a lip language recognition model comprises:
a sample obtaining module 1702, configured to obtain training samples and obtain a student model and an instructor model updated in a previous alternate training, where each training sample includes a sequence of video frames and a corresponding audio signal;
a label loss construction module 1704, configured to perform lip language identification on the video frame sequence in the student training sample obtained from the training sample according to the student model to obtain a student identification result, and construct cross entropy loss according to the student identification result and label data of the student training sample;
a trans-modal fusion loss construction module 1706, configured to construct a trans-modal fusion loss according to the student recognition result, a first lip language recognition result obtained by performing lip language recognition on the student training sample by using the video processing network in the master model, a second lip language recognition result obtained by performing lip language recognition on the student training sample by using the audio processing network in the master model, and a third lip language recognition result obtained by performing audio-visual processing network in the master model based on the video frame sequence and the audio signal;
a student model updating module 1708, configured to determine student loss according to cross entropy loss and cross-modal fusion loss; updating the updated student model of the previous alternate training according to the student loss, obtaining the updated student model of the current alternate training, and performing model training on the updated master model of the previous alternate training based on the updated student model of the current alternate training and the training samples to obtain the updated master model of the current alternate training;
and an iteration module 1710, configured to, based on the student model and the master model updated in the current alternate training, return to the step of obtaining the student model and the master model updated in the previous alternate training to continue the alternate training, and obtain the lip language recognition model according to the updated student model when the training stops.
Compared with the traditional approach of guiding the learning of a student model with a pre-trained teacher model, the above processing apparatus 1700 of the lip language recognition model trains not only the student model but also the model that guides the student model's learning, which is called the master model, so that the whole distillation process is divided into a student training stage and a master training stage that are trained alternately.
Specifically, in the student training stage, the student model constructs a cross entropy loss from the label data of the student training samples. In addition, the video processing network in the master model extracts knowledge of the video modality from the student training samples, the audio processing network of the master model extracts knowledge of the audio modality, and the audiovisual processing network of the master model extracts combined audiovisual knowledge; the cross-modal fusion loss obtained by fusing the knowledge of these three modalities enables the student model to learn from the master model the ability to mine multi-modal information. Guiding the training of the student model jointly by the cross entropy loss and the cross-modal fusion loss can greatly improve the learning effect of the student model. After the student model updated in the current alternate training is obtained, it can be used together with the training samples to perform model training, in the master training stage, on the master model updated in the previous alternate training; after many iterations, the recognition performance of the lip language recognition model obtained from the student model is greatly improved.
For specific definition of the processing device of the lip language recognition model, reference may be made to the above definition of the processing method of the lip language recognition model, and details are not described here. The modules in the processing device of the lip language recognition model can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in fig. 18. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a method of processing a lip language recognition model.
Those skilled in the art will appreciate that the architecture shown in fig. 18 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A processing method of a lip language recognition model is characterized by comprising the following steps:
acquiring training samples and acquiring a student model and an instructor model updated by previous alternate training, wherein each training sample comprises a video frame sequence and a corresponding audio signal;
determining temporary student loss according to results obtained by performing lip language recognition on temporary training samples obtained from the training samples respectively by the student model and the master model, and updating the student model based on the temporary student loss to obtain a temporary student model;
determining student feedback loss according to a result obtained by performing lip language recognition on a verification sample obtained from the training sample by the temporary student model and label data of the verification sample, and determining master recognition loss according to a result obtained by performing lip language recognition on a master training sample obtained from the training sample by the master model and label data of the master training sample;
obtaining an updated master model of the current alternate training according to the student feedback loss and the master identification loss, and performing model training on the student model updated by the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain an updated student model of the current alternate training;
and returning to the step of obtaining the student model and the master model updated by the previous alternate training to continue the alternate training based on the student model and the master model updated by the current alternate training, and obtaining a lip language recognition model according to the student model updated when the training is stopped.
2. The method of claim 1, wherein the step of the student model performing lip language recognition on the training sample comprises:
inputting a sequence of video frames in the training sample into the student model;
extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model;
obtaining a video output vector according to the video features through a feature mapping layer of the student model;
and obtaining a lip language recognition result according to the video output vector through an output layer of the student model.
3. The method of claim 1, wherein the step of lip language recognition of the training sample by the master model comprises:
inputting the training sample into the master model;
processing the video frame sequence in the training sample through a video processing network in the master model to obtain a first lip language recognition result;
processing the audio signal in the training sample through an audio processing network in the master model to obtain a second lip language recognition result;
and obtaining, by an audio-visual processing network in the master model, an audio-visual combined output vector based on the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal, and obtaining a third lip language recognition result based on the audio-visual combined output vector.
4. The method of claim 3, wherein the processing the sequence of video frames in the training sample through the video processing network in the master model to obtain a first lip language recognition result comprises:
inputting the sequence of video frames in the training sample into a video processing network of the master model;
extracting video features corresponding to the video frame sequence through a feature extraction layer of the video processing network, obtaining the video output vector according to the video features through a feature mapping layer of the video processing network, and obtaining a first lip language identification result according to the video output vector through an output layer of the video processing network.
5. The method of claim 3, wherein the processing the audio signal in the training sample through the audio processing network in the master model to obtain a second lip recognition result comprises:
inputting the audio signals in the training samples into an audio processing network of the master model;
and extracting audio features corresponding to the audio signals through a feature extraction layer of the audio processing network, obtaining audio output vectors according to the audio features through a feature mapping layer of the audio processing network, and obtaining second lip language recognition results according to the audio output vectors through an output layer of the audio processing network.
6. The method according to claim 3, wherein when the student model is used for word-level lip language recognition, obtaining, by an audiovisual processing network in the master model, an audiovisual combined output vector based on a video output vector obtained by the video processing network from the video frame sequence and an audio output vector obtained by the audio processing network from the audio signal, and obtaining a third lip language recognition result based on the audiovisual combined output vector comprises:
inputting the video output vector and the audio output vector into an audio-visual processing network of a master model;
and cascading the video output vector and the audio output vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
7. The method according to claim 3, wherein when the student model is used for sentence-level lip language recognition, obtaining, by an audiovisual processing network in the master model, an audiovisual combined output vector based on a video output vector obtained by the video processing network from the video frame sequence and an audio output vector obtained by the audio processing network from the audio signal, and obtaining a third lip language recognition result based on the audiovisual combined output vector comprises:
determining a feature vector of a previously output character;
inputting the feature vector of the previous output character, the video output vector obtained by the video processing network according to the video frame sequence and the audio output vector obtained by the audio processing network according to the audio signal into an audio-visual processing network of a master model;
obtaining a video coding vector and an audio coding vector according to the feature vector, the video output vector and the audio output vector through a multi-head attention coding layer of the audio-visual processing network;
and cascading the video coding vector and the audio coding vector through a cascading layer of the audio-visual processing network to obtain an audio-visual combined output vector, and obtaining a third lip language identification result according to the audio-visual combined output vector through an output layer of the audio-visual processing network.
8. The method according to claim 1, wherein determining the tentative student loss according to the results of lip language recognition on the tentative training samples obtained from the training samples by the student model and the master model respectively comprises:
performing lip language recognition on the video frame sequence in the temporary training sample through the student model to obtain a temporary student recognition result, and constructing cross entropy loss according to the temporary student recognition result and label data of the temporary training sample;
constructing a cross-modal fusion loss according to the temporary student recognition result, the first lip language recognition result, the second lip language recognition result and the third lip language recognition result obtained by lip language recognition of the temporary training sample by the master model;
and determining the temporary student loss according to the cross entropy loss and the cross modal fusion loss.
9. The method according to claim 8, wherein the constructing a cross-modal fusion loss according to the provisional student recognition result, the first lip recognition result, the second lip recognition result and the third lip recognition result obtained by performing lip recognition on the provisional training sample by the master model comprises:
after video output vectors corresponding to the video frame sequences in the temporary training samples are obtained through a pre-trained video assistant network, the video output vectors are coded into video preference coefficients;
after an audio output vector corresponding to the audio signal in the temporary training sample is obtained through a pre-trained audio assistant network, the audio output vector is encoded into an audio preference coefficient;
determining a first focus loss according to the temporary student recognition result and the first lip language recognition result, determining a second focus loss according to the temporary student recognition result and the second lip language recognition result, and determining a third focus loss according to the temporary student recognition result and the third lip language recognition result;
and weighting the first focus loss according to the video preference coefficient, weighting the second focus loss according to the audio preference coefficient, and fusing with the third focus loss to obtain the cross-modal fusion loss.
10. The method of claim 9, wherein encoding the audio output vector into the audio preference coefficient comprises:
performing full-connection processing on the video output vector through a first full-connection layer in a cross-modal fusion network to obtain a video full-connection vector;
performing full-connection processing on the audio output vector through a second full-connection layer in the cross-modal fusion network to obtain an audio full-connection vector;
and connecting the video full-connection vector and the audio full-connection vector in series through a third full-connection layer in the cross-modal fusion network, and then performing full-connection processing to obtain an audio preference coefficient.
11. The method according to claim 1, wherein the determining the student feedback loss according to the result of lip language recognition on the verification sample obtained from the training sample and the tag data of the verification sample by the provisional student model comprises:
inputting a sequence of video frames in the validation sample into the student model;
extracting video features corresponding to the video frame sequence through a feature extraction layer of the student model;
obtaining a video output vector according to the video features through a feature mapping layer of the student model;
obtaining a lip language recognition result according to the video output vector through an output layer of the student model;
and constructing cross entropy loss according to the lip language identification result and the label data of the verification sample, and taking the cross entropy loss as the feedback loss of the student.
12. The method according to claim 1, wherein determining the master recognition loss according to the result obtained by performing lip language recognition on the master training sample obtained from the training sample by the master model and the label data of the master training sample comprises:
inputting the master training sample into a master model to obtain a corresponding first lip language recognition result, a second lip language recognition result and a third lip language recognition result;
determining a first cross entropy loss according to the label data of the master training sample and the first lip language recognition result, determining a second cross entropy loss according to the label data of the master training sample and the second lip language recognition result, determining a third cross entropy loss according to the label data of the master training sample and the third lip language recognition result, and fusing the first cross entropy loss, the second cross entropy loss and the third cross entropy loss to obtain the master recognition loss.
13. The method according to claim 1, wherein model training the updated student model of the previous alternate training based on the updated master model of the current alternate training and the training samples to obtain the updated student model of the current alternate training comprises:
obtaining student training samples from the training samples;
according to the result obtained by performing lip language recognition on the student training sample by the student model updated in the previous alternate training and the result obtained by performing lip language recognition on the student training sample by the master model updated in the current alternate training, determining student loss;
and updating the student model updated in the previous alternate training according to the student loss, and then obtaining the student model updated in the current alternate training.
14. The method according to claim 13, wherein determining the student loss according to the result obtained by the student model updated in the previous alternate training performing lip language recognition on the student training sample and the result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample comprises:
performing lip language recognition on the video frame sequence in the student training sample through the student model updated in the previous alternate training to obtain a student recognition result, and constructing a cross entropy loss according to the student recognition result and the label data of the student training sample;
constructing a cross-modal fusion loss according to the student recognition result and a first lip language recognition result, a second lip language recognition result and a third lip language recognition result obtained by the master model updated in the current alternate training performing lip language recognition on the student training sample;
and determining the student loss according to the cross entropy loss and the cross-modal fusion loss.
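The student loss of claims 13 and 14 combines a supervised cross entropy term with a cross-modal fusion loss built from the master model's three predictions on the student training sample. The sketch below assumes the fusion loss is a temperature-scaled KL divergence to each master prediction and assumes the master model accepts the same frames as the student; the claims do not fix these details, so this is only an illustration.

```python
import torch.nn.functional as F

def student_loss(student_model, master_model, frames, labels, temperature=2.0):
    student_logits = student_model(frames)              # student recognition result
    ce = F.cross_entropy(student_logits, labels)        # cross entropy with the label data

    # master predictions are detached so only the student is updated in the student training stage
    master_outputs = [o.detach() for o in master_model(frames)]
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    fusion = sum(F.kl_div(log_p, F.softmax(o / temperature, dim=-1), reduction="batchmean")
                 for o in master_outputs)               # assumed form of the cross-modal fusion loss

    return ce + fusion
```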
15. The method of claim 1, further comprising:
determining a learning difficulty coefficient corresponding to each training sample in the training samples;
and in the process of training the student model and the master model, sequentially selecting the student training samples and master training samples required for alternate training from the training samples in ascending order of the learning difficulty coefficients.
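Claim 15 describes an easy-to-hard curriculum over the training samples. A minimal sketch, assuming the learning difficulty coefficients have already been computed (the claim does not specify how), with illustrative function and argument names:

```python
def order_by_difficulty(samples, difficulty_coefficients):
    # sort so that samples with smaller learning difficulty coefficients are used first
    ranked = sorted(zip(samples, difficulty_coefficients), key=lambda pair: pair[1])
    return [sample for sample, _ in ranked]
```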
CN202110703815.6A 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium Active CN113822125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703815.6A CN113822125B (en) 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113822125A true CN113822125A (en) 2021-12-21
CN113822125B CN113822125B (en) 2024-04-30

Family

ID=78924039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703815.6A Active CN113822125B (en) 2021-06-24 2021-06-24 Processing method and device of lip language recognition model, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113822125B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202005017816U1 (en) * 2005-11-10 2006-01-05 Menzel, Bernhard Impression post to be used in preparation of tooth implant, designed in conical shape with recessed areas
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
CN111223483A (en) * 2019-12-10 2020-06-02 浙江大学 Lip language identification method based on multi-granularity knowledge distillation
CN111639744A (en) * 2020-04-15 2020-09-08 北京迈格威科技有限公司 Student model training method and device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115128591A (en) * 2022-06-08 2022-09-30 中国地质环境监测院(自然资源部地质灾害技术指导中心) Debris flow monitoring radar parameter verification method
CN115128591B (en) * 2022-06-08 2023-06-20 中国地质环境监测院(自然资源部地质灾害技术指导中心) Debris flow monitoring radar parameter verification method
CN115601575A (en) * 2022-10-25 2023-01-13 扬州市职业大学(扬州开放大学) Method and system for assisting persons with aphasia and agraphia in expressing commonly used phrases
CN115601575B (en) * 2022-10-25 2023-10-31 扬州市职业大学(扬州开放大学) Method and system for assisting persons with aphasia and agraphia in expressing commonly used phrases
CN117831138A (en) * 2024-03-05 2024-04-05 天津科技大学 Multi-mode biological feature recognition method based on third-order knowledge distillation
CN117831138B (en) * 2024-03-05 2024-05-24 天津科技大学 Multi-mode biological feature recognition method based on third-order knowledge distillation

Also Published As

Publication number Publication date
CN113822125B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
Han et al. Memory-augmented dense predictive coding for video representation learning
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN112487182B (en) Training method of text processing model, text processing method and device
Keneshloo et al. Deep reinforcement learning for sequence-to-sequence models
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN111930992B (en) Neural network training method and device and electronic equipment
CN108733742B (en) Global normalized reader system and method
US11862145B2 (en) Deep hierarchical fusion for machine intelligence applications
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
Richard et al. A bag-of-words equivalent recurrent neural network for action recognition
KR20210124901A (en) A method for training a convolutional neural network for image recognition using image-conditioned masked language modeling
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
US11900518B2 (en) Interactive systems and methods
CN116171473A (en) Bimodal relationship network for audio-visual event localization
Lai Contrastive predictive coding based feature for automatic speaker verification
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
CN113822018B (en) Entity relation joint extraction method
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115130461A (en) Text matching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant