CN116484885A - Visual language translation method and system based on contrast learning and word granularity weight - Google Patents

Visual language translation method and system based on contrast learning and word granularity weight

Info

Publication number
CN116484885A
Authority
CN
China
Prior art keywords
word
visual
meta
cross
decoder
Prior art date
Legal status
Pending
Application number
CN202310461929.3A
Other languages
Chinese (zh)
Inventor
赵洲
李林峻
成曦泽
金涛
王晔
林旺
陈哲乾
Current Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd and Zhejiang University ZJU
Priority to CN202310461929.3A
Publication of CN116484885A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0499: Feedforward networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a visual language translation method and system based on contrast learning and word granularity weight, belonging to the field of temporally aligned visual language translation. Lip language or finger language video embedding features and text embedding features of a source domain are extracted. The video embedding features are first encoded and then interact with the text embedding features through a multi-head attention mechanism; word probability distributions are generated by decoding, and preliminary training is performed based on a task cross entropy loss function term. Word granularity diversity weights of the words are calculated from the decoded attention vectors. The source domain is randomly divided into a meta-training set and a meta-test set, a contrast-limited meta-learning training strategy is adopted, and the diversity-aware weights steer the learning direction of the model so as to train an encoder and decoder with generalization capability. The visual language translation task for unseen expressive persons is completed with the trained visual encoder and cross-modal decoder. The invention improves generalization to out-of-domain expressive persons and effectively improves the visual language translation effect.

Description

Visual language translation method and system based on contrast learning and word granularity weight
Technical Field
The invention relates to the field of temporally aligned visual language translation, and in particular to a visual language translation method and system based on contrast learning and word granularity weight.
Background
Temporally aligned visual language translation aims to translate the visual content exhibited by an expressive person into natural language text, and is an interdisciplinary area between computer vision and natural language processing. Specifically, the field includes important artificial intelligence tasks such as lip language translation and finger language translation, and automatic visual language translators help hearing-impaired people communicate with hearing people. In lip language translation, the speech content is translated from the lip movements of the speaker; in finger language translation, a text sequence is translated from the fine-grained hand gestures of the finger speller. What lip language and finger language translation have in common is that the visual content and the translated natural language text are temporally aligned.
The prior art mainly generates word sequences with autoregressive methods or connectionist temporal classification (CTC) methods. However, these methods perform poorly in real scenes, because different expressive persons have widely varying performance habits on specific ambiguous words. In addition, the variability among expressive persons is more pronounced when data resources are limited or labeling costs are expensive. Ideally, a practical temporally aligned visual language translation system should possess excellent generalization capability and give high translation accuracy on unseen expressive persons.
Currently, the domain generalization task has made breakthrough progress; its purpose is to train, with a source domain of limited data, a model that can generalize directly to an unseen target domain. The prior art is mainly divided into three types of methods: representation learning, data manipulation, and learning strategies. Because the temporally aligned visual language translation task requires the model to generalize across speakers, translation accuracy can be effectively improved if the methods of the domain generalization task are utilized while the characteristics of the visual language translation task are taken into account.
Disclosure of Invention
The invention aims to enhance the generalization capability of a temporally aligned visual language translation system on out-of-domain expressive persons, so as to overcome the inherent ambiguity among words, maintain the semantic relations among classes, and promote the domain independence of the model. The invention provides a visual language translation method and system based on contrast learning and word granularity weight, which translate lip language and finger language videos into natural language text.
The invention adopts the specific technical scheme that:
in a first aspect, the present invention provides a visual language translation method based on contrast learning and word granularity weights, comprising the steps of:
1) Extracting lip language or finger language video embedded features of a source domain, and acquiring natural language text embedded features;
2) Encoding the lip language or finger language video embedded features by a visual encoder based on a multi-head attention mechanism to obtain coded visual features;
3) For the coded visual features, interacting with natural language text embedded features in a cross-modal decoder through a multi-head attention mechanism, decoding to generate word probability distribution, and obtaining cross entropy loss function items based on tasks, and performing preliminary training on the visual encoder and the cross-modal decoder;
4) Calculating the word granularity diversity weight of the word according to decoded attention vectors obtained by the vision encoder and the cross-modal decoder which are preliminarily trained, and updating the cross entropy loss function item in the step 3) according to the word granularity diversity weight to obtain a cross entropy loss item with the word granularity diversity weight function;
5) Randomly dividing a source domain into a meta training set and a meta testing set, and updating parameters of a visual encoder and a cross-modal decoder by using cross entropy loss items with the function of word granularity diversity weight in a meta training stage; in the meta-test stage, the attention vector calculated in the cross-modal decoder updated in the meta-test stage is utilized to acquire global and local contrast learning loss function items, and the cross entropy loss items with the function of word granularity diversity weight are combined to update the parameters of the visual encoder and the cross-modal decoder again, so that the trained visual encoder and the trained cross-modal decoder are obtained;
6) And acquiring the video embedded characteristics of the lip language or the finger language to be translated in the target domain, and completing the visual language translation task of the unseen expressive person by using the trained visual encoder and the cross-modal decoder.
Further, the lip language or finger language video of the source domain is time aligned with the corresponding natural language text.
Furthermore, the visual encoder has a multi-layer structure, each layer being formed by stacking self-attention layers, feedforward neural networks, residual connections and layer normalization operations, and is used for encoding the lip language or finger language video embedded features and generating a coded video feature matrix.
Furthermore, the cross-modal decoder has a multi-layer structure, each layer being formed by stacking self-attention layers, mutual attention layers, feedforward neural networks, residual connections and layer normalization operations, and is used for predicting the word of the next time step according to the natural language text embedded features and the coded video feature matrix.
Further, the step 4) includes:
4.1) Inputting the embedded features of the source domain data into the visual encoder and the cross-modal decoder obtained by the preliminary training in step 3), and recording the calculation result of the mutual attention layer SA(E′_t, F′, F′) of the last layer of the cross-modal decoder as U^k = {u_1, u_2, …, u_{T_t}}, where T_t is the number of words before time step t and u_r ∈ ℝ^d denotes the attention vector of the k-th expressive person;
the personalized expression effect of the kth expressive person on the word c is obtained through calculation according to the following formula:
wherein, the liquid crystal display device comprises a liquid crystal display device,indicating +.A when the r-th word is c>Is the number of samples with word label c, +.>An expression vector representing the kth expressive person on word c;
4.2) According to the expression vectors of different expressive persons on the same word, the variance across expressive persons is calculated as follows:
v_c = σ( (1/K) Σ_{k=1..K} (e_c^k - ē_c)^2 ),  with ē_c = (1/K) Σ_{k=1..K} e_c^k
where K is the number of expressive persons in the source domain, σ(·) denotes the nonlinear activation function, and v_c denotes the diversity weight vector of the word c; the complete word granularity diversity weight matrix is obtained by concatenating the weights of every word and is expressed as V = [v_1, v_2, …, v_{T_c}], where T_c denotes the length of the vocabulary;
4.3) Updating the cross entropy loss function term according to the word granularity diversity weight matrix obtained in step 4.2): the diversity weights are applied to the cross entropy loss through a bitwise multiplication operation between vectors at word granularity, yielding the cross entropy loss function term with the word granularity diversity weight, denoted L_wce.
Further, the step 5) includes:
5.1) Randomly dividing the source domain data containing K domains, in each training round, into a meta-training set D_tr = {D_i | i ∈ [1, N_tr]} and a meta-test set D_te = {D_j | j ∈ [1, N_te]}, where N_tr denotes the data volume of the meta-training set, D_i denotes the i-th data in the meta-training set, N_te denotes the data volume of the meta-test set, and D_j denotes the j-th data in the meta-test set;
5.2) In the meta-training stage, calculating the cross entropy loss term L_wce with the word granularity diversity weight and updating the parameters:
θ′ = θ - α ∇_θ L_wce
where θ denotes all trainable parameters of the visual encoder and the cross-modal decoder, α denotes the learning rate of the meta-training stage, θ′ denotes the updated parameters, and ∇_θ denotes the gradient;
5.3) In the meta-test stage, feeding the data pairs of the meta-test set into the model whose parameters have been updated to θ′, and calculating the global and local contrast learning loss function terms using the attention vectors calculated in the decoder;
5.4) Combining the global and local contrast learning loss function terms of step 5.3) and the cross entropy loss term with the word granularity diversity weight of step 5.2) to obtain the total loss, and updating the parameters according to the total loss:
θ ← θ - β ∇_θ L_total
where β denotes the learning rate, θ denotes all trainable parameters of the visual encoder and the cross-modal decoder, and L_total denotes the total loss.
Further, the calculation of the global contrast learning loss function term includes:
For a particular expressive person k, the expression vector e_c^k on a particular word obtained in step 4.1) is used to calculate the probability distribution of the word, and the global contrast learning loss function term is then obtained as follows:
p_c^k = softmax(e_c^k / τ),  L_gcl = (1/N_o) Σ_{o=1..N_o} Σ_{c=1..T_c} H(p_c^{D_i} ‖ p_c^{D_j})
where softmax(·) denotes the activation function, T_c denotes the number of words decoded before time step t, τ denotes the temperature coefficient, p_c^k denotes the probability distribution of the video-sentence pairs of expressive person k over the word c, N_o is the number of (D_i, D_j) pairs formed by a domain D_i in the meta-training set and a domain D_j in the meta-test set, p_c^{D_i} denotes the probability distribution on the word c of the domain D_i in the o-th domain pair, L_gcl denotes the global contrast learning loss function term, and H(·‖·) denotes the relative entropy.
Further, the calculation of the local contrast learning loss function term includes:
The word granularity attention vectors U^k obtained in step 4.1) over the whole source domain data are combined pairwise to obtain a number of sample pairs (x_b, y_b) constituting a set A, and the local contrast learning loss function term is calculated as follows:
L_lcl = (1/N_b) Σ_{b=1..N_b} [ (1 - Y) · ρ(x_b, y_b)^2 + Y · max(0, ξ - ρ(x_b, y_b))^2 ]
where ρ(·) denotes a distance function, (x_b, y_b) denotes the b-th sample pair in the set A, N_b is the number of sample pairs, and Y = 1 - [x_l = y_l], i.e., when x_l = y_l, [x_l = y_l] = 1 and Y = 0, otherwise [x_l = y_l] = 0 and Y = 1; x_l and y_l are the respective labels of the sample pair x_b and y_b, ξ denotes a coefficient controlling the amplitude of the distance between the two samples, and L_lcl denotes the local contrast learning loss function term.
In a second aspect, the present invention provides a visual language translation system based on contrast learning and word granularity weights, comprising:
the lip language or finger language video preprocessing module is used for extracting the lip language or finger language video embedding characteristics of the source domain;
the natural language text preprocessing module is used for acquiring the natural language text embedding characteristics;
the visual encoder module is used for encoding the lip language or finger language video embedded features to obtain encoded visual features;
the cross-modal decoder module is used for interacting the coded visual characteristics with the natural language text embedded characteristics through a multi-head attention mechanism, and decoding to generate word probability distribution; in the actual translation stage, generating a target natural language text according to the coded visual embedded feature vector in an autoregressive manner;
a pre-training module for performing a preliminary training of the visual encoder and the cross-modal decoder based on the task cross-entropy loss function term;
the word granularity weight calculation module is used for calculating the word granularity diversity weights of the words according to the decoded attention vectors obtained by the preliminarily trained visual encoder and cross-modal decoder;
The contrast-limited meta-learning training module is used for randomly dividing a source domain into a meta-training set and a meta-testing set, and updating the parameters of a visual encoder and a cross-modal decoder by using cross entropy loss items with word granularity diversity weight functions in a meta-training stage; in the meta-test stage, the attention vector calculated in the cross-modal decoder updated in the meta-test stage is utilized to acquire global and local contrast learning loss function items, and the cross entropy loss items with the function of word granularity diversity weight are combined to update the parameters of the visual encoder and the cross-modal decoder again, so that the trained visual encoder and the trained cross-modal decoder are obtained;
and the translation module is used for acquiring the video embedded characteristics of the lip language or the finger language to be translated in the target domain, and completing the visual language translation task of the unseen expressive person by utilizing the trained visual encoder and the cross-modal decoder.
Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a visual language translation method based on contrast learning and word granularity weight, which uses a diversity attention mechanism guided by the word granularity weight and a meta learning training strategy of contrast limitation when in implementation.
(1) Through the proposed meta-learning training framework with word granularity diversity weights and contrast limitation, the invention improves the generalization capability of a temporally aligned visual translation system on out-of-domain expressive persons and clarifies the generalization learning direction of the model in sequence prediction, thereby overcoming the inherent ambiguity among words, adapting to the personalized expression habit differences of expressive persons, and realizing efficient visual language translation.
(2) For words with inherent ambiguity that are difficult to identify, the invention provides a word granularity weight calculation module, which uses the attention vectors produced by the decoder interaction to calculate word granularity diversity weights reflecting the learning difficulty of each word, and then guides the model to pay more attention to the more difficult words through this difficulty coefficient.
(3) In the contrast-limited meta-learning training stage, a global contrast loss function calculated from the personalized feature vectors of expressive persons on specific words maintains the semantic relations among classes, while a local contrast loss function improves independence from the expressive person, enhancing the generalization capability of the model on unseen expressive persons and thereby realizing efficient temporally aligned visual language translation.
In summary, through the word granularity diversity weights and the two complementary contrast-limited meta-learning training strategies, the invention can eliminate the inherent ambiguity of words while maintaining the inter-class relations of words in lip language and finger language translation, improve the generalization capability of the model, and achieve high-accuracy translation on unseen expressive persons, thereby realizing efficient temporally aligned visual language translation.
Drawings
FIG. 1 is a schematic overall framework of the lip or finger translation method of the present invention, wherein < bos > represents a start symbol;
FIG. 2 is a schematic diagram of a specific flow of the word granularity weight calculation module of the present invention;
FIG. 3 is a schematic diagram of a meta training phase in the meta learning training strategy of the present invention.
Detailed Description
The invention is further illustrated and described below with reference to the drawings and detailed description. For ease of illustration, the encoder and decoder are each simplified to a single layer, with the complete encoder and decoder being stacked from multiple layers as shown in fig. 1.
The invention discloses a visual language translation method based on contrast learning and word granularity weight, which comprises the following steps:
step 1, extracting lip language or finger language video embedded features, and acquiring natural language text embedded features;
Step 2, dividing the complete lip language or finger language data into a source domain and a target domain, and ensuring that the same expressive person does not appear in both the source domain and the target domain.
Step 3, encoding the lip language or finger language video embedded features by a visual encoder using a multi-head attention mechanism to obtain coded visual features;
step 4, for the coded visual features, interacting with the natural language text embedded feature vectors through a multi-head attention mechanism, decoding to generate word probability distribution, acquiring cross entropy loss function items based on tasks, and performing preliminary training;
Step 5, calculating the word granularity diversity weight of the words according to decoded attention vectors obtained by calculation of the initially trained visual encoder and the cross-modal decoder, and updating the cross entropy loss function item in the step 4 according to the word granularity diversity weight;
step 6, randomly dividing a source domain into a meta training set and a meta testing set, acquiring a global and local contrast learning loss function item by using the attention vector calculated in the decoder, and calculating the cross entropy loss function item updated in the step 5;
Step 7, combining the loss function terms calculated in step 6 to obtain the final complete loss function, and completing the lip language and finger language translation task through the trained task model.
In one embodiment of the present invention, the above step 1 of obtaining the video embedded features and the natural language text embedded features may be implemented as follows:
for each given sequence of framesComponent video (lip language or finger language video), s i Representing the ith frame, T s Is the number of frames in the frame sequence; extracting video feature matrix->d represents the video feature dimension. In this embodiment, a pretrained self-supervision learning method, such as an AV-HuBERT network, is used to extract a video feature matrix of the lip language video, and a res net50 network is used to extract a video feature matrix of the finger language video.
For each given natural language sentence, its text sequence is denoted L = {l_1, l_2, …, l_{T_l}}, where l_j denotes the j-th word or letter and T_l is the number of words, with T_l ≤ T_s; the text embedding feature E ∈ ℝ^{T_l×d} is obtained. The frame sequence S of the video is semantically aligned with the natural language text sequence L in time, and the goal of visual language translation is to translate the given frame sequence S into the natural language sentence L.
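As an illustration, this feature extraction step can be sketched as follows; the ResNet-50 backbone and the token embedding table are generic stand-ins for the pretrained AV-HuBERT and ResNet-50 extractors named above, and the feature dimension and vocabulary size are assumptions of the sketch.

```python
# Sketch of step 1: per-frame video features F (T_s x d) and text embeddings E (T_l x d).
# The backbone and embedding table below are illustrative stand-ins, not the exact
# pretrained AV-HuBERT / ResNet-50 extractors of the embodiment.
import torch
import torch.nn as nn
from torchvision.models import resnet50

d = 512                                    # assumed feature dimension
backbone = resnet50()                      # stand-in visual backbone
backbone.fc = nn.Linear(2048, d)           # project per-frame features to dimension d
token_embed = nn.Embedding(num_embeddings=1000, embedding_dim=d)   # assumed vocabulary size

def extract_video_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T_s, 3, H, W) -> F: (T_s, d)"""
    with torch.no_grad():
        return backbone(frames)

def extract_text_features(token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (T_l,) -> E: (T_l, d)"""
    return token_embed(token_ids)

F = extract_video_features(torch.randn(75, 3, 112, 112))   # e.g. a 75-frame video
E = extract_text_features(torch.randint(0, 1000, (6,)))    # e.g. a 6-word sentence
```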
In one embodiment of the present invention, the specific method for dividing the complete lip language or finger language data in the step 2 is as follows:
for each expressive, i.e., labial or whisper, the data of all pairs of video and natural language sentences thereof can be considered as a field, expressed asN k Representing the number of video-sentence pairs for the kth domain. Thus, the complete K domains can be expressed as +.>Dividing the complete lip or finger language data into a source domain and a target domain, wherein the source domain can be expressed as +.>The target domain may be expressed as +.>The source domain and the target domain are strictly divided according to the expressive person and guaranteed to be in the source domain +.>The over-expressed person appearing in (a) will not be in the target domain +.>Is to ensure ∈>With the above-mentioned domain division, all pairs of video and natural language sentences are numbered 1,2, … N sr +N tg The source domain is used as a training set and is expressed as T sr ={S m ,L m |m∈[1,N sr ]Target domain as test set, denoted as T tg ={S m ,L m |m∈[N sr +1,N sr +N tg ]N, where N sr And N tg The data volume in the training set and the test set respectively, S m ,L m Representing the mth video-sentence pair.
In one embodiment of the present invention, the encoder has a multi-layer structure. The flow of the visual encoder composed of the multi-head attention mechanism described in the above step 3 is shown in FIG. 1; one encoder layer is formed by stacking a self-attention layer, a feedforward neural network, residual connections and a layer normalization operation. The implementation process is as follows:
3.1) The multi-head attention mechanism is obtained by combining single-head attention mechanisms. When a single-head attention mechanism is calculated, the single-head attention matrix h_i is computed from the input query matrix, the original key matrix, the original value matrix and their corresponding mapping matrices; with different mapping matrix parameters, h single-head attention matrices h_i can be obtained, which are concatenated and multiplied by a learnable parameter matrix to compute the multi-head attention. In this embodiment, the video feature matrix F obtained in step 1 is used as the query matrix, the original key matrix and the original value matrix, and the calculation of the multi-head attention mechanism is expressed as:
MHA(F, F, F) = Concat(h_1, h_2, …, h_h) W_1
where MHA(·) denotes the multi-head attention calculation function, Concat(·) denotes the concatenation operation on the single-head attention matrices, W_1 denotes a learnable parameter matrix, and d denotes the video feature dimension.
3.2 Constructing a self-attention layer based on a multi-head attention mechanism, expressed as follows:
SA(F)=MHA(F,F,F)
where SA(·) denotes the self-attention layer.
3.3) The self-attention layer SA(·) of step 3.2) and the video feature matrix F obtained in step 1 are used to obtain the coded video feature matrix F′ through repeated residual connections, layer normalization operations and a feedforward neural network; the specific calculation process is as follows:
X = LN(F + SA(F))
FFN(X) = W_3 σ(W_2 X)
F′ = LN(X + FFN(X))
where LN(·) denotes the layer normalization operation, X denotes the intermediate result obtained within the encoder layer, FFN(·) denotes the feedforward neural network, σ(·) denotes the nonlinear activation function, W_2 and W_3 are learnable parameter matrices, and F′ ∈ ℝ^{T_s×d} denotes the obtained coded video feature matrix.
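For concreteness, one such encoder layer can be sketched in PyTorch as below; this is a minimal sketch in which the number of heads, the hidden width and the ReLU nonlinearity are assumptions, not details given in the text.

```python
# Sketch of one visual encoder layer from step 3:
#   X = LN(F + SA(F));  F' = LN(X + FFN(X))
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d: int = 512, heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        # FFN(X) = W_3 sigma(W_2 X); ReLU is assumed for the nonlinearity sigma
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, F_in: torch.Tensor) -> torch.Tensor:
        """F_in: (batch, T_s, d) video embedded features -> encoded features F'."""
        sa, _ = self.self_attn(F_in, F_in, F_in)   # SA(F) = MHA(F, F, F)
        X = self.ln1(F_in + sa)                    # residual connection + layer norm
        return self.ln2(X + self.ffn(X))           # feed-forward sublayer
```

The complete encoder stacks several such layers, as stated above.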
In one embodiment of the invention, the decoder has a multi-layer structure; one decoder layer is formed by stacking a self-attention layer, a mutual attention layer, a feedforward neural network, residual connections and a layer normalization operation. The implementation process is as follows:
4.1) For the text embedding feature E obtained in step 1, the text embedding features before time step t are denoted E_t ∈ ℝ^{T_t×d}, where T_t is the number of words decoded before time step t; E_t is then updated by the self-attention layer constructed in step 3.2), and the calculation process is as follows:
E′_t = LN(E_t + SA(E_t))
where SA(·) denotes the self-attention layer, LN(·) denotes the layer normalization operation, E_t denotes the text embedding features, and E′_t denotes the text embedding features updated via the self-attention layer.
4.2) E′_t obtained in step 4.1) interacts with the coded visual features F′ obtained in step 3 through a mutual attention layer, in which E′_t serves as the query of the multi-head attention mechanism and F′ serves as its original key and original value; the calculation process is as follows:
I_t = LN(E′_t + SA(E′_t, F′, F′))
I′_t = LN(I_t + FFN(I_t))
where FFN(·) denotes the feedforward neural network, I_t denotes the intermediate result of the mutual attention layer, and I′_t denotes the decoded output.
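Similarly, one cross-modal decoder layer can be sketched as below, mirroring the equations above; the causal mask on the text prefix and the hyperparameter values are assumptions of this sketch.

```python
# Sketch of one cross-modal decoder layer from step 4:
#   E'_t = LN(E_t + SA(E_t));  I_t = LN(E'_t + SA(E'_t, F', F'));  I'_t = LN(I_t + FFN(I_t))
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d: int = 512, heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, E_t: torch.Tensor, F_enc: torch.Tensor):
        """E_t: (batch, T_t, d) text prefix; F_enc: (batch, T_s, d) encoded video."""
        sa, _ = self.self_attn(E_t, E_t, E_t)          # self-attention on the text prefix
        E_p = self.ln1(E_t + sa)                       # E'_t
        ca, _ = self.cross_attn(E_p, F_enc, F_enc)     # mutual attention with the video
        I_t = self.ln2(E_p + ca)
        out = self.ln3(I_t + self.ffn(I_t))            # I'_t
        return out, ca   # 'ca' is the mutual-attention result later used as U^k
```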
4.3) Using the decoded output I′_t, the word probability distribution described in step 4 can be calculated by the following formula:
p_t = softmax(W_p I′_t + b_p)
where softmax(·) denotes the activation function, W_p and b_p are respectively a weight matrix and a bias vector, and p_t denotes the calculated probability distribution. The task-based cross entropy loss function term can then be calculated from the probability distribution by the following formula:
L_ce = - Σ_{t=1..T_l} log p_t(l_t | l_{<t}, F)
where l_t denotes the word at time step t, l_{<t} denotes the words before time step t, p_t(l_t | l_{<t}, F) denotes the probability of generating l_t given l_{<t} and F, Σ is the summation symbol, and L_ce denotes the obtained cross entropy loss function term.
In the preliminary training process, the lip language or finger language videos and the paired natural language sentences of the training set T_sr obtained by the division of step 2 are input into the model, and the same operations are performed on each pair of data. Specifically, the video frame sequence S and the natural language sentence L pass through step 1 to obtain their respective embedded feature matrices F and E; the video feature F is encoded according to step 3 to obtain the coded video feature F′; F′ interacts with the word sequence embedding feature E and is decoded to obtain the word distribution probability p_t, thereby yielding the task-based cross entropy loss function term L_ce. The encoder of step 3 and the decoder of step 4 are trained by minimizing this loss function term with a gradient descent learning method, in preparation for subsequent calculations.
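As an illustration, the preliminary training step can be sketched as follows, using PyTorch's built-in single transformer encoder and decoder layers as stand-ins for the visual encoder and cross-modal decoder described above; the vocabulary size, feature dimension and optimizer settings are assumptions.

```python
# Sketch of the preliminary training of step 4: encode video, decode against the text
# prefix, project to the vocabulary, and minimize the task cross entropy L_ce.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

d, V = 512, 1000                       # assumed feature dimension and vocabulary size
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
proj = nn.Linear(d, V)                 # W_p, b_p
params = list(encoder.parameters()) + list(decoder.parameters()) + list(proj.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def train_step(F_vid, E_txt, labels):
    """F_vid: (B, T_s, d) video features; E_txt: (B, T_l, d) shifted text embeddings
       (starting with <bos>); labels: (B, T_l) target word ids.
       The causal mask on the text side is omitted for brevity."""
    F_enc = encoder(F_vid)                              # coded visual features F'
    dec = decoder(E_txt, F_enc)                         # decoded output I'_t
    logits = proj(dec)                                  # p_t = softmax(logits)
    loss = Fn.cross_entropy(logits.reshape(-1, V), labels.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```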
In one embodiment of the present invention, the word granularity diversity weight in the step 5 may be obtained as follows:
5.1) The data of the training set T_sr are input into the model obtained by the preliminary training of step 4, and the calculation result of the multi-head attention mechanism MHA(E′_t, F′, F′) of the last layer in step 4.2) is recorded as U^k = {u_1, u_2, …, u_{T_t}}, where u_r ∈ ℝ^d, T_t is the number of words before time step t, and u_r denotes the attention vector of the k-th expressive person.
According to the vectors u_r, the personalized expression effect of the k-th expressive person on the word c can be calculated by the following formula:
e_c^k = (1/N_c) Σ_{r=1..T_t} 1[l_r = c] · u_r
where 1[l_r = c] indicates that the r-th word is c, N_c is the number of samples with word label c, and e_c^k denotes the expression vector of the k-th expressive person on the word c. According to the expression vectors of different expressive persons on the same word, their variance is calculated to reflect the ambiguity caused by diverse action habits, with the calculation formula:
v_c = σ( (1/K) Σ_{k=1..K} (e_c^k - ē_c)^2 ),  with ē_c = (1/K) Σ_{k=1..K} e_c^k
where K denotes the number of expressive persons, σ(·) denotes a nonlinear activation function such as the sigmoid function, and v_c denotes the diversity weight vector of the word c, which reflects the learning difficulty of the model on this word. The complete word granularity diversity weight matrix is obtained by concatenating the weights of every word and is denoted V = [v_1, v_2, …, v_{T_c}], where T_c denotes the length of the vocabulary.
5.2) The word granularity diversity weights obtained in step 5.1) are applied to the cross entropy loss function term of step 4.3) to obtain the updated cross entropy loss function term used in the meta-training stage, as shown in FIG. 2: the diversity weight matrix acts on the loss through a bitwise multiplication operation between vectors at word granularity, yielding the cross entropy loss function L_wce with the word granularity diversity weight, whose adjustment makes the model pay more attention to the words that are harder to learn.
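A minimal sketch of this computation is given below, assuming the per-word mutual-attention vectors u_r have already been collected for every expressive person and using a sigmoid as the nonlinear activation σ, as suggested above; reducing v_c to a scalar weight per token by averaging over the feature dimension is one simple choice, since the exact form of the word-granularity multiplication is not spelled out here.

```python
# Sketch of the word granularity diversity weights: expression vectors e_c^k per
# (expresser k, word c), variance across expressers, v_c = sigmoid(variance), and a
# diversity-weighted cross entropy.
import torch
import torch.nn.functional as Fn

def diversity_weights(attn, words, num_expressers, vocab_size, d):
    """attn[k]: (N_k, d) mutual-attention vectors u_r collected for expresser k;
       words[k]: (N_k,) word label of each vector. Returns V: (vocab_size, d)."""
    expr = torch.zeros(num_expressers, vocab_size, d)       # expression vectors e_c^k
    for k in range(num_expressers):
        for c in range(vocab_size):
            mask = words[k] == c
            if mask.any():
                expr[k, c] = attn[k][mask].mean(dim=0)      # average over samples labelled c
    var = expr.var(dim=0, unbiased=False)                   # variance across expressers
    return torch.sigmoid(var)                               # v_c = sigma(variance)

def weighted_ce(logits, labels, V):
    """Weight each token's cross entropy by its word's diversity weight; the mean over
       the feature dimension turns v_c into a scalar per token (one simple choice)."""
    ce = Fn.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          labels.reshape(-1), reduction="none")
    w = V[labels.reshape(-1)].mean(dim=-1)
    return (w * ce).mean()
```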
In one embodiment of the present invention, the implementation process of the above step 6 is as follows:
6.1) The complete source domain containing K domains is randomly divided, in each training round, into a meta-training set D_tr = {D_i | i ∈ [1, N_tr]} and a meta-test set D_te = {D_j | j ∈ [1, N_te]}, where N_tr denotes the data volume of the meta-training set and N_te denotes the data volume of the meta-test set.
6.2) As shown in FIG. 3, during training with the meta-learning strategy, the meta-training stage updates the model parameters with a gradient descent learning method by minimizing the cross entropy loss function term L_wce obtained in 5.2); the calculation process is as follows:
θ′ = θ - α ∇_θ L_wce
where θ denotes all trainable parameters in the model, α denotes the learning rate of the meta-training stage, θ′ denotes the updated parameters of the model, and ∇_θ denotes the gradient. Through the meta-training stage, the expression of the expressive persons on ambiguous words and the semantic space consistency of the words are maintained.
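The meta-training update θ′ = θ - α ∇_θ L_wce can be sketched as a first-order step as below; a full meta-learning implementation that differentiates through this inner update would keep the computation graph (create_graph=True), which is omitted here for brevity.

```python
# Sketch of the meta-training inner update on the meta-training split:
# theta' = theta - alpha * grad(L_wce)  (first-order version).
import torch

def meta_train_step(params, wce_loss_fn, meta_train_batch, alpha=1e-3):
    """params: list of trainable tensors (theta); wce_loss_fn returns the
       word-granularity-weighted cross entropy on a batch."""
    loss = wce_loss_fn(meta_train_batch)
    grads = torch.autograd.grad(loss, params, create_graph=False)  # first-order approximation
    theta_prime = [p - alpha * g for p, g in zip(params, grads)]   # updated parameters theta'
    return theta_prime, loss
```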
6.3) In the meta-test stage, the data of the meta-test set D_te are fed into the model whose parameters have been updated to θ′, and the attention vectors calculated in the decoder are used to obtain the global and local contrast learning loss function terms described in step 6, which are functionally complementary to each other.
The purpose of the global contrast learning loss function term is to stabilize the inter-word class relations, so that the learned relations among words are preserved on unseen expressive persons. For a particular expressive person k, the expression vector e_c^k on a particular word obtained in step 5.1) is used to calculate the probability distribution of the word, and the global contrast learning loss function term is then obtained; the calculation process is as follows:
p_c^k = softmax(e_c^k / τ),  L_gcl = (1/N_o) Σ_{o=1..N_o} Σ_{c=1..T_c} H(p_c^{D_i} ‖ p_c^{D_j})
where softmax(·) denotes the activation function, T_c denotes the number of words decoded before time step t, τ denotes the temperature coefficient, p_c^k denotes the probability distribution of the video-sentence pairs of expressive person k over the word c, N_o is the number of (D_i, D_j) pairs formed by a domain D_i in the meta-training set D_tr and a domain D_j in the meta-test set D_te, p_c^{D_i} denotes the probability distribution on the word c of the domain D_i in the o-th domain pair, p_c^{D_j} is defined similarly, L_gcl denotes the global contrast learning loss function term, and H(·‖·) denotes the relative entropy.
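A sketch of the global contrast learning term for one (D_i, D_j) pair is given below; the temperature-scaled softmax over the expression vectors is a reconstruction from the definitions above, not a formula given verbatim in the text.

```python
# Sketch of the global contrast learning loss for one (D_i, D_j) domain pair:
# per-word distributions from the expression vectors, compared by relative entropy.
import torch
import torch.nn.functional as Fn

def global_contrastive_loss(expr_i, expr_j, tau=0.1):
    """expr_i, expr_j: (vocab_size, d) expression vectors e_c of the two domains."""
    p_i = Fn.softmax(expr_i / tau, dim=-1)       # assumed form of p_c^{D_i}
    p_j = Fn.softmax(expr_j / tau, dim=-1)       # assumed form of p_c^{D_j}
    # H(p_i || p_j), averaged over the vocabulary (kl_div expects log-probs as input)
    return Fn.kl_div(p_j.log(), p_i, reduction="batchmean")
```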
The purpose of the local contrast learning loss function term is to mitigate the impact of ambiguous words on the model irrespective of the expressive person, so that the predictions of the model are not sensitive to unseen expressive persons. The word granularity attention vectors U^k obtained in step 5.1) over the source domain are combined pairwise to obtain a number of sample pairs (x_b, y_b) constituting a set A. The local contrast learning loss function term can then be obtained by the following formula:
L_lcl = (1/N_b) Σ_{b=1..N_b} [ (1 - Y) · ρ(x_b, y_b)^2 + Y · max(0, ξ - ρ(x_b, y_b))^2 ]
where ρ(·) denotes a distance function, (x_b, y_b) denotes the b-th sample pair in the set A, N_b is the number of sample pairs, and Y = 1 - [x_l = y_l], i.e., when x_l = y_l, [x_l = y_l] = 1 and Y = 0, otherwise [x_l = y_l] = 0 and Y = 1; x_l and y_l are the respective labels of the sample pair x_b and y_b, ξ denotes a coefficient controlling the amplitude of the distance between the two samples, and L_lcl denotes the local contrast learning loss function term. In this embodiment, taking computational complexity into account, all samples are stored in an ordered queue at sampling time and two samples are dequeued at a time as a sample pair, rather than enumerating all pairs.
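A sketch of the local contrast learning term follows, using the standard margin-based contrastive formulation that the definitions of Y, ρ and ξ describe; the Euclidean distance is assumed for ρ.

```python
# Sketch of the local contrast learning loss over sample pairs (x_b, y_b):
# same-label pairs (Y = 0) are pulled together, different-label pairs (Y = 1)
# are pushed apart up to the margin xi.
import torch

def local_contrastive_loss(x, y, same_label, xi=1.0):
    """x, y: (N_b, d) paired word-granularity attention vectors;
       same_label: (N_b,) bool tensor, True when the pair shares the word label."""
    dist = torch.norm(x - y, dim=-1)                         # rho(x_b, y_b), Euclidean here
    Y = (~same_label).float()                                # Y = 1 - [x_l == y_l]
    loss = (1 - Y) * dist.pow(2) + Y * torch.clamp(xi - dist, min=0).pow(2)
    return loss.mean()
```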
In one embodiment of the present invention, the implementation process of the step 7 is as follows:
The updated cross entropy loss function term obtained in step 5 is combined with the global and local contrast learning loss function terms obtained in step 6 to obtain the combined final complete loss function of step 7, with the calculation formula:
L_total = L_wce + λ (L_gcl + L_lcl)
where λ is a balance coefficient between the task-based loss function term and the contrast-learning-based loss function terms, and L_total denotes the final complete loss function. By minimizing the complete loss function, the initialization parameters θ of the model are updated with a gradient descent learning method; the calculation process is as follows:
θ ← θ - β ∇_θ L_total
where β denotes the learning rate. A model with better generalization capability is obtained through this parameter update.
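Putting the pieces together, the meta-test objective and the outer parameter update can be sketched as follows; λ and β are hyperparameters, and the loss tensors are assumed to come from the sketches above.

```python
# Sketch of step 7: total loss = weighted CE + lambda * (global + local contrastive
# terms), followed by the outer gradient step of size beta on the parameters theta.
import torch

def meta_test_update(params, wce_loss, gcl_loss, lcl_loss, beta=1e-4, lam=0.5):
    """params: list of trainable tensors; the three losses are scalar tensors that
       depend on those parameters (see the sketches above)."""
    total = wce_loss + lam * (gcl_loss + lcl_loss)      # L_total = L_wce + lambda*(L_gcl + L_lcl)
    grads = torch.autograd.grad(total, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= beta * g                               # theta <- theta - beta * grad
    return total.detach()
```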
In the prediction stage on the target domain D_tg, the lip language or finger language videos of unseen expressive persons are input into the trained model, and the prediction of natural language sentences is completed through the encoder and the decoder, thereby completing the temporally aligned visual language translation task; this stage does not require the participation of meta-learning.
The above method is applied to the following embodiments to embody the technical effects of the present invention, and specific steps in the embodiments are not described in detail.
The invention is evaluated on two temporally aligned visual translation tasks: the lip language translation dataset GRID and the finger language translation dataset ChicagoFSWild. The GRID dataset contains 33,000 video-sentence pairs recorded by 33 speakers, and the vocabulary consists of 51 different words in 6 categories. To verify the effect of the model on unseen speakers, the speakers (s1, s2, s20, s22) are selected as the test set. To verify the robustness of the model, three further similar divisions are given, in which the unseen speakers are (s3, s4, s23, s24), (s5, s6, s25, s26) and (s7, s8, s27, s28), respectively. ChicagoFSWild comprises 7,304 finger language videos of 160 finger spellers and the corresponding natural language texts; all data are divided into three non-overlapping sets of finger spellers, in which 5,455 data of 87 finger spellers are used as the training set, 981 data of 37 finger spellers as the validation set, and 868 data of 36 finger spellers as the test set. The vocabulary scale is 31, consisting of 26 English letters and 5 special characters.
In order to objectively evaluate the performance of the invention, two evaluation indexes, character error rate (CER) and word error rate (WER), are used to evaluate the effect on the lip language translation task. The error rate is calculated as (S + D + I) / M, where S, D and I respectively denote the numbers of substitution, deletion and insertion operations when the two sequences are aligned, and M is the number of characters or words in the sequence. For the finger language translation task, the letter accuracy evaluation index is used, obtained as 1 - (S + D + I) / M, where S, D, I and M are defined as in the lip language translation task.
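For reference, these error-rate metrics can be computed from the Levenshtein alignment as sketched below; this is a standard computation, not code from the patent.

```python
# Sketch of the evaluation metrics: error rate = (S + D + I) / M computed from the
# edit distance between reference and hypothesis; letter accuracy = 1 - error rate.
def edit_distance(ref, hyp):
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def error_rate(ref_tokens, hyp_tokens):
    """WER if tokens are words, CER if tokens are characters."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

def letter_accuracy(ref_letters, hyp_letters):
    return 1.0 - error_rate(ref_letters, hyp_letters)
```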
The implementation details of the invention on the dataset selected for the lip language translation task are as follows: the lip language video is first processed by a Dlib detector to extract a 100×60 pixel frame sequence centered on the lips as the input video, and the dataset is augmented by horizontal flipping with a probability of 50%. The implementation details on the dataset selected for the finger language translation task are as follows: the finger language video is processed by a face detector to extract a frame sequence centered on the face, and the picture size is adjusted to 112×112 as the input data.
The invention is compared on the lip language translation task with the following prior art models:
1. SimulLR model: this model adopts an attention-guided adaptive memory module to realize simultaneous translation, and reaches the most advanced level on the CER and WER indexes on the conventional division of the GRID dataset.
2. LipNet model: this is an earlier model that uses the CTC approach to achieve sentence-level prediction, and it is the first attempt to divide the GRID dataset so that speakers do not overlap.
3. LCANet model: this is an end-to-end model with a cascade of attention and CTC decoders, which partly eliminates the influence of conditional dependency and accelerates model convergence.
4. TM-seq2seq model: this model can autoregressively generate the target natural language text by processing video with or without audio simultaneously.
5. AV-HuBERT model: this model realizes audio-visual lip language recognition through self-supervised representation learning with a masked multi-modal clustering method.
Following the procedure described in the detailed description, the experimental results obtained are shown in Table 1. The models of the invention are denoted CtoML (base) and CtoML, where CtoML (base) denotes a simplified version of the model that does not use the modules aimed at generalization, i.e., a model trained solely with the task-based cross entropy loss function.
Table 1: test results of the invention for the lip language translation task obtained on the four divisions of the GRID dataset:
As can be seen from Table 1, in all four divisions the translation performance of the model of the invention, CtoML, is significantly better than the current most advanced lip language translation models SimulLR, LipNet, TM-seq2seq, LCANet and AV-HuBERT. Compared with AV-HuBERT, CtoML is better by an average of 1.44% on the WER index and 1.36% on the CER index over the four divisions. The comparison shows that the word granularity diversity weight proposed by the invention helps the model acquire the learning direction on inherently ambiguous words. Furthermore, the behavior of CtoML (base) demonstrates that the domain generalization capability not considered by previous models is also improved. The model CtoML of the invention performs more stably across the different divisions.
The invention is compared on the finger language translation task with the following prior art models:
1. HDC-FSR model: this model realizes finger language translation by training a hand detector and an attention-based encoder and decoder for the proposed ChicagoFSWild dataset.
2. IAF-FSR model: this model performs end-to-end translation from finger language video to natural language text, gradually narrowing the finger language action area in the frame sequence with an iteration-based visual attention mechanism.
3. FGVA model: this is a method that uses fine-grained visual attention, predicting the natural language text serially with multi-head attention mechanisms and training with CTC and cross entropy loss functions.
4. TDC-SL model: this model uses sequence learning based on temporally variable convolutions, fusing temporal and spatial features to realize end-to-end finger language translation.
The abbreviated names of the above comparison models are derived from their methods and full names, because the original works do not provide abbreviated naming for these models. The experimental results obtained according to the procedure described in the specific embodiment are shown in Table 2.
Table 2: test results of the invention for the finger language translation task on the validation set and test set of the ChicagoFSWild dataset:
As can be seen from Table 2, on the test set the performance of CtoML on the finger language translation task is superior to the previous model TDC-SL, improving the letter accuracy from 50.0% to 54.9%; on the validation set, the letter accuracy of CtoML is improved from the 47.0% of FGVA to 55.7%. This benefits from the complementary contrast-learning-based limitations of the invention, which encourage the model to focus on ambiguous words while maintaining a semantically consistent space on unseen expressive persons.
This embodiment also provides a visual language translation system based on contrast learning and word granularity weights, for implementing the above embodiment. The term "module" or "unit" as used below may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible.
In this embodiment, the visual language translation system includes:
the lip language or finger language video preprocessing module is used for extracting the lip language or finger language video embedding characteristics of the source domain;
the natural language text preprocessing module is used for acquiring the natural language text embedding characteristics;
the visual encoder module is used for encoding the lip language or finger language video embedded features to obtain encoded visual features;
the cross-modal decoder module is used for interacting the coded visual characteristics with the natural language text embedded characteristics through a multi-head attention mechanism, and decoding to generate word probability distribution; in the actual translation stage, generating a target natural language text according to the coded visual embedded feature vector in an autoregressive manner;
A pre-training module for performing a preliminary training of the visual encoder and the cross-modal decoder based on the task cross-entropy loss function term;
the word granularity weight calculation module is used for calculating the word granularity diversity weights of the words according to the decoded attention vectors obtained by the preliminarily trained visual encoder and cross-modal decoder;
the contrast-limited meta-learning training module is used for randomly dividing a source domain into a meta-training set and a meta-testing set, and updating the parameters of a visual encoder and a cross-modal decoder by using cross entropy loss items with word granularity diversity weight functions in a meta-training stage; in the meta-test stage, the attention vector calculated in the cross-modal decoder updated in the meta-test stage is utilized to acquire global and local contrast learning loss function items, and the cross entropy loss items with the function of word granularity diversity weight are combined to update the parameters of the visual encoder and the cross-modal decoder again, so that the trained visual encoder and the trained cross-modal decoder are obtained;
and the translation module is used for acquiring the video embedded characteristics of the lip language or the finger language to be translated in the target domain, and completing the visual language translation task of the unseen expressive person by utilizing the trained visual encoder and the cross-modal decoder.
For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant parts, and the specific implementation of the modules is not repeated here. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the invention. Those of ordinary skill in the art can understand and implement the invention without undue burden.
Embodiments of the system of the invention may be applied to any device having data processing capability, such as a computer. The system embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of a device with data processing capability reading the corresponding computer program instructions from a nonvolatile memory into memory and running them.
The foregoing list is only illustrative of specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (10)

1. A visual language translation method based on contrast learning and word granularity weight is characterized by comprising the following steps:
1) Extracting lip language or finger language video embedded features of a source domain, and acquiring natural language text embedded features;
2) Encoding the lip language or finger language video embedded features by a visual encoder based on a multi-head attention mechanism to obtain coded visual features;
3) For the coded visual features, interacting with natural language text embedded features in a cross-modal decoder through a multi-head attention mechanism, decoding to generate word probability distribution, and obtaining cross entropy loss function items based on tasks, and performing preliminary training on the visual encoder and the cross-modal decoder;
4) Calculating the word granularity diversity weight of the word according to decoded attention vectors obtained by the vision encoder and the cross-modal decoder which are preliminarily trained, and updating the cross entropy loss function item in the step 3) according to the word granularity diversity weight to obtain a cross entropy loss item with the word granularity diversity weight function;
5) Randomly dividing a source domain into a meta training set and a meta testing set, and updating parameters of a visual encoder and a cross-modal decoder by using cross entropy loss items with the function of word granularity diversity weight in a meta training stage; in the meta-test stage, the attention vector calculated in the cross-modal decoder updated in the meta-test stage is utilized to acquire global and local contrast learning loss function items, and the cross entropy loss items with the function of word granularity diversity weight are combined to update the parameters of the visual encoder and the cross-modal decoder again, so that the trained visual encoder and the trained cross-modal decoder are obtained;
6) And acquiring the video embedded characteristics of the lip language or the finger language to be translated in the target domain, and completing the visual language translation task of the unseen expressive person by using the trained visual encoder and the cross-modal decoder.
2. The method of claim 1, wherein the lip or finger video of the source domain is time aligned with the corresponding natural language text.
3. The visual language translation method based on contrast learning and word granularity weight according to claim 1, wherein the visual encoder has a multi-layer structure, each layer being formed by stacking self-attention layers, feedforward neural networks, residual connections and layer normalization operations, and is used for encoding the lip language or finger language video embedded features to generate a coded video feature matrix;
The first layer calculation process of the visual encoder is expressed as follows:
X = LN(F + SA(F))
FFN(X) = W_3 σ(W_2 X)
F′ = LN(X + FFN(X))
where LN(·) denotes the layer normalization operation, X denotes the obtained intermediate result within the encoder layer, FFN(·) denotes the feedforward neural network, σ(·) denotes the nonlinear activation function, W_2 and W_3 are learnable parameter matrices, F′ denotes the obtained coded video feature matrix, F denotes the lip language or finger language video embedded features, and SA(·) denotes the self-attention layer.
4. The visual language translation method based on contrast learning and word granularity weight according to claim 1, wherein the cross-modal decoder has a multi-layer structure, each layer being formed by stacking self-attention layers, mutual attention layers, feedforward neural networks, residual connections and layer normalization operations, and is used for predicting the word of the next time step according to the natural language text embedded features and the coded video feature matrix;
the first layer calculation process of the cross-mode decoder is expressed as follows:
E′_t = LN(E_t + SA(E_t))
I_t = LN(E′_t + SA(E′_t, F′, F′))
I′_t = LN(I_t + FFN(I_t))
p_t = σ(W_p I′_t + b_p)
where E_t denotes the text embedding features before time step t, SA(·) denotes the self-attention layer, LN(·) denotes the layer normalization operation, E′_t denotes the text embedding features updated via the self-attention layer, F′ denotes the acquired coded video feature matrix, I_t denotes the intermediate result of the mutual attention layer, FFN(·) denotes the feedforward neural network, I′_t denotes the decoded output, σ(·) denotes the nonlinear activation function, W_p and b_p are respectively a weight matrix and a bias vector, and p_t denotes the calculated probability distribution.
5. The visual language translation method based on contrast learning and word granularity weights according to claim 4, wherein the calculation formula of the task-based cross entropy loss function term is as follows:
L_ce = - Σ_{t=1..T_l} log p_t(l_t | l_{<t}, F)
where l_t denotes the word at time step t, l_{<t} denotes the words before time step t, p_t(l_t | l_{<t}, F) denotes the probability of generating l_t given l_{<t} and F, L_ce denotes the task-based cross entropy loss function term, and T_l denotes the number of words.
6. The method for visual language translation based on contrast learning and word granularity weights according to claim 4, wherein said step 4) comprises:
4.1) Inputting the embedded features of the source domain data into the visual encoder and the cross-modal decoder obtained by the preliminary training in step 3), and recording the calculation result of the mutual-attention layer SA(E'_t, F', F') of the last layer of the cross-modal decoder as a^k = {a^k_1, ..., a^k_{T_t}}, where T_t is the number of words before time step t and a^k_r represents the attention vector of the k-th expressive person for the r-th word;
The personalized expression effect of the k-th expressive person on the word c is obtained by calculation according to the following formula:

e^k_c = (1 / N^k_c) Σ_r 1[l_r = c] · a^k_r

where 1[l_r = c] indicates that the r-th word is c, N^k_c is the number of samples with word label c, and e^k_c represents the expression vector of the k-th expressive person on the word c;
4.2) According to the expression vectors of different expressive persons on the same word, calculating the variance across the expressive persons, the calculation formula being as follows:

v_c = σ( (1/K) Σ_{k=1}^{K} (e^k_c − ē_c)^2 ),  with ē_c = (1/K) Σ_{k=1}^{K} e^k_c

where K is the number of expressive persons in the source domain, σ(·) represents the nonlinear activation function, and v_c represents the diversity weight vector of the word c; the complete word granularity diversity weight matrix is obtained by splicing the weights of each word and is expressed as V = [v_1, v_2, ..., v_{T_c}], where T_c represents the length of the vocabulary;
4.3) Updating the cross entropy loss function term according to the word granularity diversity weight matrix obtained in step 4.2), the calculation formula being as follows:

L_w = − Σ_{t=1}^{T_l} v_{l_t} ⊙ log p_t(l_t | l_{<t}, F)

where ⊙ represents the element-wise multiplication operation between vectors at word granularity, v_{l_t} is the diversity weight of the label word l_t taken from the word granularity diversity weight matrix V, and L_w represents the cross entropy loss function term with the word granularity diversity weights.
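A sketch of steps 4.1)–4.3) under simplifying assumptions: attention vectors with word labels are averaged into per-word expression vectors, their variance across the K expressive persons is passed through a sigmoid (one possible σ(·)) and reduced to a single scalar weight per word (this reduction is an assumption), and that weight rescales the per-step cross entropy:

```python
import torch

def expression_vectors(attn, labels, vocab_size):
    """attn: (N, d) attention vectors of one expressive person; labels: (N,) word ids."""
    e = torch.zeros(vocab_size, attn.size(1))
    for c in range(vocab_size):
        mask = labels == c
        if mask.any():
            e[c] = attn[mask].mean(dim=0)          # average over the samples labelled with word c
    return e                                        # e[c] plays the role of e^k_c

def diversity_weights(per_person):
    """per_person: list of (vocab, d) expression-vector tensors, one per expressive person."""
    stacked = torch.stack(per_person)               # (K, vocab, d)
    var = stacked.var(dim=0, unbiased=False)        # variance across the K expressive persons
    return torch.sigmoid(var).mean(dim=-1)          # sigma(.) as sigmoid, reduced to one weight per word

def weighted_cross_entropy(probs, labels, weights, eps=1e-9):
    """Rescale each step's -log p_t(l_t|...) by the diversity weight of the label word."""
    picked = probs[torch.arange(labels.numel()), labels]
    return -(weights[labels] * (picked + eps).log()).sum()
```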
7. The method for visual language translation based on contrast learning and word granularity weights according to claim 6, wherein said step 5) comprises:
5.1) During each training round, randomly dividing the source domain data containing K domains into a meta-training set D_tr = {D_1, D_2, ..., D_{N_tr}} and a meta-test set D_te = {D_1, D_2, ..., D_{N_te}}, where N_tr represents the data volume of the meta-training set, D_i represents the i-th data in the meta-training set, N_te represents the data volume of the meta-test set, and D_j represents the j-th data in the meta-test set;
5.2) In the meta-training stage, calculating the cross entropy loss term with the word granularity diversity weights and updating the parameters:

θ' = θ − α ∇_θ L_w(D_tr; θ)

where θ represents all trainable parameters of the visual encoder and the cross-modal decoder, α represents the learning rate of the meta-training stage, θ' represents the updated parameters, and ∇_θ represents the gradient;
5.3) In the meta-test stage, feeding the data pairs in the meta-test set into the model with the updated parameters θ', and calculating the global and local contrastive learning loss function terms by using the attention vectors calculated in the decoder;
5.4) Combining the global and local contrastive learning loss function terms of step 5.3) with the cross entropy loss term with the word granularity diversity weights of step 5.2) to obtain the total loss, and updating the parameters according to the total loss:

L_total = L_w + L_global + L_local

θ ← θ − β ∇_θ L_total

where β represents the learning rate, θ represents all trainable parameters of the visual encoder and the cross-modal decoder, and L_total represents the total loss.
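One training round of steps 5.1)–5.4) could be organised as below. This is a first-order sketch (second-order gradients through θ' are ignored), and weighted_ce_loss / contrastive_losses are passed in as placeholder callables standing for the loss terms defined in this and the following claims:

```python
import copy
import torch

def meta_round(model, meta_train, meta_test, weighted_ce_loss, contrastive_losses,
               alpha=1e-3, beta=1e-4):
    # Meta-training stage: adapt a copy of the parameters with the weighted CE loss,
    # i.e. theta' = theta - alpha * grad_theta L_w(meta_train).
    adapted = copy.deepcopy(model)
    grads = torch.autograd.grad(weighted_ce_loss(adapted, meta_train), adapted.parameters())
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= alpha * g

    # Meta-test stage: global/local contrastive terms from the adapted decoder's attention vectors.
    loss_g, loss_l = contrastive_losses(adapted, meta_test)
    grads_te = torch.autograd.grad(loss_g + loss_l, adapted.parameters(), allow_unused=True)

    # Outer update: the total loss combines the weighted CE term with the contrastive terms;
    # as a first-order shortcut, the test-stage gradients are applied directly to theta.
    grads_tr = torch.autograd.grad(weighted_ce_loss(model, meta_train), model.parameters())
    with torch.no_grad():
        for p, g_tr, g_te in zip(model.parameters(), grads_tr, grads_te):
            p -= beta * (g_tr + (g_te if g_te is not None else 0.0))
```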
8. The method for visual language translation based on contrast learning and word granularity weights according to claim 7, wherein the calculation of the global contrastive learning loss function term comprises:
For a particular expressive person k, the expression vector e^k_c on a particular word c obtained in step 4.1) is used to calculate the probability distribution of the word, and the global contrastive learning loss function term is then obtained; the calculation formula is as follows:

q^k_c = Softmax(e^k_c / τ)

L_global = (1 / N_o) Σ_{o=1}^{N_o} Σ_{c=1}^{T_c} H(q^{i}_{o,c} ‖ q^{j}_{o,c})

where Softmax(·) represents the activation function, T_c represents the number of words decoded before time step t, τ represents the temperature coefficient, q^k_c represents the probability distribution of the video-sentence pairs of the expressive person k on the word c, N_o is the number of (D_i, D_j) pairs formed by domain D_i in the meta-training set D_tr and domain D_j in the meta-test set D_te, q^{i}_{o,c} represents the probability distribution of domain D_i on the word c in the o-th domain pair, L_global represents the loss function term of global contrastive learning, and H(·‖·) represents the relative entropy.
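Assuming the per-word expression vectors of each paired domain are available, the global term can be sketched with a temperature-scaled softmax and a KL divergence; the normalisation over the number of pairs is an assumption:

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(pairs, tau=0.1):
    """pairs: list of (e_i, e_j); each is a (vocab, d) tensor of a domain's expression vectors."""
    loss = torch.tensor(0.0)
    for e_i, e_j in pairs:
        log_q_i = F.log_softmax(e_i / tau, dim=-1)   # temperature-scaled distribution of domain D_i
        q_j = F.softmax(e_j / tau, dim=-1)           # temperature-scaled distribution of domain D_j
        # relative entropy H(q_j || q_i), summed over the words c of the o-th pair
        loss = loss + F.kl_div(log_q_i, q_j, reduction='sum')
    return loss / max(len(pairs), 1)
```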
9. The method for visual language translation based on contrast learning and word granularity weights according to claim 7, wherein the calculation of the local contrast learning loss function term comprises:
combining the word granularity attention vectors obtained in step 4.1) over the whole source domain data pairwise to obtain a plurality of sample pairs (x_b, y_b) constituting the set A; the calculation formula of the loss function term of local contrastive learning is as follows:

L_local = (1 / N_b) Σ_{b=1}^{N_b} [ (1 − Y) ρ(x_b, y_b)^2 + Y · max(0, ζ − ρ(x_b, y_b))^2 ]

where ρ(·) represents a distance function, (x_b, y_b) represents the b-th sample pair in the set A, N_b is the number of sample pairs, Y = 1 − [x_l = y_l], that is, when x_l = y_l, [x_l = y_l] = 1 and Y = 0, otherwise [x_l = y_l] = 0 and Y = 1, x_l and y_l are the labels of the samples x_b and y_b respectively, ζ represents a coefficient controlling the amplitude of the distance between two samples, and L_local represents the loss function term of local contrastive learning.
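The local term reads like the classic margin-based pair loss; the squared penalties and the Euclidean choice for ρ(·) in the sketch below are assumptions rather than the patent's exact form:

```python
import torch

def local_contrastive_loss(pairs, labels, xi=1.0):
    """pairs: list of (x_b, y_b) attention-vector pairs; labels: list of (x_label, y_label)."""
    total = torch.tensor(0.0)
    for (x_b, y_b), (x_l, y_l) in zip(pairs, labels):
        rho = torch.norm(x_b - y_b)                    # rho(.): Euclidean distance (assumption)
        Y = 0.0 if x_l == y_l else 1.0                 # Y = 1 - [x_l == y_l]
        total = total + (1 - Y) * rho.pow(2) + Y * torch.clamp(xi - rho, min=0).pow(2)
    return total / max(len(pairs), 1)
```

Pairs with the same word label are pulled together, while pairs with different labels are pushed apart up to the margin ζ.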
10. A visual language translation system based on contrast learning and word granularity weights, comprising:
the lip language or finger language video preprocessing module is used for extracting the lip language or finger language video embedding characteristics of the source domain;
the natural language text preprocessing module is used for acquiring the natural language text embedding characteristics;
the visual encoder module is used for encoding the lip language or finger language video embedded features to obtain encoded visual features;
the cross-modal decoder module is used for making the encoded visual features interact with the natural language text embedded features through a multi-head attention mechanism and decoding to generate the word probability distribution; in the actual translation stage, it generates the target natural language text autoregressively from the encoded visual embedded feature vector;
a pre-training module for performing the preliminary training of the visual encoder and the cross-modal decoder based on the task cross entropy loss function term;
the word granularity weight calculation module is used for calculating the word granularity diversity weights of the words according to the decoded attention vectors obtained by the preliminarily trained visual encoder and cross-modal decoder;
the contrast-limited meta-learning training module is used for randomly dividing the source domain into a meta-training set and a meta-test set; in the meta-training stage, it updates the parameters of the visual encoder and the cross-modal decoder with the cross entropy loss term carrying the word granularity diversity weights; in the meta-test stage, it uses the attention vectors calculated in the cross-modal decoder updated in the meta-training stage to obtain the global and local contrastive learning loss function terms, and combines them with the cross entropy loss term carrying the word granularity diversity weights to update the parameters of the visual encoder and the cross-modal decoder again, so as to obtain the trained visual encoder and cross-modal decoder;
and the translation module is used for acquiring the video embedded characteristics of the lip language or the finger language to be translated in the target domain, and completing the visual language translation task of the unseen expressive person by utilizing the trained visual encoder and the cross-modal decoder.
CN202310461929.3A 2023-04-26 2023-04-26 Visual language translation method and system based on contrast learning and word granularity weight Pending CN116484885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310461929.3A CN116484885A (en) 2023-04-26 2023-04-26 Visual language translation method and system based on contrast learning and word granularity weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310461929.3A CN116484885A (en) 2023-04-26 2023-04-26 Visual language translation method and system based on contrast learning and word granularity weight

Publications (1)

Publication Number Publication Date
CN116484885A true CN116484885A (en) 2023-07-25

Family

ID=87213448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310461929.3A Pending CN116484885A (en) 2023-04-26 2023-04-26 Visual language translation method and system based on contrast learning and word granularity weight

Country Status (1)

Country Link
CN (1) CN116484885A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789921A (en) * 2024-02-23 2024-03-29 青岛农业大学 Online surgical video instrument tracking system based on text prompting
CN117789921B (en) * 2024-02-23 2024-05-14 青岛农业大学 Online surgical video instrument tracking system based on text prompting

Similar Documents

Publication Publication Date Title
CN112487182B (en) Training method of text processing model, text processing method and device
US11900518B2 (en) Interactive systems and methods
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112837669B (en) Speech synthesis method, device and server
CN111382257A (en) Method and system for generating dialog context
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Kim et al. Cross-modal distillation with audio–text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN114648032A (en) Training method and device of semantic understanding model and computer equipment
CN114282555A (en) Translation model training method and device, and translation method and device
CN116611459B (en) Translation model training method and device, electronic equipment and storage medium
Young Hey Cyba: The Inner Workings of a Virtual Personal Assistant
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN116362242A (en) Small sample slot value extraction method, device, equipment and storage medium
CN115273856A (en) Voice recognition method and device, electronic equipment and storage medium
CN115617959A (en) Question answering method and device
Dehaqi et al. Adversarial image caption generator network
Deepa et al. Improving relation extraction beyond sentence boundaries using attention
CN114840697B (en) Visual question-answering method and system for cloud service robot
US20240169633A1 (en) Interactive systems and methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination