CN115937641A - Method, device and equipment for intermodal joint coding based on Transformer - Google Patents

Method, device and equipment for intermodal joint coding based on Transformer

Info

Publication number
CN115937641A
CN115937641A
Authority
CN
China
Prior art keywords
features
characterization
text
feature
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211335121.2A
Other languages
Chinese (zh)
Inventor
刘绍辉
米亚纯
郭富博
姜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211335121.2A
Publication of CN115937641A
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a Transformer-based inter-modal joint coding method, device and equipment, relates to the technical field of multi-modal fusion, and solves the technical problem of how to fuse information among modalities to achieve a better emotion classification effect. The method comprises the following steps: acquiring a video to be analyzed containing multi-modal information; extracting text features, audio features and video picture features of the video to be analyzed; unifying the text features, the audio features and the video picture features into the same dimension based on a fully connected layer and an LSTM layer; performing multi-modal attention joint coding on the text features, the audio features and the video picture features based on a Transformer model; and weighting the resulting characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed. The method can adopt a Transformer model to carry out joint attention coding on different modalities simultaneously and achieves a better classification effect.

Description

Method, device and equipment for intermodal joint coding based on Transformer
Technical Field
The invention relates to the technical field of multi-modal fusion.
Background
Much research has developed around the multi-modal analysis of videos, and with the development of deep learning in recent years, the related research on multi-modal video analysis has advanced greatly. A video usually carries three modalities: text, audio and video pictures. The text modality mainly includes the text accompanying the video as well as the subtitles and dialogues carried by the video frames; the audio modality is mainly the auditory information of the video, including dialogue and background music; the video modality is mainly the visual information in the video.
Existing multi-modal emotion analysis is mainly based on deep learning techniques and models both the information within each modality and the interactive information between modalities. Intra-modality modeling refers to modeling the information of a particular modality independently of the other modalities. Inter-modality modeling refers to modeling the information between different modalities, covering both synchronized and unsynchronized information. For the task of multi-modal video analysis, a major challenge is to seek feature representations within a modality and feature fusion between different modalities. Referring to fig. 1, the basis of multi-modal video analysis is to extract visual, auditory and text features with appropriate feature extraction mechanisms and to fuse the extracted features of the three different modalities for subsequent analysis.
For multi-modal video tasks, a key point is to explore the fusion of feature representations between different modalities. Some work relies mainly on inter-modal information fusion at an early stage, that is, fusion on the feature dimensions of the different modalities: the features of different modalities are simply concatenated and used as the input of a prediction model to predict the emotional attitude, and the prediction models of these methods are mostly widely used traditional methods such as hidden Markov models, support vector machines or conditional random fields. Other work performs inter-modal information fusion at a later stage: a model is designed and trained independently for each modality, and the emotional attitude is predicted by voting or weighting. These methods usually train the models separately for the different modalities, do not consider the interaction between modalities during training, and predict the final result by combining the prediction results of the multiple models.
In multi-modal emotion understanding tasks based on neural networks, the models built for multi-modal information and its interactions generally have low interpretability, and most neural-network-based models treat the information fusion between modalities as a black box. Despite the success of many models on some tasks, researchers are still trying to understand these models and apply them safely.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a method, a device and equipment for inter-modal joint coding based on a Transformer. An interactive attention mechanism is introduced so that a Transformer model can perform joint attention coding on different modalities simultaneously, thereby achieving a better classification effect.
A Transformer-based inter-modal joint coding method comprises the following steps:
acquiring a video to be analyzed containing multi-modal information;
extracting text features of the video to be analyzed;
extracting the audio features of the video to be analyzed;
extracting video picture characteristics of the video to be analyzed;
unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on a full connection layer and an LSTM layer;
based on a Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features;
and weighting the text characterization features, the audio characterization features and the video image characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed.
Further, extracting the text features based on a pre-trained Chinese-BERT-wwm model;
the Chinese-BERT-wwm model is loaded through the Hugging Face Transformers library;
the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process.
Further, extracting the video picture features of the video to be analyzed by adopting an R(2+1)D model;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps of:
extracting three-dimensional sequence features in the video to be analyzed;
flattening the three-dimensional sequence features into two-dimensional sequence features;
and performing down-sampling on the two-dimensional sequence features, selecting 1 frame out of every 16 frames for feature extraction, and selecting the network output of the R(2+1)D model at the spatio-temporal pooling layer as the extracted video picture features.
Further, in the process of multi-mode attention joint coding, the text mode is used as a main mode, and modulation coding is carried out on an audio mode and a video picture mode;
the Transformer model comprises three joint coding units, wherein each joint coding unit comprises a Multi-Head attention module, a first residual connection standardization module, a feedforward module Feed-Forward, a second residual connection standardization module, a soft attention module soft-attention and a third residual connection standardization module which are sequentially connected;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
inputting the text features and the audio features into a second combined encoding unit to obtain audio representation features;
and inputting the text characteristics and the video picture characteristics into a third joint coding unit to obtain video picture characterization characteristics.
Further, the soft attention module comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is expressed by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
Further, based on a multi-layer perceptron classification model, weighting the text characterization feature, the audio characterization feature and the video image characterization feature to obtain a classification result of the video to be analyzed, including:
respectively inputting the text characterization feature, the audio characterization feature and the video picture characterization feature into a first full-link layer, a RELU activation function layer and a second full-link layer, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
based on a soft-attention mechanism, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features;
respectively weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the soft attention vector to obtain a final text characterization feature, a final audio characterization feature and a final video picture characterization feature;
and performing layer normalization weighting fusion on the final text characterization features, the final audio characterization features and the final video picture characterization features based on a fusion weight matrix obtained in advance, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
Further, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features, and the text characterization features, the audio characterization features and the video picture characterization features are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
Further, based on a fusion weight matrix obtained in advance, layer normalization weighted fusion is carried out on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, satisfying W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
A Transformer-based inter-modal joint coding apparatus, comprising:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting the text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization features, the audio characterization features and the video image characterization features based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
An electronic device comprises a processor and a storage device, wherein a plurality of instructions are stored in the storage device, and the processor is used for reading the plurality of instructions in the storage device and executing the method.
The inter-modality joint coding method, device and equipment based on the Transformer at least have the following beneficial effects:
(1) A neural network model is built based on a Transformer, and an interactive attention mechanism is adopted to jointly encode the features of the various modalities, so that information fusion between modalities is carried out through the attention mechanism and the neural network can obtain, across modalities, the information related to the emotional attitude to be analyzed. With the interactive attention module introduced, the Transformer model can perform joint attention coding on different modalities at the same time, without sequentially feeding the features of the different modalities into the Transformer model; the strong characterization capability of the Transformer is used to mine the attention information in the features while modeling the correlation between specific modalities, thereby achieving a better classification effect;
(2) When the classification model carries out weighted fusion on the characterization features containing the emotional tendency, a random inactivation Dropout mechanism is introduced, so that overfitting in the neural network training process can be prevented; a soft-attention mechanism is introduced, so that key information can be extracted better, and a better classification effect is realized.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of a transform-based inter-modal joint coding method according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a text feature extraction model in the method provided by the present invention;
FIG. 3 is a schematic diagram illustrating the structural comparison between the C3D model and the R(2+1)D model provided by the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of the R(2+1)D model provided by the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a three-mode joint coding Transformer model provided in the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a two-mode jointly encoded Transformer model according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Referring to fig. 1, in some embodiments, there is provided a Transformer-based inter-modality joint coding method, including:
s1, acquiring a video to be analyzed containing multi-modal information;
s2, extracting text features of the video to be analyzed;
s3, extracting audio features of the video to be analyzed;
s4, extracting video image characteristics of the video to be analyzed;
s5, unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on the full connection layer and the LSTM layer;
s6, performing multi-mode attention joint coding on the text characteristics, the audio characteristics and the video picture characteristics based on a transform model to obtain text characterization characteristics, audio characterization characteristics and video picture characterization characteristics;
and S7, weighting the text characterization features, the audio characterization features and the video image characterization features based on the multilayer perceptron classification model to obtain a classification result of the video to be analyzed.
In some embodiments, in step S2, the text features are extracted based on a pre-trained Chinese-BERT-wwm model, and the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process. The full word mask mechanism means that if the mask is applied to a sub-word part of a segmented word in a training sample, the other sub-words belonging to the same word are also masked.
Referring to fig. 2, the BERT model is a currently popular Transformer-based pre-trained language model; BERT constructs a bidirectional language model through the encoder structure of the Transformer model. In FIG. 2, TokN represents the N-th word in the sentence, Class Label represents the text feature classification label, E_N denotes the N-th word embedding, and T_N denotes the feature vector of the N-th word output by the model. The advantage of the BERT model is that context information can be learned directly. The Chinese-BERT-wwm model is a BERT pre-training model specialized for Chinese. In the original BERT model, a mask mechanism is applied in the training stage and each sub-word in a training sample is randomly masked; the BERT-wwm model adopts a full word mask mechanism instead, that is, if the mask is applied to part of a sub-word, the other sub-words belonging to the same segmented word are also masked. Chinese-BERT-wwm is a powerful language model pre-trained on a large-scale Chinese corpus and can accurately extract the semantic information in text.
As a preferred implementation, the Hugging Face Transformers library is used to load the Chinese-BERT-wwm pre-trained model. Hugging Face Transformers provides an API that makes it convenient to quickly download and use pre-trained models, so the pre-trained model can easily be used for the feature processing of the text, improving working efficiency.
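As an illustration only, a minimal sketch of this text-feature step is given below; the checkpoint ID "hfl/chinese-bert-wwm" and the mean-pooling of token states are assumptions of the sketch, not requirements of the invention.

```python
# Hedged sketch of step S2: load a Chinese-BERT-wwm checkpoint through the
# Hugging Face Transformers library and pool the last hidden states into one
# feature vector per sentence. Checkpoint ID and pooling choice are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")  # assumed checkpoint ID
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm").eval()

def extract_text_features(sentences):
    """Return one 768-dim feature vector per sentence (mean over token states)."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state              # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, 768)
```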
In step S3, the audio features are extracted as Mel-spectrum features based on the audio signal processing library librosa. The audio information in a video contains a large amount of human speech and music with many non-verbal expressions (laughter, sighs, etc.), which, together with the linguistically meaningful information, has a great influence on the analysis of the video. In the field of speech processing, features such as F0 and MFCCs are widely used, but such higher-level features tend to discard large amounts of information. To overcome this problem, this embodiment uses low-level auditory features as the audio feature input of the subsequent multi-modal analysis and selects Mel-spectrum (Mel-spectra) features for extraction. Mel-spectrum features are well suited to speech recognition tasks and achieve good results, so the Mel-spectrum serves as a good auditory representation in this embodiment. As a preferred implementation, a down-sampling operation is also adopted in the extraction process to reduce the feature dimension along the time axis, thereby reducing complexity, speeding up inference and improving the feature extraction efficiency of the model.
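A minimal sketch of this Mel-spectrum extraction with librosa follows; the sampling rate, number of Mel bands, hop length and the temporal down-sampling stride are illustrative assumptions.

```python
# Hedged sketch of step S3: Mel-spectrum features via librosa with simple
# temporal down-sampling. All numeric settings are illustrative assumptions.
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mels=64, hop_length=512, stride=4):
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # (n_mels, T)
    return mel_db[:, ::stride].T                    # down-sample in time -> (T', n_mels)
```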
Referring to FIG. 3, a structural comparison of the C3D model (a 3D version of the VGG-style model) and the R(2+1)D model (a 3D variant of ResNet) is provided. This embodiment adopts the R(2+1)D network, a 3D convolution model, as the network model for video picture feature extraction. Compared with two-dimensional convolution kernels, three-dimensional convolution kernels inherently suffer from high computation cost and a tendency to overfit. To reduce complexity, the R(2+1)D network factorizes the three-dimensional convolution of the original C3D network, decomposing the three-dimensional convolution operation into a two-dimensional convolution in space followed by a one-dimensional convolution in time. This separates the temporal and spatial information of the video frame content, makes optimization easier and yields a smaller loss function.
Referring to FIG. 4, the model structure of R(2+1)D is provided, and this embodiment employs a pre-trained R(2+1)D-152 network, wherein clip represents a video clip, (2+1)D conv represents the combination of a 2D spatial convolution and a 1D temporal convolution, space-time pool represents the spatio-temporal pooling layer, and fc represents the fully connected layer. The model is pre-trained on the Sports-1M data set, which comprises one million videos collected from the YouTube website covering 487 sports-related categories, with 1000-3000 videos in each category. During pre-training, the model takes 32 RGB video frames at a time; the frames are scaled to 128 × 171 and a 112 × 112 patch is then randomly cropped from them. The network output of the spatio-temporal pooling layer, rather than the output of the fully connected layer, is selected as the extracted feature, so that the model contains only convolutional layers, the temporal length of the network input is not restricted, and the model is more flexible.
In step S4, the video picture characteristics of the video to be analyzed are extracted by adopting the R(2+1)D model;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps:
s41, extracting three-dimensional sequence features in the video to be analyzed;
s42, flattening the three-dimensional sequence features into two-dimensional sequence features;
s43, scaling the size and randomly cutting the two-dimensional sequence features;
s44, down-sampling the two-dimensional sequence features, selecting 1 frame for extracting features from every 16 frames, and selecting the network output features of the space-time pooling layer as the extracted video picture features. Therefore, the complexity of the video characteristics obtained by sampling can be reduced, and the speed of processing the video by the model is increased under the condition that the performance of the model is not reduced.
The first 8 video frames with extracted features are intercepted as the video feature input, so as to ensure that the features in the video are kept aligned.
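A hedged sketch of steps S41-S44 follows. The patent uses an R(2+1)D-152 network pre-trained on Sports-1M; since torchvision only ships the smaller Kinetics-pretrained r2plus1d_18, that model is used here as a stand-in, with the fully connected layer replaced by an identity so that the spatio-temporal pooling output is returned.

```python
# Hedged sketch of the video-picture feature step: keep 1 of every 16 frames,
# truncate to the first 8 kept frames, and read out the spatio-temporal
# pooling feature of an R(2+1)D backbone (torchvision's r2plus1d_18 stands in
# for the R(2+1)D-152 / Sports-1M network described in the patent).
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

backbone = r2plus1d_18(pretrained=True)
backbone.fc = nn.Identity()   # stop at the spatio-temporal pooling output
backbone.eval()

def extract_video_features(frames):
    """frames: (T, 3, 112, 112) float tensor, already scaled, cropped and normalized."""
    clip = frames[::16][:8]                       # 1 frame out of every 16, first 8 kept
    clip = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T', 112, 112)
    with torch.no_grad():
        return backbone(clip)                     # (1, 512) pooled feature
```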
In step S5, before the extracted text features, audio features and video features are input into the Transformer model, the features of the different modalities are respectively passed through a fully connected layer and an LSTM layer to ensure that each modality has the same vector dimension, so that the multi-modal attention joint coding in the subsequent steps can be implemented.
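A minimal sketch of this unification step is shown below; the hidden size of 1024 follows the Transformer hidden size mentioned later in this embodiment, while the per-modality input dimensions are assumptions tied to the feature extractors sketched above.

```python
# Hedged sketch of step S5: one fully connected layer plus one LSTM per
# modality maps every feature sequence to a common dimension d_model.
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, in_dim, d_model=1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):                 # x: (B, T, in_dim)
        out, _ = self.lstm(self.fc(x))    # (B, T, d_model)
        return out

text_proj  = ModalityProjector(in_dim=768)   # BERT features (assumed dim)
audio_proj = ModalityProjector(in_dim=64)    # Mel bands (assumed dim)
video_proj = ModalityProjector(in_dim=512)   # R(2+1)D features (assumed dim)
```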
In the method provided by this embodiment, the video to be analyzed includes three modalities, namely a text modality, an audio modality, and a video image modality.
Step S6, in the process of multi-mode attention joint coding, the text mode is used as a main mode, and modulation coding is carried out on an audio mode and a video picture mode;
referring to fig. 5, the Transformer model includes three joint coding units, each of which includes a Multi-Head attention module Multi-Head, a first residual error connection normalization module Add & Norm, a Feed-Forward module Feed-Forward, a second residual error connection normalization module Add & Norm, a soft attention module soft-attention, and a third residual error connection normalization module Add & Norm connected in sequence;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
s61, inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
s62, inputting the text features and the audio features into a second combined coding unit to obtain audio representation features;
and S63, inputting the text features and the video picture features into a third combined coding unit to obtain video picture representation features.
The Transformer model provided in this embodiment includes an interactive attention module composed of the Multi-Head attention module Multi-Head, the first residual connection normalization module Add & Norm, the Feed-Forward module Feed-Forward and the second residual connection normalization module Add & Norm connected in sequence, and introduces a soft attention module so as to map the features output by the interactive attention module into a new characterization space.
Referring to fig. 6, the working principle of multi-modal attention joint coding is described by taking the attention joint coding of two modalities as an example. First, assume there are two modalities X and Y, where X is the main modality used to modulation-encode the information in the Y modality. The attention joint coding process introduces the idea of the Guide-attention unit into the interactive attention module and replaces the source of the K matrix and the V matrix of the multi-head attention mechanism in the Transformer model from Y with X. After the substitution, similarly to the Transformer, the computation of QK^T yields an attention matrix, which can be understood as a similarity matrix between the row vectors of the features in the X modality and the Y modality, mining the feature correlation between the X modality and the Y modality. After this similarity matrix is obtained, it is dot-multiplied with X. As in the Transformer model, residual connections and LayerNorm are introduced. The whole calculation process can be expressed as
f=LayerNorm(y+MA(y,x,x));
where LayerNorm denotes layer normalization, y denotes the input features of the Y modality, and MA(y, x, x) denotes multi-head attention with query y and keys/values x.
The LayerNorm operation is used to make the change of the loss function in the training process more stable, thereby realizing better joint coding effect.
In order to ensure that the matrix dot multiplication and the residual connection can be performed smoothly, the feature dimensions of X and Y should be the same, expressed by the following formulas:
X ∈ R^(N×K);
Y ∈ R^(N×K);
wherein R^(N×K) represents the linear space containing all matrices with N rows and K columns.
In some embodiments, the soft attention module in the Transformer model comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is represented by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
In some embodiments, the soft attention layer performs a soft attention operation on the features of the input, which is expressed by the following formula:
β_i = softmax(M · w_i^T);
m_i = soft-attention_i(M) = Σ_j β_i,j · (M_j · W_m);
wherein W_m is a transformation matrix of shape 2k × k that represents the network weights of a fully connected layer in the neural network and is a shared parameter of all soft attention layers, softmax is the normalized exponential function, soft-attention is the soft attention operation performed by each soft attention layer, w_i is the i-th 1 × 2k weight vector, β_i is the attention weight vector of the i-th soft attention layer, m_i is the output of the i-th soft attention layer, M is the input feature matrix, and M_j is the j-th row of the input feature matrix.
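Following the reconstruction above, a sketch of the soft attention module is given below; because the published formulas are rendered only as images, the exact scoring form used here is an assumption.

```python
# Hedged sketch of the soft attention module: each layer i scores the rows of
# the input M with its own 1x2k vector w_i, normalises the scores with
# softmax, and sums the rows projected by the shared 2k->k matrix W_m; the
# G_m layer outputs are stacked into S_M.
import torch
import torch.nn as nn

class SoftAttentionModule(nn.Module):
    def __init__(self, k, num_layers):
        super().__init__()
        self.W_m = nn.Linear(2 * k, k, bias=False)              # shared 2k x k projection
        self.w = nn.Parameter(torch.randn(num_layers, 2 * k))   # one 1 x 2k vector per layer

    def forward(self, M):                                  # M: (B, N, 2k)
        scores = torch.einsum("bnd,gd->bgn", M, self.w)    # (B, G_m, N)
        beta = torch.softmax(scores, dim=-1)               # soft attention weights
        m = torch.einsum("bgn,bnk->bgk", beta, self.W_m(M))
        return m                                           # S_M: (B, G_m, k)
```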
In step S7, based on the multi-layer perceptron classification model, weighting the text characterization feature, the audio characterization feature, and the video image characterization feature to obtain a classification result of the video to be analyzed, including:
s71, inputting the text characterization feature, the audio characterization feature and the video picture characterization feature to a first full-link layer, a RELU activation function layer and a second full-link layer respectively, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
s72, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features based on a soft-attention mechanism;
s73, respectively weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the soft attention vector to obtain a final text characterization feature, a final audio characterization feature and a final video picture characterization feature;
and S74, based on a fusion weight matrix obtained in advance, performing layer normalization weighting fusion on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
After multi-modal joint coding, the modal characterization features output by the network module are obtained, including the text characterization feature, the audio characterization feature and the video picture characterization feature, whose feature dimensions are kept the same, expressed by the following formulas:
L = (l_1, ..., l_N) ∈ R^(N×K);
A = (a_1, ..., a_N) ∈ R^(N×K);
V = (v_1, ..., v_N) ∈ R^(N×K);
wherein L, A and V represent the text characterization features, the audio characterization features and the video picture characterization features, l_m represents the m-th text feature, a_m the m-th audio feature, v_m the m-th video picture feature, and R^(N×K) is the linear space formed by all matrices with N rows and K columns.
After multi-modal joint coding, the text characterization feature, the audio characterization feature and the video picture characterization feature already contain a large amount of representation with emotional attitude tendency together with the corresponding attention information. Dimension reduction is required on these characterizations to output the final prediction result of the neural network. Specifically, in step S71, a random inactivation Dropout mechanism is introduced, which can prevent overfitting during neural network training.
In steps S72 and S73, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization feature, the intermediate audio characterization feature and the intermediate video picture characterization feature, and the text characterization feature, the audio characterization feature and the video picture characterization feature are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
When the classification model carries out weighting fusion on the characterization features containing the emotional tendency, a soft-attention mechanism is introduced, so that key information can be better extracted.
In step S74, based on a fusion weight matrix obtained in advance, layer normalization weighted fusion is performed on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, which are obtained by network learning and satisfy W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
As a preferred implementation, the multi-layer perceptron model adopts a binary cross entropy loss function.
As a preferred embodiment, the fused result is input into the third fully connected layer, whose output vector dimension is 1, in order to map the fused features into the final output prediction vector.
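A hedged sketch of this multi-layer perceptron classification head is given below; the hidden sizes, the fused dimension d_z and the way the per-element soft attention scores are produced are assumptions of the sketch.

```python
# Hedged sketch of steps S71-S74: FC-ReLU-Dropout-FC produces per-element
# scores, softmax over them gives the soft attention vector used to weight
# the characterization features, and the weighted features are fused with
# modality weights W_L, W_A, W_V under LayerNorm before a final 1-dim FC.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, k=1024, hidden=256, d_z=256, dropout=0.1):
        super().__init__()
        def branch():  # FC -> ReLU -> Dropout -> FC, one scalar score per element
            return nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(hidden, 1))
        self.score_l, self.score_a, self.score_v = branch(), branch(), branch()
        self.W_L = nn.Linear(k, d_z, bias=False)
        self.W_A = nn.Linear(k, d_z, bias=False)
        self.W_V = nn.Linear(k, d_z, bias=False)
        self.norm = nn.LayerNorm(d_z)
        self.out = nn.Linear(d_z, 1)           # third fully connected layer, 1-dim output

    @staticmethod
    def _weight(feat, scores):                 # feat: (B, N, k), scores: (B, N, 1)
        alpha = torch.softmax(scores, dim=1)   # soft attention vector
        return (alpha * feat).sum(dim=1)       # weighted final characterization (B, k)

    def forward(self, L, A, V):                # characterization features, each (B, N, k)
        L_o = self._weight(L, self.score_l(L))
        A_o = self._weight(A, self.score_a(A))
        V_o = self._weight(V, self.score_v(V))
        fused = self.norm(self.W_L(L_o) + self.W_A(A_o) + self.W_V(V_o))
        return self.out(fused)                 # logit for the binary cross entropy loss
```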
In a specific application scenario, the CPU is a 6-core Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz, the memory is 16GB 2400MHz DDR4, the graphics card is an Nvidia GTX 1080Ti, and the operating system is Ubuntu 20.04.1 LTS. The code is mainly implemented in Python, version 3.6.13; the neural network part mainly relies on PyTorch 1.2.0, and the required third-party libraries are managed with Anaconda.
In terms of feature extraction, the three-dimensional sequence features extracted from the video are flattened to two dimensions, down-sampling is adopted, and 1 frame out of every 16 video frames is selected for feature extraction. To ensure that the features in the video are kept aligned, the first 8 feature frames are intercepted as the video feature input. For the text features, Hugging Face is used to load the pre-trained Chinese-BERT-wwm, and during text feature extraction, embedded vectors representing the beginning and the end (<SOS> and <EOS>) are added at the beginning and the end of the text respectively.
Similar to the video features, the audio features also undergo down-sampling. The extracted features of the different modalities are saved as .pkl files with the Python third-party library pickle, so that the neural network computation no longer contains a modality feature extraction part and only the forward and backward propagation of the network needs to be considered, ensuring efficient neural network training. In the network implementation, the number of attention heads is set to 4, the hidden layer size of the Transformer is 1024, and there are 6 Transformer modules. The input size of the fully connected layer is 1024. The total number of parameters is about 140M.
In the training stage of the model, an Adam optimizer is used to optimize the network; Adam is currently a popular optimizer in deep learning and has the advantages of simplicity, efficiency and suitability for large-scale data. The initial learning rate of the neural network is set to 1e-3, and if the accuracy on the validation set does not rise, the learning rate is further decayed by a factor of 0.2. The Dropout node retention probability is set to 0.1; Dropout regularizes the model and reduces overfitting in deep learning. Limited by the video memory of the graphics card, the batch_size is set to 16.
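The training configuration above can be sketched as follows; the model interface and the data loader fields are hypothetical names used only for illustration.

```python
# Hedged sketch of the training setup: Adam at 1e-3, learning-rate decay by a
# factor of 0.2 when validation accuracy stops rising, binary cross entropy on
# the 1-dim output, batch size 16 in the data loaders.
import torch
import torch.nn as nn

def evaluate_accuracy(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for text, audio, video, label in loader:
            pred = (torch.sigmoid(model(text, audio, video).squeeze(-1)) > 0.5).float()
            correct += (pred == label.float()).sum().item()
            total += label.numel()
    return correct / total

def train(model, train_loader, val_loader, epochs=30):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.2)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        model.train()
        for text, audio, video, label in train_loader:      # loaders built with batch_size=16
            optimizer.zero_grad()
            logits = model(text, audio, video).squeeze(-1)   # hypothetical model signature
            loss = criterion(logits, label.float())
            loss.backward()
            optimizer.step()
        scheduler.step(evaluate_accuracy(model, val_loader))
```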
In some embodiments, there is provided a Transformer-based inter-modality joint encoding apparatus, including:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting the text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
In some embodiments, an electronic device is provided, which includes a processor and a storage device, the storage device having a plurality of instructions stored therein, and the processor is configured to read the plurality of instructions from the storage device and execute the method.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A Transformer-based inter-modal joint coding method, comprising:
acquiring a video to be analyzed containing multi-modal information;
extracting text features of the video to be analyzed;
extracting the audio features of the video to be analyzed;
extracting video picture characteristics of the video to be analyzed;
unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on the full connection layer and the LSTM layer;
based on a Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features;
and weighting the text characterization features, the audio characterization features and the video image characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed.
2. The method of claim 1, wherein the text features are extracted based on a pre-trained Chinese-BERT-wwm model;
the Chinese-BERT-wwm model is loaded through the Hugging Face Transformers library;
the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process.
3. The method as claimed in claim 1, wherein the R(2+1)D model is adopted to extract the video picture features of the video to be analyzed;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps:
extracting three-dimensional sequence features in the video to be analyzed;
flattening the three-dimensional sequence features into two-dimensional sequence features;
and performing down-sampling on the two-dimensional sequence features, selecting 1 frame out of every 16 frames for feature extraction, and selecting the network output of the R(2+1)D model at the spatio-temporal pooling layer as the extracted video picture features.
4. The method according to claim 1, wherein in the multi-modal joint attention coding process, the text mode is used as a main mode, and an audio mode and a video picture mode are modulation-coded;
the Transformer model comprises three joint coding units, wherein each joint coding unit comprises a Multi-Head attention module, a first residual connection standardization module, a feedforward module Feed-Forward, a second residual connection standardization module, a soft attention module soft-attention and a third residual connection standardization module which are sequentially connected;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
inputting the text features and the audio features into a second combined encoding unit to obtain audio representation features;
and inputting the text features and the video picture features into a third joint coding unit to obtain video picture representation features.
5. The method of claim 4, wherein the soft attention module comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is expressed by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
6. The method according to claim 1 or 5, wherein weighting the text characterization feature, the audio characterization feature and the video image characterization feature based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed comprises:
inputting the text characterization feature, the audio characterization feature and the video picture characterization feature into a first full-link layer, a RELU activation function layer and a second full-link layer respectively, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
based on a soft-attention mechanism, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features;
respectively weighting the text characterization features, the audio characterization features and the video picture characterization features based on the soft attention vectors to obtain final text characterization features, final audio characterization features and final video picture characterization features;
and performing layer normalization weighting fusion on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature based on a fusion weight matrix obtained in advance, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
7. The method according to claim 6, wherein, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization feature, the intermediate audio characterization feature and the intermediate video picture characterization feature, and the text characterization feature, the audio characterization feature and the video picture characterization feature are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
8. The method according to claim 6, wherein the final text characterization feature, the final audio characterization feature and the final video picture characterization feature are subjected to layer normalization weighted fusion based on a fusion weight matrix obtained in advance, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, satisfying W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
9. An inter-modal joint encoding apparatus based on a Transformer, comprising:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
10. An electronic device comprising a processor and a memory means, wherein a plurality of instructions are stored in the memory means, and wherein the processor is configured to read the plurality of instructions from the memory means and to perform the method according to any one of claims 1 to 8.
CN202211335121.2A 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer Pending CN115937641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211335121.2A CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211335121.2A CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Publications (1)

Publication Number Publication Date
CN115937641A true CN115937641A (en) 2023-04-07

Family

ID=86553123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211335121.2A Pending CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Country Status (1)

Country Link
CN (1) CN115937641A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701568A (en) * 2023-05-09 2023-09-05 湖南工商大学 Short video emotion classification method and system based on 3D convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination