CN115937641A - Method, device and equipment for intermodal joint coding based on Transformer - Google Patents

Method, device and equipment for intermodal joint coding based on Transformer

Info

Publication number
CN115937641A
CN115937641A
Authority
CN
China
Prior art keywords
features
characterization
text
feature
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211335121.2A
Other languages
Chinese (zh)
Inventor
刘绍辉
米亚纯
郭富博
姜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211335121.2A
Publication of CN115937641A
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a Transformer-based inter-modal joint coding method, device and equipment, relates to the technical field of multi-modal fusion, and solves the technical problem of how to fuse information among modalities to achieve a better emotion classification effect. The method comprises the following steps: acquiring a video to be analyzed containing multi-modal information; extracting text features, audio features and video picture features of the video to be analyzed; unifying the text features, the audio features and the video picture features into the same dimension based on a fully connected layer and an LSTM layer; performing multi-modal attention joint coding on the text features, the audio features and the video picture features based on a Transformer model; and weighting the resulting characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed. The method can adopt a Transformer model to carry out joint attention coding on different modalities simultaneously and achieves a better classification effect.

Description

Method, device and equipment for intermodal joint coding based on Transformer
Technical Field
The invention relates to the technical field of multi-modal fusion.
Background
Much research has developed around the multi-modal analysis of videos, and with the development of deep learning in recent years, the related research on multi-modal video analysis has advanced greatly. A video usually carries three modalities: text, audio and video pictures. The text modality mainly includes the text accompanying the video as well as the subtitles and dialogues carried by the video frames; the audio modality is mainly the auditory information of the video, including dialogue and background music; the video modality is mainly the visual information in the video.
Existing multi-modal emotion analysis is mainly based on deep learning techniques and models both the information within each modality and the interactive information between modalities. Intra-modality modeling refers to modeling the information of a particular modality independently of the other modalities. Inter-modality modeling refers to modeling the information between different modalities, covering both synchronized and unsynchronized information. For the task of multi-modal video analysis, a major challenge is to seek feature representations within a modality and feature fusion between different modalities. Referring to fig. 1, the basis of multi-modal video analysis is to extract visual, auditory and text features with appropriate feature extraction mechanisms and to fuse the extracted features of the three different modalities for subsequent analysis.
For multi-modal video tasks, a key point is to explore the fusion of feature representations between different modalities. Some work relies mainly on inter-modal information fusion at an early stage, that is, fusion on the feature dimensions of the different modalities: the features of different modalities are simply concatenated and used as the input of a prediction model to predict the emotional attitude, and the prediction models of these methods are mostly widely used traditional methods such as hidden Markov models, support vector machines or conditional random fields. Other work performs inter-modal information fusion at a later stage: a model is designed and trained independently for each modality, and the emotional attitude is predicted by voting or weighting. These methods usually train the models separately for the different modalities, do not consider the interaction between modalities during training, and predict the final result by combining the prediction results of the multiple models.
In multi-modal emotion understanding tasks based on neural networks, the models built for multi-modal information and its interactions generally have low interpretability, and most neural-network-based models treat the information fusion between modalities as a black box. Despite the success of many models on some tasks, researchers are still trying to understand these models and apply them safely.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a method, a device and equipment for inter-modal joint coding based on a Transformer. An interactive attention mechanism is introduced so that a Transformer model can perform joint attention coding on different modalities simultaneously, thereby achieving a better classification effect.
A Transformer-based inter-modal joint coding method comprises the following steps:
acquiring a video to be analyzed containing multi-modal information;
extracting text features of the video to be analyzed;
extracting the audio features of the video to be analyzed;
extracting video picture characteristics of the video to be analyzed;
unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on a full connection layer and an LSTM layer;
based on a Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features;
and weighting the text characterization features, the audio characterization features and the video image characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed.
Further, extracting the text features based on a pre-trained Chinese-BERT-wwm model;
the Chinese-BERT-wwm model is loaded through the Hugging Face Transformers library;
the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process.
Further, extracting the video picture features of the video to be analyzed by adopting an R(2+1)D model;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps of:
extracting three-dimensional sequence features in the video to be analyzed;
flattening the three-dimensional sequence features into two-dimensional sequence features;
and performing down-sampling on the two-dimensional sequence features, selecting 1 frame out of every 16 frames for feature extraction, and selecting the network output of the R(2+1)D model at the spatio-temporal pooling layer as the extracted video picture features.
Further, in the process of multi-mode attention joint coding, the text mode is used as a main mode, and modulation coding is carried out on an audio mode and a video picture mode;
the Transformer model comprises three joint coding units, wherein each joint coding unit comprises a Multi-Head attention module, a first residual connection standardization module, a feedforward module Feed-Forward, a second residual connection standardization module, a soft attention module soft-attention and a third residual connection standardization module which are sequentially connected;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
inputting the text features and the audio features into a second combined encoding unit to obtain audio representation features;
and inputting the text characteristics and the video picture characteristics into a third joint coding unit to obtain video picture characterization characteristics.
Further, the soft attention module comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is expressed by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
Further, based on a multi-layer perceptron classification model, weighting the text characterization feature, the audio characterization feature and the video image characterization feature to obtain a classification result of the video to be analyzed, including:
respectively inputting the text characterization feature, the audio characterization feature and the video picture characterization feature into a first full-link layer, a RELU activation function layer and a second full-link layer, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
based on a soft-attention mechanism, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features;
respectively weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the soft attention vector to obtain a final text characterization feature, a final audio characterization feature and a final video picture characterization feature;
and performing layer normalization weighting fusion on the final text characterization features, the final audio characterization features and the final video picture characterization features based on a fusion weight matrix obtained in advance, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
Further, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features, and the text characterization features, the audio characterization features and the video picture characterization features are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
Further, based on a fusion weight matrix obtained in advance, layer normalization weighted fusion is carried out on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, satisfying W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
A Transformer-based inter-modal joint coding apparatus, comprising:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting the text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization features, the audio characterization features and the video image characterization features based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
An electronic device comprises a processor and a storage device, wherein a plurality of instructions are stored in the storage device, and the processor is used for reading the plurality of instructions in the storage device and executing the method.
The inter-modality joint coding method, device and equipment based on the Transformer at least have the following beneficial effects:
(1) A neural network model is built based on a Transformer, and an interactive attention mechanism is adopted to jointly encode the features of the various modalities, so that information fusion between modalities is carried out through the attention mechanism and the neural network can obtain, across modalities, the information related to the emotional attitude to be analyzed. With the interactive attention module introduced, the Transformer model can perform joint attention coding on different modalities at the same time, without sequentially feeding the features of the different modalities into the Transformer model; the strong characterization capability of the Transformer is used to mine the attention information in the features while modeling the correlation between specific modalities, thereby achieving a better classification effect;
(2) When the classification model carries out weighted fusion on the characterization features containing the emotional tendency, a random inactivation Dropout mechanism is introduced, so that overfitting in the neural network training process can be prevented; a soft-attention mechanism is introduced, so that key information can be extracted better, and a better classification effect is realized.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of a transform-based inter-modal joint coding method according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a text feature extraction model in the method provided by the present invention;
FIG. 3 is a schematic diagram illustrating the structural comparison between the C3D model and the R(2+1)D model provided by the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of the R(2+1)D model provided by the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a three-mode joint coding Transformer model provided in the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a two-mode jointly encoded Transformer model according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Referring to fig. 1, in some embodiments, there is provided a Transformer-based inter-modality joint coding method, including:
s1, acquiring a video to be analyzed containing multi-modal information;
s2, extracting text features of the video to be analyzed;
s3, extracting audio features of the video to be analyzed;
s4, extracting video image characteristics of the video to be analyzed;
s5, unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on the full connection layer and the LSTM layer;
s6, performing multi-mode attention joint coding on the text characteristics, the audio characteristics and the video picture characteristics based on a transform model to obtain text characterization characteristics, audio characterization characteristics and video picture characterization characteristics;
and S7, weighting the text characterization features, the audio characterization features and the video image characterization features based on the multilayer perceptron classification model to obtain a classification result of the video to be analyzed.
In some embodiments, in step S2, the text features are extracted based on a pre-trained Chinese-BERT-wwm model, and the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process. The full word mask mechanism means that if the mask is applied to a sub-word part of a segmented word in a training sample, the other sub-words belonging to the same word are also masked.
Referring to fig. 2, the BERT model is a currently popular Transformer-based pre-trained language model; BERT constructs a bidirectional language model through the encoder structure of the Transformer model. In FIG. 2, TokN represents the N-th word in the sentence, Class Label represents the text feature classification label, E_N denotes the N-th word embedding, and T_N denotes the feature vector of the N-th word output by the model. The advantage of the BERT model is that context information can be learned directly. The Chinese-BERT-wwm model is a BERT pre-training model specialized for Chinese. In the original BERT model, a mask mechanism is applied in the training stage and each sub-word in a training sample is randomly masked; the BERT-wwm model adopts a full word mask mechanism instead, that is, if the mask is applied to part of a sub-word, the other sub-words belonging to the same segmented word are also masked. Chinese-BERT-wwm is a powerful language model pre-trained on a large-scale Chinese corpus and can accurately extract the semantic information in text.
As a preferred implementation, the Hugging Face Transformers library is used to load the Chinese-BERT-wwm pre-trained model. Hugging Face Transformers provides an API that makes it convenient to quickly download and use pre-trained models, so the pre-trained model can easily be used for the feature processing of the text, improving working efficiency.
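As an illustration only, a minimal sketch of this text-feature step is given below; the checkpoint ID "hfl/chinese-bert-wwm" and the mean-pooling of token states are assumptions of the sketch, not requirements of the invention.

```python
# Hedged sketch of step S2: load a Chinese-BERT-wwm checkpoint through the
# Hugging Face Transformers library and pool the last hidden states into one
# feature vector per sentence. Checkpoint ID and pooling choice are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")  # assumed checkpoint ID
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm").eval()

def extract_text_features(sentences):
    """Return one 768-dim feature vector per sentence (mean over token states)."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state              # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, 768)
```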
In step S3, the audio features are extracted as Mel-spectrum features based on the audio signal processing library librosa. The audio information in a video contains a large amount of human speech and music with many non-verbal expressions (laughter, sighs, etc.), which, together with the linguistically meaningful information, has a great influence on the analysis of the video. In the field of speech processing, features such as F0 and MFCCs are widely used, but such higher-level features tend to discard large amounts of information. To overcome this problem, this embodiment uses low-level auditory features as the audio feature input of the subsequent multi-modal analysis and selects Mel-spectrum (Mel-spectra) features for extraction. Mel-spectrum features are well suited to speech recognition tasks and achieve good results, so the Mel-spectrum serves as a good auditory representation in this embodiment. As a preferred implementation, a down-sampling operation is also adopted in the extraction process to reduce the feature dimension along the time axis, thereby reducing complexity, speeding up inference and improving the feature extraction efficiency of the model.
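A minimal sketch of this Mel-spectrum extraction with librosa follows; the sampling rate, number of Mel bands, hop length and the temporal down-sampling stride are illustrative assumptions.

```python
# Hedged sketch of step S3: Mel-spectrum features via librosa with simple
# temporal down-sampling. All numeric settings are illustrative assumptions.
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mels=64, hop_length=512, stride=4):
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # (n_mels, T)
    return mel_db[:, ::stride].T                    # down-sample in time -> (T', n_mels)
```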
Referring to FIG. 3, a structural comparison of the C3D model (a 3D version of the VGG-style model) and the R(2+1)D model (a 3D variant of ResNet) is provided. This embodiment adopts the R(2+1)D network, a 3D convolution model, as the network model for video picture feature extraction. Compared with two-dimensional convolution kernels, three-dimensional convolution kernels inherently suffer from high computation cost and a tendency to overfit. To reduce complexity, the R(2+1)D network factorizes the three-dimensional convolution of the original C3D network, decomposing the three-dimensional convolution operation into a two-dimensional convolution in space followed by a one-dimensional convolution in time. This separates the temporal and spatial information of the video frame content, makes optimization easier and yields a smaller loss function.
Referring to FIG. 4, the model structure of R(2+1)D is provided, and this embodiment employs a pre-trained R(2+1)D-152 network, wherein clip represents a video clip, (2+1)D conv represents the combination of a 2D spatial convolution and a 1D temporal convolution, space-time pool represents the spatio-temporal pooling layer, and fc represents the fully connected layer. The model is pre-trained on the Sports-1M data set, which comprises one million videos collected from the YouTube website covering 487 sports-related categories, with 1000-3000 videos in each category. During pre-training, the model takes 32 RGB video frames at a time; the frames are scaled to 128 × 171 and a 112 × 112 patch is then randomly cropped from them. The network output of the spatio-temporal pooling layer, rather than the output of the fully connected layer, is selected as the extracted feature, so that the model contains only convolutional layers, the temporal length of the network input is not restricted, and the model is more flexible.
In step S4, the video picture characteristics of the video to be analyzed are extracted by adopting the R(2+1)D model;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps:
s41, extracting three-dimensional sequence features in the video to be analyzed;
s42, flattening the three-dimensional sequence features into two-dimensional sequence features;
s43, scaling the size and randomly cutting the two-dimensional sequence features;
s44, down-sampling the two-dimensional sequence features, selecting 1 frame for extracting features from every 16 frames, and selecting the network output features of the space-time pooling layer as the extracted video picture features. Therefore, the complexity of the video characteristics obtained by sampling can be reduced, and the speed of processing the video by the model is increased under the condition that the performance of the model is not reduced.
The first 8 video frames with extracted features are intercepted as the video feature input, so as to ensure that the features in the video are kept aligned.
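A hedged sketch of steps S41-S44 follows. The patent uses an R(2+1)D-152 network pre-trained on Sports-1M; since torchvision only ships the smaller Kinetics-pretrained r2plus1d_18, that model is used here as a stand-in, with the fully connected layer replaced by an identity so that the spatio-temporal pooling output is returned.

```python
# Hedged sketch of the video-picture feature step: keep 1 of every 16 frames,
# truncate to the first 8 kept frames, and read out the spatio-temporal
# pooling feature of an R(2+1)D backbone (torchvision's r2plus1d_18 stands in
# for the R(2+1)D-152 / Sports-1M network described in the patent).
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

backbone = r2plus1d_18(pretrained=True)
backbone.fc = nn.Identity()   # stop at the spatio-temporal pooling output
backbone.eval()

def extract_video_features(frames):
    """frames: (T, 3, 112, 112) float tensor, already scaled, cropped and normalized."""
    clip = frames[::16][:8]                       # 1 frame out of every 16, first 8 kept
    clip = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T', 112, 112)
    with torch.no_grad():
        return backbone(clip)                     # (1, 512) pooled feature
```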
In step S5, before the extracted text features, audio features and video features are input into the Transformer model, the features of the different modalities are respectively passed through a fully connected layer and an LSTM layer to ensure that each modality has the same vector dimension, so that the multi-modal attention joint coding in the subsequent steps can be implemented.
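A minimal sketch of this unification step is shown below; the hidden size of 1024 follows the Transformer hidden size mentioned later in this embodiment, while the per-modality input dimensions are assumptions tied to the feature extractors sketched above.

```python
# Hedged sketch of step S5: one fully connected layer plus one LSTM per
# modality maps every feature sequence to a common dimension d_model.
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, in_dim, d_model=1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):                 # x: (B, T, in_dim)
        out, _ = self.lstm(self.fc(x))    # (B, T, d_model)
        return out

text_proj  = ModalityProjector(in_dim=768)   # BERT features (assumed dim)
audio_proj = ModalityProjector(in_dim=64)    # Mel bands (assumed dim)
video_proj = ModalityProjector(in_dim=512)   # R(2+1)D features (assumed dim)
```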
In the method provided by this embodiment, the video to be analyzed includes three modalities, namely a text modality, an audio modality, and a video image modality.
Step S6, in the process of multi-mode attention joint coding, the text mode is used as a main mode, and modulation coding is carried out on an audio mode and a video picture mode;
referring to fig. 5, the Transformer model includes three joint coding units, each of which includes a Multi-Head attention module Multi-Head, a first residual error connection normalization module Add & Norm, a Feed-Forward module Feed-Forward, a second residual error connection normalization module Add & Norm, a soft attention module soft-attention, and a third residual error connection normalization module Add & Norm connected in sequence;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
s61, inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
s62, inputting the text features and the audio features into a second combined coding unit to obtain audio representation features;
and S63, inputting the text features and the video picture features into a third combined coding unit to obtain video picture representation features.
The Transformer model provided in this embodiment includes an interactive attention module composed of the Multi-Head attention module Multi-Head, the first residual connection normalization module Add & Norm, the Feed-Forward module Feed-Forward and the second residual connection normalization module Add & Norm connected in sequence, and introduces a soft attention module so as to map the features output by the interactive attention module into a new characterization space.
Referring to fig. 6, the working principle of multi-modal attention joint coding is described by taking the attention joint coding of two modalities as an example. First, assume there are two modalities X and Y, where X is the main modality used to modulation-encode the information in the Y modality. The attention joint coding process introduces the idea of the Guide-attention unit into the interactive attention module and replaces the source of the K matrix and the V matrix of the multi-head attention mechanism in the Transformer model from Y with X. After the substitution, similarly to the Transformer, the computation of QK^T yields an attention matrix, which can be understood as a similarity matrix between the row vectors of the features in the X modality and the Y modality, mining the feature correlation between the X modality and the Y modality. After this similarity matrix is obtained, it is dot-multiplied with X. As in the Transformer model, residual connections and LayerNorm are introduced. The whole calculation process can be expressed as
f=LayerNorm(y+MA(y,x,x));
where LayerNorm denotes layer normalization, y denotes the input features of the Y modality, and MA(y, x, x) denotes multi-head attention with query y and keys/values x.
The LayerNorm operation is used to make the change of the loss function in the training process more stable, thereby realizing better joint coding effect.
In order to ensure that the matrix dot multiplication and the residual connection can be performed smoothly, the feature dimensions of X and Y should be the same, expressed by the following formulas:
X ∈ R^(N×K);
Y ∈ R^(N×K);
wherein R^(N×K) represents the linear space containing all matrices with N rows and K columns.
In some embodiments, the soft attention module in the Transformer model comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is represented by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
In some embodiments, the soft attention layer performs a soft attention operation on the features of the input, which is expressed by the following formula:
β_i = softmax(M · w_i^T);
m_i = soft-attention_i(M) = Σ_j β_i,j · (M_j · W_m);
wherein W_m is a transformation matrix of shape 2k × k that represents the network weights of a fully connected layer in the neural network and is a shared parameter of all soft attention layers, softmax is the normalized exponential function, soft-attention is the soft attention operation performed by each soft attention layer, w_i is the i-th 1 × 2k weight vector, β_i is the attention weight vector of the i-th soft attention layer, m_i is the output of the i-th soft attention layer, M is the input feature matrix, and M_j is the j-th row of the input feature matrix.
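Following the reconstruction above, a sketch of the soft attention module is given below; because the published formulas are rendered only as images, the exact scoring form used here is an assumption.

```python
# Hedged sketch of the soft attention module: each layer i scores the rows of
# the input M with its own 1x2k vector w_i, normalises the scores with
# softmax, and sums the rows projected by the shared 2k->k matrix W_m; the
# G_m layer outputs are stacked into S_M.
import torch
import torch.nn as nn

class SoftAttentionModule(nn.Module):
    def __init__(self, k, num_layers):
        super().__init__()
        self.W_m = nn.Linear(2 * k, k, bias=False)              # shared 2k x k projection
        self.w = nn.Parameter(torch.randn(num_layers, 2 * k))   # one 1 x 2k vector per layer

    def forward(self, M):                                  # M: (B, N, 2k)
        scores = torch.einsum("bnd,gd->bgn", M, self.w)    # (B, G_m, N)
        beta = torch.softmax(scores, dim=-1)               # soft attention weights
        m = torch.einsum("bgn,bnk->bgk", beta, self.W_m(M))
        return m                                           # S_M: (B, G_m, k)
```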
In step S7, based on the multi-layer perceptron classification model, weighting the text characterization feature, the audio characterization feature, and the video image characterization feature to obtain a classification result of the video to be analyzed, including:
s71, inputting the text characterization feature, the audio characterization feature and the video picture characterization feature to a first full-link layer, a RELU activation function layer and a second full-link layer respectively, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
s72, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features based on a soft-attention mechanism;
s73, respectively weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the soft attention vector to obtain a final text characterization feature, a final audio characterization feature and a final video picture characterization feature;
and S74, based on a fusion weight matrix obtained in advance, performing layer normalization weighting fusion on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
After multi-modal joint coding, the modal characterization features output by the network module are obtained, including the text characterization feature, the audio characterization feature and the video picture characterization feature, whose feature dimensions are kept the same, expressed by the following formulas:
L = (l_1, ..., l_N) ∈ R^(N×K);
A = (a_1, ..., a_N) ∈ R^(N×K);
V = (v_1, ..., v_N) ∈ R^(N×K);
wherein L, A and V represent the text characterization features, the audio characterization features and the video picture characterization features, l_m represents the m-th text feature, a_m the m-th audio feature, v_m the m-th video picture feature, and R^(N×K) is the linear space formed by all matrices with N rows and K columns.
After multi-modal joint coding, the text characterization feature, the audio characterization feature and the video picture characterization feature already contain a large amount of representation with emotional attitude tendency together with the corresponding attention information. Dimension reduction is required on these characterizations to output the final prediction result of the neural network. Specifically, in step S71, a random inactivation Dropout mechanism is introduced, which can prevent overfitting during neural network training.
In steps S72 and S73, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization feature, the intermediate audio characterization feature and the intermediate video picture characterization feature, and the text characterization feature, the audio characterization feature and the video picture characterization feature are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
When the classification model carries out weighting fusion on the characterization features containing the emotional tendency, a soft-attention mechanism is introduced, so that key information can be better extracted.
In step S74, based on a fusion weight matrix obtained in advance, layer normalization weighted fusion is performed on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, which are obtained by network learning and satisfy W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
As a preferred implementation, the multi-layer perceptron model adopts a binary cross entropy loss function.
As a preferred embodiment, the fused result is input into the third fully connected layer, whose output vector dimension is 1, in order to map the fused features into the final output prediction vector.
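A hedged sketch of this multi-layer perceptron classification head is given below; the hidden sizes, the fused dimension d_z and the way the per-element soft attention scores are produced are assumptions of the sketch.

```python
# Hedged sketch of steps S71-S74: FC-ReLU-Dropout-FC produces per-element
# scores, softmax over them gives the soft attention vector used to weight
# the characterization features, and the weighted features are fused with
# modality weights W_L, W_A, W_V under LayerNorm before a final 1-dim FC.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, k=1024, hidden=256, d_z=256, dropout=0.1):
        super().__init__()
        def branch():  # FC -> ReLU -> Dropout -> FC, one scalar score per element
            return nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(hidden, 1))
        self.score_l, self.score_a, self.score_v = branch(), branch(), branch()
        self.W_L = nn.Linear(k, d_z, bias=False)
        self.W_A = nn.Linear(k, d_z, bias=False)
        self.W_V = nn.Linear(k, d_z, bias=False)
        self.norm = nn.LayerNorm(d_z)
        self.out = nn.Linear(d_z, 1)           # third fully connected layer, 1-dim output

    @staticmethod
    def _weight(feat, scores):                 # feat: (B, N, k), scores: (B, N, 1)
        alpha = torch.softmax(scores, dim=1)   # soft attention vector
        return (alpha * feat).sum(dim=1)       # weighted final characterization (B, k)

    def forward(self, L, A, V):                # characterization features, each (B, N, k)
        L_o = self._weight(L, self.score_l(L))
        A_o = self._weight(A, self.score_a(A))
        V_o = self._weight(V, self.score_v(V))
        fused = self.norm(self.W_L(L_o) + self.W_A(A_o) + self.W_V(V_o))
        return self.out(fused)                 # logit for the binary cross entropy loss
```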
In a specific application scenario, the CPU is a 6-core Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz, the memory is 16GB 2400MHz DDR4, the graphics card is an Nvidia GTX 1080Ti, and the operating system is Ubuntu 20.04.1 LTS. The code is mainly implemented in Python, version 3.6.13; the neural network part mainly relies on PyTorch 1.2.0, and the required third-party libraries are managed with Anaconda.
In terms of feature extraction, the three-dimensional sequence features extracted from the video are flattened to two dimensions, down-sampling is adopted, and 1 frame out of every 16 video frames is selected for feature extraction. To ensure that the features in the video are kept aligned, the first 8 feature frames are intercepted as the video feature input. For the text features, Hugging Face is used to load the pre-trained Chinese-BERT-wwm, and during text feature extraction, embedded vectors representing the beginning and the end (<SOS> and <EOS>) are added at the beginning and the end of the text respectively.
Similar to the video features, the audio features also undergo down-sampling. The extracted features of the different modalities are saved as .pkl files with the Python third-party library pickle, so that the neural network computation no longer contains a modality feature extraction part and only the forward and backward propagation of the network needs to be considered, ensuring efficient neural network training. In the network implementation, the number of attention heads is set to 4, the hidden layer size of the Transformer is 1024, and there are 6 Transformer modules. The input size of the fully connected layer is 1024. The total number of parameters is about 140M.
In the training stage of the model, an Adam optimizer is used to optimize the network; Adam is currently a popular optimizer in deep learning and has the advantages of simplicity, efficiency and suitability for large-scale data. The initial learning rate of the neural network is set to 1e-3, and if the accuracy on the validation set does not rise, the learning rate is further decayed by a factor of 0.2. The Dropout node retention probability is set to 0.1; Dropout regularizes the model and reduces overfitting in deep learning. Limited by the video memory of the graphics card, the batch_size is set to 16.
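The training configuration above can be sketched as follows; the model interface and the data loader fields are hypothetical names used only for illustration.

```python
# Hedged sketch of the training setup: Adam at 1e-3, learning-rate decay by a
# factor of 0.2 when validation accuracy stops rising, binary cross entropy on
# the 1-dim output, batch size 16 in the data loaders.
import torch
import torch.nn as nn

def evaluate_accuracy(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for text, audio, video, label in loader:
            pred = (torch.sigmoid(model(text, audio, video).squeeze(-1)) > 0.5).float()
            correct += (pred == label.float()).sum().item()
            total += label.numel()
    return correct / total

def train(model, train_loader, val_loader, epochs=30):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.2)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        model.train()
        for text, audio, video, label in train_loader:      # loaders built with batch_size=16
            optimizer.zero_grad()
            logits = model(text, audio, video).squeeze(-1)   # hypothetical model signature
            loss = criterion(logits, label.float())
            loss.backward()
            optimizer.step()
        scheduler.step(evaluate_accuracy(model, val_loader))
```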
In some embodiments, there is provided a Transformer-based inter-modality joint encoding apparatus, including:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting the text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
In some embodiments, an electronic device is provided, which includes a processor and a storage device, the storage device having a plurality of instructions stored therein, and the processor is configured to read the plurality of instructions from the storage device and execute the method.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A Transformer-based inter-modal joint coding method, comprising:
acquiring a video to be analyzed containing multi-modal information;
extracting text features of the video to be analyzed;
extracting the audio features of the video to be analyzed;
extracting video picture characteristics of the video to be analyzed;
unifying the text feature, the audio feature and the video picture feature into the same dimensionality based on the full connection layer and the LSTM layer;
based on a Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features;
and weighting the text characterization features, the audio characterization features and the video image characterization features based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed.
2. The method of claim 1, wherein the text features are extracted based on a pre-trained Chinese-BERT-wwm model;
the Chinese-BERT-wwm model is loaded through the Hugging Face Transformers library;
the Chinese-BERT-wwm model adopts a full word mask mechanism in the training process.
3. The method as claimed in claim 1, wherein the R(2+1)D model is adopted to extract the video picture features of the video to be analyzed;
extracting the video picture characteristics of the video to be analyzed, comprising the following steps:
extracting three-dimensional sequence features in the video to be analyzed;
flattening the three-dimensional sequence features into two-dimensional sequence features;
and performing down-sampling on the two-dimensional sequence features, selecting 1 frame out of every 16 frames for feature extraction, and selecting the network output of the R(2+1)D model at the spatio-temporal pooling layer as the extracted video picture features.
4. The method according to claim 1, wherein in the multi-modal joint attention coding process, the text mode is used as a main mode, and an audio mode and a video picture mode are modulation-coded;
the Transformer model comprises three joint coding units, wherein each joint coding unit comprises a Multi-Head attention module, a first residual connection standardization module, a feedforward module Feed-Forward, a second residual connection standardization module, a soft attention module soft-attention and a third residual connection standardization module which are sequentially connected;
based on the Transformer model, performing multi-modal attention joint coding on the text features, the audio features and the video picture features to obtain text characterization features, audio characterization features and video picture characterization features, wherein the method comprises the following steps:
inputting the text characteristics into a first joint coding unit to obtain text characterization characteristics;
inputting the text features and the audio features into a second combined encoding unit to obtain audio representation features;
and inputting the text features and the video picture features into a third joint coding unit to obtain video picture representation features.
5. The method of claim 4, wherein the soft attention module comprises a plurality of soft attention layers;
the soft attention module operates the input features, including:
performing soft attention operation on the input features in each soft attention layer;
and superposing the results obtained by the operation of each soft attention layer to obtain the output of the soft attention module, wherein the output is expressed by the following formula:
S_M = stack(m_1, ..., m_G_m);
wherein S_M is the output of the soft attention module, stack represents the stacking operation, m_1, ..., m_G_m are the one-dimensional vectors produced by the soft attention operation of each soft attention layer, and G_m is the number of soft attention layers.
6. The method according to claim 1 or 5, wherein weighting the text characterization feature, the audio characterization feature and the video image characterization feature based on a multi-layer perceptron classification model to obtain a classification result of the video to be analyzed comprises:
inputting the text characterization feature, the audio characterization feature and the video picture characterization feature into a first full-link layer, a RELU activation function layer and a second full-link layer respectively, and introducing a random inactivation Dropout mechanism to obtain an intermediate text characterization feature, an intermediate audio characterization feature and an intermediate video picture characterization feature;
based on a soft-attention mechanism, respectively calculating soft attention vectors according to the intermediate text characterization features, the intermediate audio characterization features and the intermediate video picture characterization features;
respectively weighting the text characterization features, the audio characterization features and the video picture characterization features based on the soft attention vectors to obtain final text characterization features, final audio characterization features and final video picture characterization features;
and performing layer normalization weighting fusion on the final text characterization feature, the final audio characterization feature and the final video picture characterization feature based on a fusion weight matrix obtained in advance, and inputting the layer normalization weighting fusion into a third full-connection layer to obtain a classification result of the video to be analyzed.
7. The method according to claim 6, wherein, based on a soft-attention mechanism, soft attention vectors are respectively calculated according to the intermediate text characterization feature, the intermediate audio characterization feature and the intermediate video picture characterization feature, and the text characterization feature, the audio characterization feature and the video picture characterization feature are weighted based on the soft attention vectors, expressed by the following formulas:
α_L = softmax(L');
α_A = softmax(A');
α_V = softmax(V');
L_o = Σ_i α_L,i · l_i;
A_o = Σ_i α_A,i · a_i;
V_o = Σ_i α_V,i · v_i;
wherein L_o represents the final text characterization feature obtained by weighting, A_o the final audio characterization feature obtained by weighting, V_o the final video picture characterization feature obtained by weighting; α_L, α_A and α_V are the soft attention vectors of the text, audio and video picture characterization features; softmax is the normalized exponential function; L', A' and V' are the intermediate text, audio and video picture characterization features; l_i, a_i and v_i are the features of the i-th sentence in the text, the i-th audio segment and the i-th video frame; and α_L,i, α_A,i and α_V,i are the i-th components of the corresponding soft attention vectors used as weights.
8. The method according to claim 6, wherein the final text characterization feature, the final audio characterization feature and the final video picture characterization feature are subjected to layer normalization weighted fusion based on a fusion weight matrix obtained in advance, expressed by the following formula:
predicted_y = LayerNorm(W_L L_o + W_A A_o + W_V V_o);
wherein predicted_y is the predicted classification result, LayerNorm indicates the layer normalization operation, and W_L, W_A and W_V represent the weights corresponding to the text modality, the audio modality and the video picture modality, satisfying W_L, W_A, W_V ∈ R^(d_z×K), wherein d_z is a hyper-parameter which represents the vector dimension after weighted fusion.
9. An inter-modal joint encoding apparatus based on a Transformer, comprising:
the video acquisition module is used for acquiring a video to be analyzed containing multi-mode information;
the text feature extraction module is used for extracting text features of the video to be analyzed;
the audio characteristic extraction module is used for extracting the audio characteristics of the video to be analyzed;
the visual characteristic extraction module is used for extracting video image characteristics of the video to be analyzed;
the dimension unifying module is used for unifying the text feature, the audio feature and the video picture feature into the same dimension based on the full connection layer and the LSTM layer;
the characterization feature calculation module is used for performing multi-mode attention joint coding on the text features, the audio features and the video picture features based on a Transformer model to obtain text characterization features, audio characterization features and video picture characterization features;
and the classification result calculation module is used for weighting the text characterization feature, the audio characterization feature and the video picture characterization feature based on the multilayer perceptron classification model to obtain the classification result of the video to be analyzed.
10. An electronic device comprising a processor and a memory means, wherein a plurality of instructions are stored in the memory means, and wherein the processor is configured to read the plurality of instructions from the memory means and to perform the method according to any one of claims 1 to 8.
CN202211335121.2A 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer Pending CN115937641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211335121.2A CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211335121.2A CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Publications (1)

Publication Number Publication Date
CN115937641A true CN115937641A (en) 2023-04-07

Family

ID=86553123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211335121.2A Pending CN115937641A (en) 2022-10-28 2022-10-28 Method, device and equipment for intermodal joint coding based on Transformer

Country Status (1)

Country Link
CN (1) CN115937641A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701568A (en) * 2023-05-09 2023-09-05 湖南工商大学 Short video emotion classification method and system based on 3D convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination