CN116680578A - Cross-modal model-based deep semantic understanding method - Google Patents

Cross-modal model-based deep semantic understanding method

Info

Publication number
CN116680578A
CN116680578A
Authority
CN
China
Prior art keywords
text
data
image
audio
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310445651.0A
Other languages
Chinese (zh)
Inventor
矫健
祝中科
程球
白善今
李平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202310445651.0A priority Critical patent/CN116680578A/en
Publication of CN116680578A publication Critical patent/CN116680578A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a deep semantic understanding method based on a cross-modal model, which comprises: invoking the WIT dataset to pre-train a text encoder and an image encoder; invoking the ImageNet dataset to pre-train an audio feature extraction sub-network; invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, image encoder and audio feature extraction sub-network to obtain a cross-modal model after weight transfer learning; invoking the ESC-50 dataset and the AudioSet dataset to perform optimization training on the audio feature extraction sub-network in the cross-modal model after weight transfer learning to obtain a weight-optimized cross-modal model; and running the weight-optimized cross-modal model, outputting the association information among text data, image data and audio data in a multi-source dataset, and completing the deep semantic understanding of the multi-source dataset. The application solves the problems that current intelligent detection and recognition algorithms for multi-source data do not effectively exploit multi-modal data and that the safety and stability of their analysis results are poor.

Description

Cross-modal model-based deep semantic understanding method
Technical Field
The application belongs to the technical field of cross-modal semantic understanding, and particularly relates to a deep semantic understanding method based on a cross-modal model.
Background
With the continuous development of high and new technologies such as the Internet and multimedia, multi-source data fusing images, text and audio has become a mainstream information transmission medium and is closely related to people's daily lives. Research on single modalities, such as computer vision, natural language processing and speech recognition, has made tremendous progress. How to further perform cross-modal semantic understanding and reasoning across vision, text and audio, and reduce the semantic gap between modalities, has therefore become a hot research problem.
Deep learning is an artificial-intelligence approach that relies on labeled sample data; by performing representation learning on the data and training continuously, it improves the detection and recognition performance of the intelligent algorithm. Deep learning is mainly aimed at intelligent detection and recognition, making correct detections and recognitions in time according to the current data information, and is therefore well suited to intelligent data analysis scenarios. However, deep learning algorithms suffer from insufficient joint learning capability across different types of data: modal information cannot be fused efficiently, accurate association and alignment between modalities cannot be established, and the semantic information within each modality is difficult to mine and understand effectively. These problems call the safety of deep learning detection and recognition results into question and make it difficult to guide applications accurately.
The cross-modal algorithm is a machine learning approach whose detection and recognition results are widely recognized as safe: through joint representation learning, multi-source data are mapped into a unified cross-modal vector space, and target detection or regression is performed by combining information from multiple modalities. It has the advantages of safety and strong stability, can be used to solve detection and recognition problems for various data types, and is well aligned with human cognition. However, common cross-modal algorithms cannot effectively combine the three main modalities of the real world in the way humans perceive and describe them, are prone to overfitting, have poor safety and stability under noise perturbation, and achieve poor learning results.
Disclosure of Invention
The application aims to provide a deep semantic understanding method based on a cross-modal model, which solves the problems that current intelligent detection and recognition algorithms for multi-source data do not effectively exploit multi-modal data and that the safety and stability of their analysis results are poor.
In order to achieve the above purpose, the technical scheme adopted by the application is as follows:
A cross-modal model-based deep semantic understanding method, the cross-modal model comprising a text encoder, an image encoder and an audio feature extraction sub-network, the cross-modal model-based deep semantic understanding method comprising:
invoking the WIT dataset to pre-train the text encoder and the image encoder;
invoking the ImageNet dataset to pre-train the audio feature extraction sub-network;
invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, the image encoder and the audio feature extraction sub-network to obtain a cross-modal model after weight transfer learning;
invoking the ESC-50 dataset and the AudioSet dataset to perform optimization training on the audio feature extraction sub-network in the cross-modal model after weight transfer learning to obtain a weight-optimized cross-modal model;
and running the weight-optimized cross-modal model, outputting the association information among text data, image data and audio data in a multi-source dataset, and completing the deep semantic understanding of the multi-source dataset.
The following provides several optional implementations; these are not additional limitations on the overall scheme above but only further additions or preferences, and each may be combined individually with the overall scheme, or several may be combined with one another, provided there is no technical or logical contradiction.
Preferably, invoking the WIT dataset to pre-train the text encoder and the image encoder comprises:
the text encoder extracts text feature vectors of text data in the WIT data set;
the image encoder extracts image feature vectors of image data in the WIT data set;
calculating cosine similarity of text data and image data in the text image pair according to the normalized text feature vector and the normalized image feature vector;
and calculating a loss value of the text data and a loss value of the image data based on the cosine similarity, and updating the text encoder and the image encoder.
Preferably, calculating the loss value of the text data and the loss value of the image data based on the cosine similarity and updating the text encoder and the image encoder comprises:
the loss value of the text data is calculated as follows:
$loss_t = -\left[ y_t \log \hat{y}_t + (1 - y_t)\log(1 - \hat{y}_t) \right]$
where $loss_t$ is the loss value of the text data, $y_t$ is the true label of the text data based on the text image pair, and $\hat{y}_t$ is the predicted probability value of the text data based on the text image pair, i.e. the cosine similarity of the text data and the image data in the text image pair;
the loss value of the image data is calculated as follows:
$loss_i = -\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$
where $loss_i$ is the loss value of the image data, $y_i$ is the true label of the image data based on the text image pair, and $\hat{y}_i$ is the predicted probability value of the image data based on the text image pair, i.e. the cosine similarity of the text data and the image data in the text image pair;
the loss value of the text image pair is calculated as follows:
$loss_{ti} = \tfrac{1}{2}\left( loss_t + loss_i \right)$
where $loss_{ti}$ is the loss value of the text image pair;
and the weights of the text encoder and the image encoder are updated by back-propagation based on the loss value of the text image pair.
Preferably, invoking the ImageNet dataset to pre-train the audio feature extraction sub-network comprises:
performing data enhancement on the ImageNet dataset using the Random Erasing method and the Mixup method;
and pre-training the audio feature extraction sub-network with the data-enhanced ImageNet dataset.
Preferably, invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, image encoder and audio feature extraction sub-network comprises:
converting the audio data in the AudioSet dataset into spectrograms through the short-time Fourier transform;
mapping the spectrogram to three input channels according to frequency bands to obtain a three-channel image;
taking an audio text pair in the AudioSet dataset, taking the three-channel image of the audio data in the audio text pair as input and the text data in the audio text pair as the label, and performing extended training on the pre-trained audio feature extraction sub-network;
and taking the extended-trained audio feature extraction sub-network, combining it with the pre-trained text encoder and the pre-trained image encoder as the cross-modal model to be trained, taking audio text image pairs in the AudioSet dataset, and performing transfer training on the cross-modal model with the audio text image pairs.
Preferably, taking the audio text image pairs in the AudioSet dataset and performing transfer training on the cross-modal model with the audio text image pairs comprises:
the text encoder extracts text feature vectors of text data in an audio text image pair;
the image encoder extracts image feature vectors of image data in the audio text image pair;
the audio feature extraction sub-network extracts audio feature vectors in three channel images corresponding to the audio data in the audio text image pair;
according to the normalized text feature vector, the normalized image feature vector and the normalized audio feature vector, calculating cosine similarity of the text data and the image data, cosine similarity of the audio data and the text data, and cosine similarity of the image data and the audio data;
and calculating a loss value and updating the cross-modal model based on the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data.
Preferably, the calculating the loss value and updating the cross-modal model based on the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data includes:
calculating a text image loss value based on cosine similarity of the text data and the image data;
calculating an audio text loss value based on cosine similarity of the audio data and the text data;
calculating an image audio loss value based on cosine similarity of the image data and the audio data;
and adding and averaging the text image loss value, the audio text loss value and the image audio loss value to update the cross-modal model as a final loss value.
Preferably, the invoking the ESC-50 dataset and the AudioSet dataset performs optimization training on the audio feature extraction sub-network in the cross-modal model after the weight transfer learning, including:
the method comprises the steps of taking an audio text pair in an ESC-50 dataset, freezing weights of a text encoder and an image encoder in a cross-modal model, training the cross-modal model by the audio text pair, and adjusting weights of an audio feature extraction sub-network;
and (3) taking an audio text pair in the AudioSet data set, freezing the weight of the image encoder in the cross-modal model, training the cross-modal model by using the audio text pair, and adjusting the weights of the audio feature extraction sub-network and the text encoder.
According to the cross-modal model-based deep semantic understanding method described above, different modal data are jointly trained within the cross-modal model so that the image, audio and text feature domains can be aligned; the similarity of features from different modalities is compared with a cosine similarity function; the generalization, representation and reasoning capabilities of the model are improved; the safety of the detection results of the cross-modal deep learning algorithm is effectively improved; and the problems that a single-modal deep learning algorithm cannot achieve cross-modal pre-training and cross-modal retrieval and has weak safety are solved.
A second purpose of the application is to provide a deep semantic understanding device based on a cross-modal model, which solves the problems that current intelligent detection and recognition algorithms for multi-source data do not effectively exploit multi-modal data and that the safety and stability of their analysis results are poor.
In order to achieve the above purpose, the technical scheme adopted by the application is as follows:
a cross-modal model-based depth semantic understanding device comprises a processor and a memory storing a plurality of computer instructions which when executed by the processor implement the steps of the cross-modal model-based depth semantic understanding method.
Drawings
FIG. 1 is an overall flow chart of the deep semantic understanding method based on a cross-modal model of the present application;
FIG. 2 is a detailed flow chart of the cross-modal model-based deep semantic understanding method of the present application;
FIG. 3 is a schematic diagram of an implementation of the cosine similarity function of the present application;
FIG. 4 is a schematic diagram of an embodiment of a cross-modal model constructed in accordance with the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in FIG. 1, the overall flow of the deep semantic understanding method based on a cross-modal model in this embodiment is as follows: pre-train the cross-modal model (a cross-modal deep learning model) with several datasets to complete the parameter initialization of the cross-modal model; transfer and optimize the pre-trained model parameters with datasets containing the three modalities; and perform information analysis tests on, and then apply, the optimized model.
The currently prevailing data types mainly include text data, image data and audio data; therefore, this embodiment builds a cross-modal model comprising a text encoder, an image encoder and an audio feature extraction sub-network. In other embodiments, if other types of data need to be understood and parsed, a corresponding recognition network is added to the cross-modal model following the logic of the present application.
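As an illustrative sketch only, the three-branch structure of such a cross-modal model can be expressed in PyTorch as follows; the branch internals (the `LazyLinear` stand-ins) and the embedding dimension are assumptions of this sketch, not features fixed by the application:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalModel(nn.Module):
    """Skeleton of the three-branch model: text encoder, image encoder, audio sub-network."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Each branch is a stand-in; the application pairs a BERT/GPT-style text encoder,
        # a VGG/ResNet-style image encoder and a ResNet+attention audio sub-network.
        self.text_encoder = nn.Sequential(nn.LazyLinear(embed_dim))
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.audio_subnet = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

    def forward(self, text_feats, images, spectrograms):
        # All three branches project into one shared space and are L2-normalized.
        t = F.normalize(self.text_encoder(text_feats), dim=-1)
        i = F.normalize(self.image_encoder(images), dim=-1)
        f = F.normalize(self.audio_subnet(spectrograms), dim=-1)
        return t, i, f

# Toy forward pass with dummy inputs (shapes are illustrative).
t, i, f = CrossModalModel()(torch.randn(2, 768),
                            torch.randn(2, 3, 224, 224),
                            torch.randn(2, 3, 128, 128))
```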
Text encoders and image encoders in deep learning convert text and image information into machine-understandable vector representations. Common text encoders include BERT and GPT, while common image encoders include VGG, ResNet and Inception. Using a text encoder and an image encoder makes it possible to process large-scale text and image data more efficiently and improves the accuracy and efficiency of deep learning.
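As an illustration of such encoders (the specific backbones and model names below, e.g. `bert-base-uncased` from transformers and ResNet-50 from torchvision, are examples rather than choices fixed by the application, and the image path is a placeholder), text and image feature vectors can be extracted as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision import models, transforms
from PIL import Image

# Example text encoder: BERT; the [CLS] hidden state is used as the text feature vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

# Example image encoder: ResNet-50 with the classification head removed,
# so the 2048-d pooled feature is returned.
image_encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
image_encoder.fc = torch.nn.Identity()
image_encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    tokens = tokenizer(["a dog barking in a park"], return_tensors="pt")
    text_feat = text_encoder(**tokens).last_hidden_state[:, 0]   # shape (1, 768)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # any RGB image
    image_feat = image_encoder(image)                            # shape (1, 2048)
```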
Specifically, as shown in FIG. 2, the deep semantic understanding method based on a cross-modal model in this embodiment includes the following steps:
and S1, invoking the WIT data set to pretrain the text encoder and the image encoder.
WIT (WebImageText) dataset is a dataset containing images and text that can be used to train image retrieval and text recognition models. It can help the machine learning model better understand the relationship between images and text. The present embodiment pre-trains the text encoder and the image encoder of the deep learning model in two modalities with a text image pair in the WIT dataset.
The WIT data set comprises N text labels, and in the training process, each text label is encoded by a text encoder of the model to obtain a corresponding text feature vector which is stored in a one-dimensional vector data format and comprises a plurality of label categories; each image is encoded by an image encoder to obtain a corresponding image feature vector stored in a one-dimensional vector data format, wherein the text feature vector is taken as a row, the image feature vector is taken as a column to form a two-dimensional feature vector, the cosine similarity is calculated, and the highest similarity is the prediction category of Top-1 as shown in fig. 3.
Step S1.1, the text encoder extracts text feature vectors of the text data in the WIT dataset.
The text feature vector output by the text encoder is as follows:
$T = [T_1, T_2, \ldots, T_i, \ldots, T_M]$
where $T$ is the text feature vector in one-dimensional vector format, $M$ is the total number of text features in the text feature vector, and $T_i$ is the i-th text feature.
Step S1.2, the image encoder extracts image feature vectors of the image data in the WIT dataset.
The image feature vector output by the image encoder is as follows:
$I = [I_1, I_2, \ldots, I_j, \ldots, I_N]$
where $I$ is the image feature vector in one-dimensional vector format, $N$ is the total number of image features in the image feature vector, and $I_j$ is the j-th image feature.
Step S1.3, calculating the cosine similarity of the text data and the image data in a text image pair from the normalized text feature vector and the normalized image feature vector.
The two single-modality features are mapped into the same vector space and L2-normalized. The L2 normalization is expressed as follows:
$x = [x_1, x_2, \ldots, x_L]$
$y = [y_1, y_2, \ldots, y_L], \quad y_i = \dfrac{x_i}{\sqrt{\sum_{l=1}^{L} x_l^2}}$
where $x$ denotes a one-dimensional feature vector of length $L$, $x_i$ is the i-th value in the one-dimensional feature vector $x$, and $y$ denotes the L2-normalized one-dimensional feature vector. Substituting the text feature vector and the image feature vector into the normalization formula yields the normalized text feature vector $T' = [T'_1, T'_2, \ldots, T'_i, \ldots, T'_M]$ and image feature vector $I' = [I'_1, I'_2, \ldots, I'_j, \ldots, I'_N]$.
The probability that the original unnormalized predictions of the two modalities are similar is then computed with the cosine similarity function, whose expression is as follows:
$\mathrm{cosine\_similarity}(T', I') = \dfrac{T' \cdot I'}{\|T'\|\,\|I'\|}$
where $\mathrm{cosine\_similarity}(T', I')$ is the cosine similarity of the normalized text feature vector and image feature vector, representing the probability that they are similar. The value range of the cosine similarity function is $[-1, 1]$; in the present application, however, the cosine similarity takes values between 0 and 1, where 0 means completely dissimilar, 1 means completely similar, and a value closer to 1 indicates higher similarity. In addition, within one text image pair, the cosine similarity of the text data and the image data is the same as the cosine similarity of the image data and the text data, i.e. $\mathrm{cosine\_similarity}(I', T') = \mathrm{cosine\_similarity}(T', I')$.
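A minimal NumPy sketch of the normalization and cosine similarity computation above, assuming a toy batch of already-extracted features (the dimensions are illustrative):

```python
import numpy as np

def l2_normalize(v: np.ndarray, axis: int = -1) -> np.ndarray:
    """y_i = x_i / sqrt(sum_l x_l^2), applied row-wise."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

# Assumed toy batch: 4 text feature vectors and 4 image feature vectors of dimension 512.
rng = np.random.default_rng(0)
T = l2_normalize(rng.normal(size=(4, 512)))   # normalized text features T'
I = l2_normalize(rng.normal(size=(4, 512)))   # normalized image features I'

# Cosine similarity of every text/image pair; with unit-norm rows this is a dot product.
similarity = T @ I.T                          # shape (4, 4), values in [-1, 1]

# Top-1 prediction: for each text, the image with the highest similarity (cf. FIG. 3).
top1_image_per_text = similarity.argmax(axis=1)
```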
Step S1.4, calculating the loss value of the text data and the loss value of the image data based on the cosine similarity, and updating the text encoder and the image encoder.
$\mathrm{cosine\_similarity}(T', I')$ is fed into a cross-entropy loss computation together with the GroundTruth (the GroundTruth denotes the real target labels and is expressed as an identity matrix: the similarity of paired samples is 1, and all other entries are 0). The cross-entropy loss is computed as follows.
Step S1.4.1, the loss value of the text data is calculated as follows:
$loss_t = -\left[ y_t \log \hat{y}_t + (1 - y_t)\log(1 - \hat{y}_t) \right]$
where $loss_t$ is the loss value of the text data, i.e. the difference between the true sample label and the predicted probability, $y_t$ is the true label of the text data based on the text image pair, and $\hat{y}_t$ is the predicted probability value of the text data based on the text image pair, namely the cosine similarity calculated in step S1.3, with values in $[0, 1]$.
Step S1.4.2, the loss value of the image data is calculated as follows:
$loss_i = -\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$
where $loss_i$ is the loss value of the image data, i.e. the difference between the true sample label and the predicted probability, $y_i$ is the true label of the image data based on the text image pair, and $\hat{y}_i$ is the predicted probability value of the image data based on the text image pair, namely the cosine similarity calculated in step S1.3, with values in $[0, 1]$.
Step S1.4.3, the loss value of the text image pair is calculated as follows:
$loss_{ti} = \tfrac{1}{2}\left( loss_t + loss_i \right)$
where $loss_{ti}$ is the loss value of the text image pair.
Step S1.4.4, back-propagating according to the loss value of the text image pair to update the weights of the text encoder and the image encoder. Back-propagation is a conventional means of updating a deep learning model and is not described in detail in this embodiment.
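A sketch of this update in PyTorch, under a common batch-wise (CLIP-style) reading of the step: the cosine similarity matrix is compared against an identity GroundTruth with a cross-entropy loss along the text axis and along the image axis, and the two losses are averaged. Note that this uses a multi-class cross-entropy over the batch rather than the per-pair binary form written above, which is an assumption of the sketch:

```python
import torch
import torch.nn.functional as F

def text_image_loss(text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    """Average of the text-side and image-side cross-entropy losses for a batch of pairs."""
    t = F.normalize(text_feats, dim=-1)           # T'
    i = F.normalize(image_feats, dim=-1)          # I'
    logits = t @ i.t()                            # cosine similarities, shape (B, B)
    target = torch.arange(logits.size(0))         # GroundTruth: matching pairs on the diagonal
    loss_t = F.cross_entropy(logits, target)      # text -> image direction
    loss_i = F.cross_entropy(logits.t(), target)  # image -> text direction
    return 0.5 * (loss_t + loss_i)                # loss_ti

# Usage inside a training step (encoders and optimizer assumed to exist):
#   loss = text_image_loss(text_encoder(batch_text), image_encoder(batch_images))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```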
Step S2, invoking the ImageNet dataset to pre-train the audio feature extraction sub-network.
Initializing the model weights means acquiring, via supervised learning on large-scale data, a pre-trained model that is independent of any specific task. This is an application of transfer learning: knowledge learned in the open domain is transferred to downstream tasks to improve low-resource tasks. The audio feature extraction sub-network is mainly built from a data enhancement algorithm, a ResNet network and an attention mechanism from the image field.
Step S2.1, performing data enhancement on the ImageNet dataset using the Random Erasing method and the Mixup method.
Data enhancement (e.g., Random Erasing, Mixup, CutMix, etc.) is a lightweight approach to improving the generalization ability of a network model. It requires no additional parameters or memory consumption and can be integrated with various neural network models without changing the learning strategy. This embodiment selects Random Erasing and Mixup as the data enhancement algorithms for pre-training the audio feature extraction sub-network.
In this embodiment, two kinds of data enhancement, Random Erasing and Mixup, are applied to the ImageNet dataset. The Random Erasing method is expressed as follows:
$S = W \times H$
$S_e \in (S_l, S_h)$
$r_e \in (r_1, r_2)$
$H_e = \sqrt{S_e \times r_e}, \quad W_e = \sqrt{S_e / r_e}$
$P = (x_e, y_e)$
$I_e = (x_e,\; y_e,\; x_e + W_e,\; y_e + H_e)$
where $I_e$ is the rectangular region randomly selected in the image for random erasing, $S$ is the image area, $W$ is the image width, $H$ is the image height, $S_e$ is the randomly initialized size of the "erasure rectangle" with range $(S_l, S_h)$, $r_e \in (r_1, r_2)$ is the aspect ratio of $S_e$, $H_e$ and $W_e$ are respectively the height and width of the rectangle $I_e$, $P$ is a point taken at random in the image, $(x_e, y_e)$ are the coordinates of that point, and $I_e = (x_e, y_e, x_e + W_e, y_e + H_e)$ is the rectangular region expanded from that point. Each pixel in the selected region is assigned a random value in $[0, 255]$.
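A sketch of the Random Erasing operation following the equations above; the area-ratio and aspect-ratio bounds are assumed hyperparameters:

```python
import random
import numpy as np

def random_erasing(img: np.ndarray, sl=0.02, sh=0.4, r1=0.3, r2=3.33) -> np.ndarray:
    """Erase one random rectangle of img (H x W x C) and fill it with random values in [0, 255]."""
    H, W = img.shape[:2]
    S = H * W                                        # image area S = W x H
    for _ in range(100):                             # retry until the rectangle fits
        S_e = random.uniform(sl, sh) * S             # erasure-rectangle area
        r_e = random.uniform(r1, r2)                 # aspect ratio r_e
        H_e = int(round((S_e * r_e) ** 0.5))
        W_e = int(round((S_e / r_e) ** 0.5))
        x_e, y_e = random.randint(0, W - 1), random.randint(0, H - 1)   # random point P
        if x_e + W_e <= W and y_e + H_e <= H:
            img[y_e:y_e + H_e, x_e:x_e + W_e] = np.random.randint(
                0, 256, size=(H_e, W_e, img.shape[2]))
            return img
    return img

erased = random_erasing(np.zeros((224, 224, 3), dtype=np.uint8))
```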
The Mixup method is expressed as follows:
$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$
$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$
where $(x_i, y_i)$ and $(x_j, y_j)$ are two samples, with their corresponding labels, randomly selected from the same batch of input data, and $\lambda$ is a number randomly sampled from a Beta distribution, $\lambda \in [0, 1]$.
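A sketch of Mixup on a single batch, following the equations above; the Beta parameter is an assumed hyperparameter:

```python
import numpy as np

def mixup(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """Mix each sample with a randomly chosen sample from the same batch."""
    lam = np.random.beta(alpha, alpha)          # lambda in [0, 1]
    perm = np.random.permutation(len(x))        # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]   # x~ = lam * x_i + (1 - lam) * x_j
    y_mixed = lam * y + (1.0 - lam) * y[perm]   # y~ = lam * y_i + (1 - lam) * y_j (one-hot labels)
    return x_mixed, y_mixed

# Toy usage: a batch of 8 "images" with one-hot labels over 10 classes.
xb = np.random.rand(8, 3, 224, 224)
yb = np.eye(10)[np.random.randint(0, 10, size=8)]
xm, ym = mixup(xb, yb)
```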
Step S2.2, pre-training the audio feature extraction sub-network with the data-enhanced ImageNet dataset, and saving the weights of the audio feature extraction sub-network trained on the ImageNet dataset.
The ResNet network of this embodiment fuses an attention mechanism, expressed as follows:
$H_i(x) = \left(1 + A_i(x)\right) \cdot L_i(x)$
where $x$ is the signal processed by the convolution filters, $H_i(x)$ is the output of the attention module and also the input of the next layer of the network, $L_i(x)$ is the input to the i-th layer of the network, and $A_i(x)$ is the input to the i-th attention module; $L_i(x)$ and $A_i(x)$ receive the same input.
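The fusion of ResNet and attention is only described at this level; the following sketch assumes the residual-attention combination $(1 + A_i(x)) \cdot L_i(x)$ reconstructed above, with both branches receiving the same input:

```python
import torch
import torch.nn as nn

class AttentionResidualBlock(nn.Module):
    """Sketch of a ResNet-style block fused with an attention branch: out = (1 + A(x)) * L(x)."""
    def __init__(self, channels: int):
        super().__init__()
        # Trunk branch L(x): ordinary convolutional features.
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        # Attention branch A(x): a soft mask in [0, 1] computed from the same input.
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L = self.trunk(x)                  # L_i(x)
        A = self.mask(x)                   # A_i(x), same input as the trunk
        return torch.relu((1 + A) * L)     # residual attention combination

out = AttentionResidualBlock(64)(torch.randn(1, 64, 32, 32))
```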
When the audio feature extraction sub-network is pre-trained, a conventional optimization method such as a gradient descent algorithm can be adopted, or an off-the-shelf optimizer can be used directly.
Step S3, invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, image encoder and audio feature extraction sub-network, obtaining a cross-modal model after weight transfer learning.
First, the audio data in the AudioSet dataset are selected to perform extended training on the pre-trained audio feature extraction sub-network, improving the audio detection performance of the cross-modal algorithm model. The extended-trained audio feature extraction sub-network is then integrated into the cross-modal deep learning algorithm model, and the audio clips, corresponding video frames and designated text labels in the AudioSet dataset are used, jointly with the image encoder and the text encoder, to complete the transfer training of the whole cross-modal deep learning algorithm model, whose structure is shown in FIG. 4. The specific training steps are as follows:
step S3.1, converting the audio data in the AudioSet data set into a spectrogram through short-time Fourier transform, wherein the expression is as follows:
wherein X (τ, ω) is the amplitude and phase of the base sinusoidal frequency ω at different time points τ in the time domain signal X, τ is the different time points in the time domain signal, ω is the base sinusoidal frequency, X [ n ] represents the corresponding base coordinates, ω [ n- τ ] represents the analysis window function.
To reduce the spectral disturbance caused by framing, the noise level in the spectrum is reduced using a Blackman-Harris window function expressed as follows:
wherein a is 0 =0.35875,a 1 =0.48829,a 2 =0.14128,a 3 =0.01168, t represents time, N is total window length, W [ t ]]K=0, 1,2, N-1, which is a weighted value of the time function.
Step S3.2, mapping the spectrogram to three input channels according to frequency bands to obtain a three-channel image.
In this embodiment, the spectrogram is mapped onto three input channels along its frequency axis to obtain a three-channel image divided into three frequency bands: low (0.00-7.35 kHz), medium (7.35-14.70 kHz) and high (14.70-22.05 kHz).
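A sketch of steps S3.1-S3.2 with SciPy, assuming 44.1 kHz audio so that the three bands above are thirds of the 0-22.05 kHz range; the window length is an assumed parameter:

```python
import numpy as np
from scipy.signal import stft

def audio_to_three_channel(waveform: np.ndarray, sr: int = 44100, n_fft: int = 1024) -> np.ndarray:
    """STFT with a Blackman-Harris window, then split the magnitude spectrogram into
    low / medium / high frequency bands used as three image channels."""
    freqs, times, Z = stft(waveform, fs=sr, window="blackmanharris", nperseg=n_fft)
    mag = np.abs(Z)                                   # magnitude spectrogram, shape (F, T)
    # Thirds of the 0..sr/2 frequency axis: 0-7.35, 7.35-14.70, 14.70-22.05 kHz at 44.1 kHz.
    low, mid, high = np.array_split(mag, 3, axis=0)
    h = min(low.shape[0], mid.shape[0], high.shape[0])
    return np.stack([low[:h], mid[:h], high[:h]])     # shape (3, H, T)

three_channel = audio_to_three_channel(np.random.randn(44100))   # 1 s of dummy audio
```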
Step S3.3, taking an audio text pair in the AudioSet dataset, taking the three-channel image of the audio data in the audio text pair as input and the text data in the audio text pair as the label, and performing extended training on the pre-trained audio feature extraction sub-network. A conventional optimization method, such as a gradient descent algorithm, can be used for the extended training of the audio feature extraction sub-network.
Step S3.4, taking the extended-trained audio feature extraction sub-network, combining it with the pre-trained text encoder and the pre-trained image encoder as the cross-modal model to be trained, taking audio text image pairs in the AudioSet dataset, and performing transfer training on the cross-modal model with the audio text image pairs.
Step S3.4.1, the text encoder extracts text feature vectors of the text data in the audio text image pair.
The text feature vector output by the text encoder is as follows:
$T = [T_1, T_2, \ldots, T_i, \ldots, T_M]$
where $T$ is the text feature vector in one-dimensional vector format, $M$ is the total number of text features in the text feature vector, and $T_i$ is the i-th text feature.
Step S3.4.2, the image encoder extracts image feature vectors of the image data in the audio text image pair.
The image feature vector output by the image encoder is as follows:
$I = [I_1, I_2, \ldots, I_j, \ldots, I_N]$
where $I$ is the image feature vector in one-dimensional vector format, $N$ is the total number of image features in the image feature vector, and $I_j$ is the j-th image feature.
Step S3.4.3, the audio feature extraction sub-network extracts audio feature vectors from the three-channel image corresponding to the audio data in the audio text image pair.
The audio feature vector output by the audio feature extraction sub-network is as follows:
$F = [F_1, F_2, \ldots, F_k, \ldots, F_Q]$
where $F$ is the audio feature vector in one-dimensional vector format, $Q$ is the total number of audio features in the audio feature vector, and $F_k$ is the k-th audio feature.
Step S3.4.4, calculating the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data from the normalized text feature vector, image feature vector and audio feature vector.
The normalization uses the L2 normalization of step S1.3 and is not repeated here; it yields the normalized text feature vector $T' = [T'_1, T'_2, \ldots, T'_i, \ldots, T'_M]$, image feature vector $I' = [I'_1, I'_2, \ldots, I'_j, \ldots, I'_N]$ and audio feature vector $F' = [F'_1, F'_2, \ldots, F'_k, \ldots, F'_Q]$. The three pairwise cosine similarities are calculated as follows:
$\mathrm{cosine\_similarity}(T', I') = \dfrac{T' \cdot I'}{\|T'\|\,\|I'\|}, \quad \mathrm{cosine\_similarity}(F', T') = \dfrac{F' \cdot T'}{\|F'\|\,\|T'\|}, \quad \mathrm{cosine\_similarity}(I', F') = \dfrac{I' \cdot F'}{\|I'\|\,\|F'\|}$
where $\mathrm{cosine\_similarity}(T', I')$ is the cosine similarity of the normalized text feature vector and image feature vector, i.e. the cosine similarity of the text data and the image data; $\mathrm{cosine\_similarity}(F', T')$ is the cosine similarity of the normalized audio feature vector and text feature vector, i.e. the cosine similarity of the audio data and the text data; and $\mathrm{cosine\_similarity}(I', F')$ is the cosine similarity of the normalized image feature vector and audio feature vector, i.e. the cosine similarity of the image data and the audio data.
Within the same audio text image pair, the cosine similarity of the text data and the image data is the same as the cosine similarity of the image data and the text data, the cosine similarity of the audio data and the text data is the same as the cosine similarity of the text data and the audio data, and the cosine similarity of the image data and the audio data is the same as the cosine similarity of the audio data and the image data.
Step S3.4.5, calculating a loss value based on the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data, and updating the cross-modal model.
As shown in FIG. 4, since the cross-modal model of this embodiment performs deep semantic understanding on three kinds of data (text, image and audio), this embodiment combines the one-dimensional feature vectors of the three kinds of data two by two to obtain three two-dimensional feature vectors, calculates the loss value of each two-dimensional feature vector, and then obtains the loss value of the whole cross-modal model to update it. The loss value of each two-dimensional feature vector is calculated in the same way as described in step S1.4.
(1) Calculate the text image loss value $loss_{ti}$ based on the cosine similarity of the text data and the image data.
(2) Calculate the audio text loss value $loss_{ft}$ based on the cosine similarity of the audio data and the text data:
$loss_f = -\left[ y_f \log \hat{y}_f + (1 - y_f)\log(1 - \hat{y}_f) \right], \quad loss_{ft} = \tfrac{1}{2}\left( loss_f + loss_t \right)$
where $loss_f$ is the loss value of the audio data, i.e. the difference between the true audio sample label and the predicted probability, $y_f$ is the true label of the audio data based on the text audio pair, and $\hat{y}_f$ is the predicted probability value of the audio data based on the text audio pair, i.e. the cosine similarity of the audio data and the text data calculated in step S3.4.4.
(3) Calculate the image audio loss value $loss_{if}$ based on the cosine similarity of the image data and the audio data.
(4) Sum the text image loss value, the audio text loss value and the image audio loss value and take the average as the final loss value to update the cross-modal model:
$loss = \tfrac{1}{3}\left( loss_{ti} + loss_{ft} + loss_{if} \right)$
where $loss$ is the final loss value; the cross-modal model is updated based on this loss value.
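A sketch of the three-pair loss of step S3.4.5, reusing the batch-wise pairwise loss from the earlier sketch (again an assumption of this sketch rather than the exact per-pair form):

```python
import torch
import torch.nn.functional as F

def pair_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over the batch similarity matrix of two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t()
    target = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

def cross_modal_loss(t: torch.Tensor, i: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
    """loss = (loss_ti + loss_ft + loss_if) / 3 for text (t), image (i) and audio (f) features."""
    return (pair_loss(t, i) + pair_loss(f, t) + pair_loss(i, f)) / 3.0

# Toy usage with a batch of 8 aligned triplets in a shared 512-d space.
t = torch.randn(8, 512, requires_grad=True)
i = torch.randn(8, 512, requires_grad=True)
f = torch.randn(8, 512, requires_grad=True)
loss = cross_modal_loss(t, i, f)
loss.backward()   # in a real training step this updates the whole cross-modal model
```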
Step S4, invoking the ESC-50 dataset and the AudioSet dataset to perform optimization training on the audio feature extraction sub-network in the cross-modal model after weight transfer learning, obtaining the weight-optimized cross-modal model.
Because less audio data than text and image data is used in the preceding training steps, the audio feature extraction sub-network cannot be trained fully, and the accuracy of the cross-modal deep learning algorithm model on the test set is low. Therefore, this embodiment uses the ESC-50 dataset to fine-tune the audio feature extraction sub-network, which strengthens the generalization performance of the model and improves its test-set accuracy.
During training, the spectrograms of the audio data in the ESC-50 dataset are first mapped onto three input channels along the frequency axis to obtain three-channel images; then audio text pairs in the ESC-50 dataset are taken, the weights of the text encoder and the image encoder in the cross-modal model are frozen, the cross-modal model is trained with the audio text pairs, and the weights of the audio feature extraction sub-network are adjusted.
Next, the spectrograms of the audio data in the AudioSet dataset are obtained through the short-time Fourier transform and mapped onto three input channels along the frequency axis to obtain three-channel images divided into three frequency bands: low (0.00-7.35 kHz), medium (7.35-14.70 kHz) and high (14.70-22.05 kHz). The transformed and mapped three-channel spectrogram images and the corresponding texts are fed into the cross-modal deep learning model with the image encoder parameters frozen, and the model is trained for 300 epochs with the Adam optimizer. The base learning rate is set to 0.00025, the weight decay factor to 0.0005, and the difference between the true and predicted values is computed with the cosine similarity function and the cross-entropy loss function, i.e. the audio text loss value is calculated and used to update the text encoder and the audio feature extraction sub-network.
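A sketch of the freezing and optimizer setup used in step S4, assuming the model exposes `text_encoder`, `image_encoder` and `audio_subnet` attributes (the attribute names are illustrative); the Adam settings are taken from the description:

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def make_optimizer(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    """Stage 1 (ESC-50): freeze text and image encoders, adjust only the audio sub-network.
    Stage 2 (AudioSet): freeze only the image encoder, adjust audio sub-network and text encoder."""
    set_trainable(model.image_encoder, False)        # image encoder frozen in both stages
    set_trainable(model.text_encoder, stage != 1)    # frozen on ESC-50, tuned on AudioSet
    set_trainable(model.audio_subnet, True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Settings from the description: Adam, base learning rate 0.00025, weight decay 0.0005.
    return torch.optim.Adam(trainable, lr=2.5e-4, weight_decay=5e-4)

# Usage (model assumed to be the transfer-trained cross-modal model):
#   opt = make_optimizer(model, stage=2)
#   for epoch in range(300): ...  # compute the audio text loss, backward, opt.step()
```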
Step S5, running the weight-optimized cross-modal model, outputting the association information among the text data, image data and audio data in the multi-source dataset, and completing the deep semantic understanding of the multi-source dataset.
Before the model is applied, in order to guarantee model accuracy, the trained cross-modal deep learning model is evaluated on a multi-source dataset containing the three modalities of images, audio and text. If the test passes, the construction of the cross-modal model is complete; if the test does not pass, training is required again, in which case the process can return to step S3.
The deep semantic understanding method based on a cross-modal model uses the text image pairs in the WIT dataset to pre-train the text encoder and the image encoder of the deep learning model on two modalities, obtaining a bimodal deep learning model with initialized weights; pre-trains the audio feature extraction sub-network on the ImageNet dataset; performs transfer training on the audio feature extraction sub-network, the text encoder and the image encoder of the cross-modal deep learning model with the AudioSet dataset; fine-tunes the audio feature extraction sub-network of the transfer-trained cross-modal deep learning model with the ESC-50 dataset using two modalities (audio and text); and finally evaluates the pre-trained and fine-tuned cross-modal deep learning model on a multi-source dataset containing the three modalities of images, audio and text, completing the deep semantic understanding based on the cross-modal model.
In another embodiment, the application also provides a deep semantic understanding device based on a cross-modal model, comprising a processor and a memory storing a number of computer instructions which, when executed by the processor, implement the steps of the cross-modal model-based deep semantic understanding method.
For a specific definition of the cross-modal model-based deep semantic understanding device, reference may be made to the definition of the cross-modal model-based deep semantic understanding method above, which is not repeated here.
The memory and the processor are electrically connected directly or indirectly to each other for data transmission or interaction. For example, the components may be electrically connected to each other by one or more communication buses or signal lines. The memory stores a computer program executable on a processor that implements the method of the embodiments of the present application by running the computer program stored in the memory.
The memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc. The memory is used to store a program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip having data processing capabilities. The processor may be a general-purpose processor including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction among the combined technical features, such combinations should be considered within the scope of this description.
The above examples merely represent a few embodiments of the present application, which are described in more detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the scope of the application should be determined by the appended claims.

Claims (9)

1. A cross-modal model-based deep semantic understanding method, wherein the cross-modal model comprises a text encoder, an image encoder and an audio feature extraction sub-network, the cross-modal model-based deep semantic understanding method comprising:
invoking the WIT dataset to pre-train the text encoder and the image encoder;
invoking the ImageNet dataset to pre-train the audio feature extraction sub-network;
invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, the image encoder and the audio feature extraction sub-network to obtain a cross-modal model after weight transfer learning;
invoking the ESC-50 dataset and the AudioSet dataset to perform optimization training on the audio feature extraction sub-network in the cross-modal model after weight transfer learning to obtain a weight-optimized cross-modal model;
and running the weight-optimized cross-modal model, outputting the association information among text data, image data and audio data in a multi-source dataset, and completing the deep semantic understanding of the multi-source dataset.
2. The cross-modal model-based deep semantic understanding method of claim 1, wherein invoking the WIT dataset to pre-train the text encoder and the image encoder comprises:
the text encoder extracts text feature vectors of text data in the WIT data set;
the image encoder extracts image feature vectors of image data in the WIT data set;
calculating cosine similarity of text data and image data in the text image pair according to the normalized text feature vector and the normalized image feature vector;
and calculating a loss value of the text data and a loss value of the image data based on the cosine similarity, and updating the text encoder and the image encoder.
3. The cross-modal model-based deep semantic understanding method of claim 2, wherein calculating the loss value of the text data and the loss value of the image data based on the cosine similarity and updating the text encoder and the image encoder comprises:
the loss value of the text data is calculated as follows:
$loss_t = -\left[ y_t \log \hat{y}_t + (1 - y_t)\log(1 - \hat{y}_t) \right]$
where $loss_t$ is the loss value of the text data, $y_t$ is the true label of the text data based on the text image pair, and $\hat{y}_t$ is the predicted probability value of the text data based on the text image pair, i.e. the cosine similarity of the text data and the image data in the text image pair;
the loss value of the image data is calculated as follows:
$loss_i = -\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$
where $loss_i$ is the loss value of the image data, $y_i$ is the true label of the image data based on the text image pair, and $\hat{y}_i$ is the predicted probability value of the image data based on the text image pair, i.e. the cosine similarity of the text data and the image data in the text image pair;
the loss value of the text image pair is calculated as follows:
$loss_{ti} = \tfrac{1}{2}\left( loss_t + loss_i \right)$
where $loss_{ti}$ is the loss value of the text image pair;
and the weights of the text encoder and the image encoder are updated by back-propagation based on the loss value of the text image pair.
4. The cross-modal model-based deep semantic understanding method of claim 1, wherein invoking the ImageNet dataset to pre-train the audio feature extraction sub-network comprises:
performing data enhancement on the ImageNet dataset using the Random Erasing method and the Mixup method;
and pre-training the audio feature extraction sub-network with the data-enhanced ImageNet dataset.
5. The cross-modal model-based deep semantic understanding method of claim 1, wherein invoking the AudioSet dataset to perform transfer training on the pre-trained text encoder, image encoder and audio feature extraction sub-network comprises:
converting the audio data in the AudioSet dataset into spectrograms through the short-time Fourier transform;
mapping the spectrogram to three input channels according to frequency bands to obtain a three-channel image;
taking an audio text pair in the AudioSet dataset, taking the three-channel image of the audio data in the audio text pair as input and the text data in the audio text pair as the label, and performing extended training on the pre-trained audio feature extraction sub-network;
and taking the extended-trained audio feature extraction sub-network, combining it with the pre-trained text encoder and the pre-trained image encoder as the cross-modal model to be trained, taking audio text image pairs in the AudioSet dataset, and performing transfer training on the cross-modal model with the audio text image pairs.
6. The cross-modal model-based deep semantic understanding method of claim 5, wherein taking the audio text image pairs in the AudioSet dataset and performing transfer training on the cross-modal model with the audio text image pairs comprises:
the text encoder extracts text feature vectors of text data in an audio text image pair;
the image encoder extracts image feature vectors of image data in the audio text image pair;
the audio feature extraction sub-network extracts audio feature vectors in three channel images corresponding to the audio data in the audio text image pair;
according to the normalized text feature vector, the normalized image feature vector and the normalized audio feature vector, calculating cosine similarity of the text data and the image data, cosine similarity of the audio data and the text data, and cosine similarity of the image data and the audio data;
and calculating a loss value and updating the cross-modal model based on the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data.
7. The cross-modal model-based deep semantic understanding method of claim 6, wherein calculating a loss value and updating the cross-modal model based on the cosine similarity of the text data and the image data, the cosine similarity of the audio data and the text data, and the cosine similarity of the image data and the audio data comprises:
calculating a text image loss value based on cosine similarity of the text data and the image data;
calculating an audio text loss value based on cosine similarity of the audio data and the text data;
calculating an image audio loss value based on cosine similarity of the image data and the audio data;
and adding and averaging the text image loss value, the audio text loss value and the image audio loss value to update the cross-modal model as a final loss value.
8. The cross-modal model-based deep semantic understanding method of claim 1, wherein invoking the ESC-50 dataset and the AudioSet dataset to perform optimization training on the audio feature extraction sub-network in the cross-modal model after weight transfer learning comprises:
taking an audio text pair in the ESC-50 dataset, freezing the weights of the text encoder and the image encoder in the cross-modal model, training the cross-modal model with the audio text pair, and adjusting the weights of the audio feature extraction sub-network;
and taking an audio text pair in the AudioSet dataset, freezing the weights of the image encoder in the cross-modal model, training the cross-modal model with the audio text pair, and adjusting the weights of the audio feature extraction sub-network and the text encoder.
9. A cross-modal model-based deep semantic understanding device, comprising a processor and a memory storing a number of computer instructions, wherein the computer instructions, when executed by the processor, implement the steps of the cross-modal model-based deep semantic understanding method of any one of claims 1 to 8.
CN202310445651.0A 2023-04-19 2023-04-19 Cross-modal model-based deep semantic understanding method Pending CN116680578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310445651.0A CN116680578A (en) 2023-04-19 2023-04-19 Cross-modal model-based deep semantic understanding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310445651.0A CN116680578A (en) 2023-04-19 2023-04-19 Cross-modal model-based deep semantic understanding method

Publications (1)

Publication Number Publication Date
CN116680578A true CN116680578A (en) 2023-09-01

Family

ID=87777654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310445651.0A Pending CN116680578A (en) 2023-04-19 2023-04-19 Cross-modal model-based deep semantic understanding method

Country Status (1)

Country Link
CN (1) CN116680578A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131426A (en) * 2023-10-26 2023-11-28 一网互通(北京)科技有限公司 Brand identification method and device based on pre-training and electronic equipment
CN117131426B (en) * 2023-10-26 2024-01-19 一网互通(北京)科技有限公司 Brand identification method and device based on pre-training and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination