CN114332575A - Multi-modal feature fusion method and device, electronic equipment and readable storage medium


Info

Publication number
CN114332575A
CN114332575A (application CN202111626977.0A)
Authority
CN
China
Prior art keywords
feature
matrix
fusion
sub
character
Prior art date
Legal status
Pending
Application number
CN202111626977.0A
Other languages
Chinese (zh)
Inventor
覃祥坤
Current Assignee
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202111626977.0A
Publication of CN114332575A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a multi-modal feature fusion method and device, an electronic device and a readable storage medium. The method includes: acquiring a plurality of heterogeneous data of a target object; extracting, for each heterogeneous data, a single-mode feature matrix of that heterogeneous data; determining, for each single-mode feature matrix, a fusion feature matrix based on the single-mode feature matrix and its corresponding plurality of single-mode weight matrices; determining, for each fusion feature matrix, a fusion weight matrix between that fusion feature matrix and every fusion feature matrix; normalizing each fusion weight matrix to obtain normalized fusion weight matrices; and determining a multi-modal feature matrix describing the target object based on each fusion feature matrix and the plurality of normalized fusion weight matrices corresponding to it. In this way, the different modal features carried by the heterogeneous data can be fused, so that the features of the target object are expressed in finer detail.

Description

Multi-modal feature fusion method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer processing technologies, and in particular, to a multimodal feature fusion method and apparatus, an electronic device, and a readable storage medium.
Background
A modality is a manner in which a target object occurs or exists; single modality means that a target object exists in only one manner, while multi-modality means that the same target object can occur or exist as a combination of two or more modalities at the same time. Data or information from the same data source may be referred to as a modality feature (Modality); common single-modality features include video, text, pictures and audio.
The reason for fusing the single-modal features of data from different data sources into multi-modal features is that different modal features represent different aspects of a target object; in other words, different modal features view the same target object from different angles. Complementary information exists between different modal features, and if these features can be fused together, the characteristics of the object itself can be characterized more finely. How to fuse different modal features has therefore become an urgent problem to be solved.
Disclosure of Invention
In view of the above, an object of the present application is to provide a multi-modal feature fusion method, apparatus, electronic device and readable storage medium, which can fuse different modal features carried by heterogeneous data to more finely express features of a target object.
The embodiment of the application provides a multi-modal feature fusion method, which comprises the following steps:
acquiring a plurality of heterogeneous data of a target object;
aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data;
respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix;
determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix;
for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix;
and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix.
Further, when the heterogeneous data includes audio data, the extracting a monomodal feature matrix of the heterogeneous data includes:
converting the audio data into single sound channel audio data, and performing resampling processing on the single sound channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
Further, when the heterogeneous data includes text data, the extracting a monomodal feature matrix of the heterogeneous data includes:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
Further, the determining the character feature matrix of each character feature group and the picture feature matrix of the picture feature group corresponding to each character feature group includes:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, the fusing the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group includes:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, the determining a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group includes:
and aiming at each sub-document feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-document feature based on the sub-fusion matrix of the sub-document feature and the sub-picture matrix of the sub-document feature corresponding to the sub-picture feature.
Further, a feature extraction model corresponding to the character feature group is obtained through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Further, when the heterogeneous data includes video data, the multimodal feature fusion method further includes:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the extracting a monomodal feature matrix of the heterogeneous data includes:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
The embodiment of the present application further provides a multi-modal feature fusion apparatus, where the multi-modal feature fusion apparatus includes:
the data acquisition module is used for acquiring a plurality of heterogeneous data of the target object;
the matrix extraction module is used for extracting a monomodal feature matrix of the heterogeneous data aiming at each heterogeneous data;
the single-mode weight determining module is used for respectively determining a single-mode weight matrix between each single-mode feature matrix and the single-mode feature matrix aiming at each single-mode feature matrix;
a fusion matrix determining module, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
the fusion weight determining module is used for respectively determining fusion weight matrixes between the fusion feature matrix and each fusion feature matrix aiming at each fusion feature matrix;
the matrix normalization module is used for normalizing the fusion weight matrix aiming at each fusion weight matrix corresponding to the fusion characteristic matrix to obtain a normalized fusion weight matrix;
and the multi-modal characteristic determining module is used for determining a multi-modal characteristic matrix used for describing the target object based on each fusion characteristic matrix and a plurality of normalized fusion weight matrixes corresponding to each fusion characteristic matrix.
Further, when the heterogeneous data includes audio data, the matrix extraction module, when extracting a monomodal feature matrix of the heterogeneous data, is configured to:
converting the audio data into single sound channel audio data, and performing resampling processing on the single sound channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
Further, when the heterogeneous data includes text data, the matrix extraction module, when being configured to extract a monomodal feature matrix of the heterogeneous data, is configured to:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
Further, when the matrix extraction module is configured to determine the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group, the matrix extraction module is configured to:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, when the matrix extraction module is configured to fuse the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group, the matrix extraction module is configured to:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, when the matrix extraction module is configured to determine a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, the matrix extraction module is configured to:
and aiming at each sub-document feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-document feature based on the sub-fusion matrix of the sub-document feature and the sub-picture matrix of the sub-document feature corresponding to the sub-picture feature.
Further, the matrix extraction module is configured to obtain a feature extraction model corresponding to the text feature group through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Further, when the heterogeneous data includes video data, the multimodal feature fusion apparatus further includes a video segmentation module, and the video segmentation module is configured to:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the matrix extraction module, when extracting a monomodal feature matrix of the heterogeneous data, is configured to:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the multimodal feature fusion method as described above.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the multimodal feature fusion method as described above.
The multi-modal feature fusion method, the multi-modal feature fusion device, the electronic equipment and the readable storage medium, provided by the embodiment of the application, are used for acquiring a plurality of heterogeneous data of a target object; aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data; respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix; determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix; aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix; for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix; and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. Therefore, different modal characteristics carried by heterogeneous data can be fused, and the characteristics of the target object can be more finely expressed.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of a multi-modal feature fusion method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a single-mode feature matrix extraction process of audio data according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present disclosure;
fig. 4 is a second schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
Research shows that different modal features represent different aspects of a target object; in other words, different modal features view the same target object from different angles. In essence, some overlapping information exists between different modal features (which also leads to information redundancy), but beyond this overlap, and more importantly, complementary information also exists between different modal features. If different modal features can be fused together, the features of the object itself can be characterized more finely.
Currently, there are three main common ways of multi-modal feature fusion: first, front-end fusion (early-fusion), namely data-level fusion (data-level fusion); second, post-fusion (late-fusion), i.e. decision-level fusion; and thirdly, intermediate fusion (intermediate-fusion).
Front-end fusion fuses multiple independent data sets into a single feature vector, which is then input into a machine learning classifier. Front-end fusion of multi-modal data often cannot fully exploit the complementarity among the modalities, and the raw data being fused usually contains a large amount of redundant information.
Back-end fusion fuses the outputs (scores or decisions) of classifiers trained separately on the data of each modality. The advantage is that the errors of the fused model come from different classifiers, and errors from different classifiers are usually uncorrelated, do not affect each other, and do not accumulate further. Common back-end fusion methods include max-fusion, averaged-fusion, Bayes' rule based fusion, ensemble learning, and the like. Among them, ensemble learning is a typical representative of back-end fusion and is widely applied in research fields such as communication, computer recognition and speech recognition.
The intermediate fusion means that different modal data are firstly converted into high-dimensional characteristic expression and then fused in the intermediate layer of the model. Taking a neural network as an example, the intermediate fusion firstly converts the original data into a high-dimensional feature expression by using the neural network, and then obtains the commonality of different modal data on a high-dimensional space.
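As an illustrative sketch only (not part of the claimed method), intermediate fusion in a neural network can look like the following; the module names, dimensions and two-modality setup are assumptions chosen for illustration.

```python
# Illustrative sketch of intermediate fusion: each modality is first encoded into
# a high-dimensional feature, and the features are fused in an intermediate layer.
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # fusion happens in the middle of the model, on the high-dimensional features
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        a = self.audio_encoder(audio_feat)
        t = self.text_encoder(text_feat)
        fused = torch.relu(self.fusion(torch.cat([a, t], dim=-1)))
        return self.classifier(fused)
```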
Based on this, the embodiment of the application provides a multi-modal feature fusion method, which can fuse the single-modal features of heterogeneous data from different data sources, and can express the features of a target object in more detail.
Referring to fig. 1, fig. 1 is a flowchart of a multi-modal feature fusion method according to an embodiment of the present application. As shown in fig. 1, an embodiment of the present application provides a multi-modal feature fusion method, including:
s101, acquiring a plurality of heterogeneous data of the target object.
And S102, aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data.
S103, respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix.
Taking as an example the case where N single-mode feature matrices are extracted from the plurality of heterogeneous data of the target object, each single-mode feature matrix corresponds to N single-mode weight matrices; these include the single-mode weight matrix between each single-mode feature matrix and itself.
And S104, determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix.
Here, each single-mode feature matrix can be calculated to obtain a corresponding fusion feature matrix.
And S105, aiming at each fusion feature matrix, respectively determining a fusion weight matrix between the fusion feature matrix and each fusion feature matrix.
Here, each monomodal feature matrix corresponds to one fusion feature matrix, and each fusion feature matrix can be calculated to obtain N corresponding fusion weight matrices; here, a fusion weight matrix between each fused feature matrix and itself is included.
And S106, aiming at each fusion weight matrix corresponding to the fusion characteristic matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix.
Here, steps S103 to S106 are executed for the single-mode feature matrix of each heterogeneous data, so as to realize modal feature fusion between each heterogeneous data and the other heterogeneous data, and thereby obtain a fusion feature matrix corresponding to each heterogeneous data and a plurality of normalized fusion weight matrices corresponding to that fusion feature matrix.
S107, determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of fusion weight matrixes corresponding to each fusion feature matrix.
Here, heterogeneous data refers to data with different data structures coming from different data sources; for the same target object, multiple pieces of heterogeneous data related to that object can generally be acquired. For example, suppose target object A participates in a conference and speaks there: video data, picture data, audio data and text data about target object A can be acquired at the conference. The video data and the audio data are captured by different acquisition devices and therefore come from different data sources, so the two are heterogeneous data; likewise, the video data, picture data, audio data and text data are mutually heterogeneous. That is, the heterogeneous data of the target object includes one or more of video data, picture data, audio data and text data.
In step S102, for each heterogeneous data, if a subsequent feature fusion operation is to be performed on the heterogeneous data, first, a monomodal feature matrix of the heterogeneous data needs to be extracted from the heterogeneous data; for heterogeneous data acquired by different data sources, the extraction modes of the monomodal feature matrix for extracting the heterogeneous data are naturally different.
Further, please refer to fig. 2, fig. 2 is a schematic diagram illustrating a process of extracting a single-mode feature matrix of audio data according to an embodiment of the present disclosure. As shown in fig. 2, step S102 includes:
step S201, converting the audio data into monaural audio data, and performing resampling processing on the monaural audio data to obtain resampled audio data.
In this step, the acquired audio data is converted into mono audio data, and the converted mono audio data is resampled to a preset frequency to obtain the resampled audio data.
Here, the preset frequency may be 16 kHz.
Step S202, moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data.
In this step, a Hanning (Hann) time window of a preset window length is moved over the resampled audio data according to a preset window shift; a short-time Fourier transform is applied to the resampled audio data within the Hanning window at each position, until all of the resampled audio data has been transformed, yielding the audio spectrum of the audio data.
Here, the preset window length may be 25ms, and the preset window shift may be 10 ms.
Step S203, mapping the audio frequency spectrum to an initial mel frequency cepstrum by using a filter bank.
In this step, the obtained audio frequency spectrum is mapped into a 64-order mel filter bank, and an initial mel-frequency cepstrum of the audio data is calculated through the filter.
And S204, carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum.
In this step, in order to obtain a stable Mel cepstrum, a logarithm is taken of the computed initial Mel cepstrum. To avoid taking the logarithm of a spectrum value of 0, a bias constant is added during the calculation so that no value of 0 occurs; specifically, the stable Mel cepstrum is calculated by the following formula:
mel_s = log(mel-spectrum + a);
where mel_s is the stable Mel cepstrum, log denotes the logarithm, mel-spectrum is the initial Mel cepstrum, and a is a bias constant, usually taken as 0.01.
And S205, recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode feature matrix of the audio data.
In this step, the computed stable Mel cepstrum is re-framed with a duration of 0.96 s, and feature data capable of representing the audio data are extracted; there are no overlapping frames in the reconstructed stable Mel cepstrum, each frame of feature data contains 64 Mel frequency bands, and its time length is 10 ms (that is, 96 frames of feature data are obtained within each 0.96 s segment after reconstruction). The single-mode feature matrix of the audio data is then constructed from these frames of feature data.
It should be noted that the output data format of the single-mode feature matrix corresponding to the audio data in the present application is [nums_frames, 128]; that is, the output is a 128-dimensional high-level feature vector carrying semantics and meaning for each frame, where nums_frames is the number of frames of feature data, determined by the total audio duration and the 0.96 s re-framing duration.
Here, 0.96 indicates that, when the single-mode feature matrix of the audio data is extracted, the data is re-framed with a duration of 0.96 s.
In another embodiment, when the heterogeneous data includes text data, step S102 includes:
step 1: performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form.
In the step, natural language preprocessing is performed on the acquired text data, and a plurality of character feature groups are respectively extracted from the text data, wherein the extraction can be performed according to the form of features, such as Chinese characters, phrases, sentences and the like; thus, the extracted character feature group comprises at least two of Chinese character features, phrase features and sentence features.
Here, the text data may be split character by character into a plurality of sub-character features. For example, the sentence "咬定青山不放松" ("bite firmly into the green mountain and never let go") can be split into the single characters 咬, 定, 青, 山, 不, 放 and 松, so that 咬 may be used as one sub-character feature in a Chinese-character feature group, and so on; the sub-character features in the Chinese-character feature group are these characters. The split Chinese characters are then each converted into a corresponding one-hot vector form; for example, 咬: [0, 1, 0, 0]; 定: [0, 1, 1, 0].
The text data can likewise be split into phrase-level sub-character features. For example, "咬定青山不放松" can be split into the phrases 咬定 ("bite firmly"), 青山 ("green mountain"), 不 ("not") and 放松 ("let go"), so that 咬定 can be used as one sub-character feature in a phrase feature group, and so on. The split phrases are then each converted into a corresponding one-hot vector form; for example, 咬定: [1, 1, 0, 0]; 青山: [0, 1, 1, 1].
The text data can also be split into sentence-level sub-character features; for example, the whole sentence "咬定青山不放松" can be taken as one sub-character feature in the sentence feature group, and the split sentences are converted into corresponding one-hot vector forms.
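Purely to illustrate the data layout, the sketch below builds the character-level, phrase-level and sentence-level feature groups for the example sentence and encodes each sub-character feature as a one-hot vector; the small vocabulary, the phrase segmentation and the strict one-hot encoding (a single 1 per vector) are assumptions for illustration.

```python
# Illustrative sketch: splitting text into character / phrase / sentence feature
# groups and encoding each sub-character feature as a one-hot vector.
import numpy as np

def one_hot_group(sub_features):
    vocab = {f: i for i, f in enumerate(sorted(set(sub_features)))}
    vectors = np.zeros((len(sub_features), len(vocab)))
    for row, f in enumerate(sub_features):
        vectors[row, vocab[f]] = 1.0
    return vectors

sentence = "咬定青山不放松"
char_group = list(sentence)                        # Chinese-character feature group
phrase_group = ["咬定", "青山", "不", "放松"]       # phrase feature group (assumed segmentation)
sentence_group = [sentence]                        # sentence feature group

groups = {name: one_hot_group(g)
          for name, g in [("char", char_group),
                          ("phrase", phrase_group),
                          ("sentence", sentence_group)]}
```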
The picture feature group consists of traditional-Chinese-character and/or pictographic-Chinese-character pictures corresponding to each sub-character feature in the character feature group to which the picture feature group corresponds. In the above example, the picture feature group corresponding to the Chinese-character feature group comprises the traditional and/or pictographic character picture corresponding to 咬, the traditional and/or pictographic character picture corresponding to 定, and so on; the picture feature group corresponding to the phrase feature group comprises the traditional and/or pictographic character-group picture corresponding to 咬定; and the picture feature group corresponding to the sentence feature group comprises the traditional and/or pictographic character pictures of the whole sentence "咬定青山不放松".
When extracting the character feature group, determining a picture feature group corresponding to each character feature group, wherein the character feature group comprises a plurality of sub-character features; correspondingly, the picture feature group includes sub-picture features corresponding to each sub-word feature.
Specifically, when the character feature group is a Chinese-character feature group, the picture feature group corresponding to it comprises the traditional and/or pictographic Chinese-character pictures corresponding to each sub-character feature (each Chinese character) in the Chinese-character feature group;
when the character feature group is a phrase feature group, the picture feature group corresponding to it comprises the traditional and/or pictographic Chinese-character-group picture corresponding to each sub-character feature (each phrase) in the phrase feature group;
when the character feature group is a sentence feature group, the picture feature group corresponding to it comprises sentences composed of traditional Chinese characters and/or sentences composed of pictographic Chinese characters corresponding to each sub-character feature (each sentence) in the sentence feature group.
Step 2: and determining a character feature matrix of each character feature group and a picture feature matrix of the picture feature group corresponding to each character feature group.
In this step, in order to perform feature fusion in the subsequent process, the expression form of each character feature group needs to be unified and converted into data recognizable by a computer, and the character feature matrix of each character feature group is determined respectively, that is, each sub-character feature in each character feature group is presented in the form of a matrix.
Specifically, when the character feature group is a Chinese character feature group, the sub-character feature is a Chinese character, and the Chinese character is converted into a matrix form, namely the Chinese character is represented in the matrix form; when the character characteristic group is a phrase characteristic group, the sub-character characteristic is a phrase, and the phrase is converted into a matrix form, namely the phrase is represented in the matrix form; when the text feature group is a sentence feature group, the sub-text feature is a sentence, and the sentence is converted into a matrix form, i.e., the sentence is represented in the matrix form.
In one embodiment, step 2 comprises: for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group; and determining the picture feature matrix of the picture feature group corresponding to the character feature group by using a pre-trained convolutional neural network.
In the step, the mode of determining the character characteristic matrix of the character characteristic group is different from the mode of determining the picture characteristic matrix of the picture characteristic group; for the character feature group, a character feature matrix capable of representing each sub-character feature (Chinese character, phrase and sentence) in the character feature group is determined by utilizing a pre-trained feature extraction model corresponding to the character feature group.
For the picture feature group, a picture feature matrix of the picture feature group corresponding to the character feature group is determined by using a pre-trained convolutional neural network.
In one embodiment, the feature extraction model corresponding to the text feature group is obtained by:
firstly, acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from the one-hot vector of the character feature group.
Here, the language pre-training model is obtained by training on training data; however, because a language pre-training model (e.g., the BERT model) has many parameters and a large scale, it cannot be used on a computer of ordinary performance, i.e., it has poor practicability. In this case, the scale of the language pre-training model needs to be compressed while keeping the accuracy of the original language pre-training model.
And then, carrying out model distillation processing on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Here, the feature extraction model is obtained by compressing the language pre-training model, and the feature extraction model can overcome the limitation of the language pre-training model on the length of the sub-text feature while maintaining the better generalization performance of the language pre-training model (e.g., Bert model).
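The application does not spell out the distillation objective. The sketch below shows one common form of model distillation (temperature-scaled soft-label matching between a large teacher and a smaller student), offered only as an assumed illustration of how the language pre-training model could be compressed into the feature extraction model.

```python
# Assumed illustration of model distillation: a smaller student model is trained
# to match the soft outputs of the large language pre-training model (teacher).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both distributions with a temperature, then match them with KL divergence
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * (temperature ** 2)
```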
And step 3: and for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group.
In one embodiment, step 3 comprises: for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic; and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Determining a sub-text matrix of each sub-text feature in a text feature group, and determining a sub-picture matrix of the sub-picture feature corresponding to each sub-text feature from the picture feature matrix of the picture feature group corresponding to the text feature group;
aiming at each sub-character characteristic, fusing a sub-character matrix of the sub-character characteristic and a sub-picture matrix of a corresponding sub-picture to obtain a sub-fusion matrix of the sub-character characteristic;
and aiming at each character feature group, determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature in the character feature group.
Specifically, the sub-fusion matrix of each sub-character feature is determined by the following formula:
[formula presented in the source only as an image; not reproduced]
where u_i^w is the sub-fusion matrix of the i-th sub-character feature in the w-th character feature group, h_i^w is the sub-character matrix of the i-th sub-character feature in the w-th character feature group, and v_i^w is the sub-picture matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group.
And 4, step 4: and for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the fusion feature matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group.
In one embodiment, step 4 comprises: for each sub-character feature in the character feature group, determining a multidimensional correlation matrix of the sub-character feature based on the sub-fusion matrix of the sub-character feature and the sub-picture matrix of the sub-picture feature corresponding to the sub-character feature.
In this step, for each sub-character feature in each character feature group, a multidimensional correlation matrix between the sub-character feature and the corresponding sub-picture feature is calculated based on the sub-fusion matrix of the sub-character feature and the picture feature matrix of the sub-picture feature corresponding to the sub-character feature; specifically, the multidimensional correlation matrix is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where e_i^w is the multidimensional correlation matrix of the i-th sub-character feature in the w-th character feature group, u_i^w is the sub-fusion matrix of the i-th sub-character feature in the w-th character feature group, and v_i^w is the picture feature matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group.
And 5: and determining the multidimensional attention weight of the sub-character features based on the multidimensional correlation matrix of each sub-character feature in the character feature group.
In this step, the multidimensional attention weight of each sub-character feature is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where a_i^w is the multidimensional attention weight of the i-th sub-character feature in the w-th character feature group, e_i^w is the multidimensional correlation matrix of the i-th sub-character feature in the w-th character feature group, and z is the number of sub-character features in the w-th character feature group.
Step 6: and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
In this step, a monomodal feature matrix of the text data is determined by the following formula:
[formula presented in the source only as an image; not reproduced]
where S is the single-mode feature matrix of the text data, a_i^w is the multidimensional attention weight of the i-th sub-character feature in the w-th character feature group, v_i^w is the picture feature matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group, z is the number of sub-character features in the w-th character feature group, and l is the number of character feature groups.
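Since the formulas of steps 3 to 6 appear in the source only as images, the sketch below is an assumed reading of the overall flow: an element-wise sub-fusion, an element-wise correlation, a softmax attention weight over the z sub-character features of each group, and an attention-weighted sum of the sub-picture matrices over all l groups. The specific operations are assumptions; only the sequence of steps follows the text, and all matrices in a group are assumed to share one shape.

```python
# Assumed sketch of steps 3-6 above (exact formulas not reproduced in the source).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def text_single_mode_matrix(groups):
    """groups: list (length l) of lists of (h_i, v_i) pairs, where h_i is the
    sub-character matrix and v_i the corresponding sub-picture matrix."""
    total = None
    for group in groups:                               # w = 1..l character feature groups
        # step 3: sub-fusion matrix u_i of each sub-character feature (assumed: element-wise sum)
        u = [h + v for h, v in group]
        # step 4: multidimensional correlation of u_i with v_i (assumed: element-wise product)
        e = [u_i * v_i for u_i, (_, v_i) in zip(u, group)]
        # step 5: multidimensional attention weight of each sub-character feature
        scores = np.array([e_i.sum() for e_i in e])    # scalar score per sub-feature (assumption)
        a = softmax(scores)
        # step 6: attention-weighted sum of the sub-picture matrices
        weighted = sum(a_i * v_i for a_i, (_, v_i) in zip(a, group))
        total = weighted if total is None else total + weighted
    return total                                       # single-mode feature matrix of the text data
```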
In another embodiment, when the heterogeneous data includes video data, the multimodal feature fusion further includes: and segmenting the video data into multi-frame picture data.
In this step, if the acquired heterogeneous data is video data, the video data is divided into multi-frame picture data according to the form of processable picture data, so as to extract the subsequent single-mode feature matrix.
In another embodiment, when the received video data is sliced into multi-frame picture data or the received heterogeneous data is picture data, step S102 includes: and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
In the step, a deep neural network with a residual error jump mechanism is obtained through training of training set data in advance, and a monomodal feature matrix capable of representing the picture data is extracted from each picture data by using the deep neural network with the residual error jump mechanism.
Here, for video data and picture data, the present application introduces ResNet, a deep neural network with a residual skip structure; the residual network structure alleviates the gradient vanishing, gradient explosion and network degradation problems caused by increasing the number of layers of a convolutional network, thereby improving the accuracy of the single-mode feature extraction results.
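The application names ResNet but gives no architecture details; the sketch below is an assumed illustration showing how video data can be sliced into frames with OpenCV and how a minimal residual block carries the skip ("jump") connection that mitigates gradient vanishing. Layer sizes are illustrative.

```python
# Assumed sketch: slice video into frames and pass each frame through a network
# with residual (skip) connections, as in ResNet. Layer sizes are illustrative.
import cv2
import torch
import torch.nn as nn

def video_to_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # the residual "jump" connection
```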
And for each extracted single-mode feature matrix, performing feature fusion on every two single-mode feature matrices respectively to change the single-mode feature matrices into multi-mode feature matrices, thereby representing the target object more finely.
In step S103, a single-mode weight matrix between every two single-mode feature matrices is calculated by the following formula:
Q_xy = S_mn^x · (S_mn^y)^T;
where Q_xy is the single-mode weight matrix between the single-mode feature matrix of the x-th heterogeneous data and the single-mode feature matrix of the y-th heterogeneous data, S_mn^x is the single-mode feature matrix of the x-th heterogeneous data, (S_mn^y)^T is the transpose of the single-mode feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Taking as an example the case where N single-mode feature matrices S_mn^x are extracted from the plurality of heterogeneous data of the target object, each single-mode feature matrix S_mn^x corresponds to N single-mode weight matrices Q_xy; these include the single-mode weight matrix between each single-mode feature matrix and itself.
In step S104, the fusion feature matrix of each single-mode feature matrix is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where F^x (denoted here as F^x, since the source shows this symbol only as an image) is the fusion feature matrix of the x-th heterogeneous data, S_mn^x is the single-mode feature matrix of the x-th heterogeneous data, (Q_xy)^T is the transpose of the single-mode weight matrix between the single-mode feature matrix of the x-th heterogeneous data and the single-mode feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Here, for the single-mode feature matrix S_mn^x of each heterogeneous data, a corresponding fusion feature matrix F^x can be obtained by calculation through the above formula.
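A numerical sketch of steps S103 and S104 follows. The single-mode weight matrix uses the form Q_xy = S^x · (S^y)^T recovered from the variable definitions above; the combination used for the fusion feature matrix is shown in the source only as an image, so the summation below is an assumption, and all single-mode feature matrices are assumed to share the same shape (m, n).

```python
# Sketch of steps S103-S104. Q_xy = S_x @ S_y.T follows the variable definitions
# above; the form of the fusion feature matrix F_x is an assumption.
import numpy as np

def single_mode_weights(S):
    """S: list of N single-mode feature matrices, assumed to share shape (m, n)."""
    return {(x, y): S[x] @ S[y].T for x in range(len(S)) for y in range(len(S))}

def fusion_feature_matrices(S, Q):
    # assumed reading: F_x sums, over all modalities y, the transposed weight
    # matrix Q_xy applied to the single-mode feature matrix S_x
    N = len(S)
    return [sum(Q[(x, y)].T @ S[x] for y in range(N)) for x in range(N)]
```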
In step S105, a fusion weight matrix between each pair of fusion feature matrices is calculated by the following formula:
P_xy = F^x · (F^y)^T;
where P_xy is the fusion weight matrix between the fusion feature matrix of the x-th heterogeneous data and that of the y-th heterogeneous data, F^x is the fusion feature matrix of the x-th heterogeneous data, (F^y)^T is the transpose of the fusion feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Here, each single-mode feature matrix S_mn^x corresponds to one fusion feature matrix F^x, and for each fusion feature matrix F^x the corresponding N fusion weight matrices P_xy can be obtained by calculation through the above formula; these include the fusion weight matrix between each fusion feature matrix and itself.
In step S106, a normalized fusion weight matrix is calculated by the following formula:
P'_xy = tanh(P_xy);
where P'_xy is the normalized fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data, and P_xy is the fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data.
Here, for each fusion weight matrix P_xy, a corresponding normalized fusion weight matrix P'_xy can be calculated by the above formula.
Here, steps S103 to S106 are performed for the single-mode feature matrix of each heterogeneous data, so as to implement modal feature fusion between each heterogeneous data and the other heterogeneous data, thereby obtaining a fusion feature matrix corresponding to each heterogeneous data and a plurality of normalized fusion weight matrices corresponding to that fusion feature matrix.
In step S107, a multi-modal feature matrix describing the target object is determined by fusing the plurality of single-modal features of the target object according to the following formula:
[formula presented in the source only as an image; not reproduced]
where G_mn is the multi-modal feature matrix of the target object, P'_xy is the normalized fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data, F^x is the fusion feature matrix of the x-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
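Continuing the sketch for steps S105 to S107: only the tanh normalization is stated verbatim in the text; the forms of P_xy and of the final multi-modal feature matrix G are assumptions consistent with the variable definitions above, with all fusion feature matrices assumed to share one shape.

```python
# Sketch of steps S105-S107. The tanh normalization is given in the text;
# the exact forms of P_xy and of the final multi-modal matrix G are assumptions.
import numpy as np

def multi_modal_matrix(F):
    """F: list of N fusion feature matrices, assumed to share shape (m, n)."""
    N = len(F)
    P = {(x, y): F[x] @ F[y].T for x in range(N) for y in range(N)}   # P_xy (assumed form)
    P_norm = {k: np.tanh(v) for k, v in P.items()}                    # stated: P'_xy = tanh(P_xy)
    # assumed reading: sum the weighted fusion feature matrices over all modality pairs
    return sum(P_norm[(x, y)] @ F[x] for x in range(N) for y in range(N))
```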
According to the multi-modal feature fusion method, a plurality of heterogeneous data of a target object are obtained; aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data; respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix; determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix; aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix; for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix; and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. Therefore, different modal characteristics carried by heterogeneous data can be fused, and the characteristics of the target object can be more finely expressed.
Referring to fig. 3 and 4, fig. 3 is a schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application, and fig. 4 is a second schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application. As shown in fig. 3, the multi-modal feature fusion apparatus 300 includes:
a data obtaining module 310, configured to obtain multiple heterogeneous data of a target object;
a matrix extraction module 320, configured to extract, for each heterogeneous data, a monomodal feature matrix of the heterogeneous data;
a single-modal weight determining module 330, configured to determine, for each single-modal feature matrix, a single-modal weight matrix between the single-modal feature matrix and each single-modal feature matrix;
a fusion matrix determining module 340, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
a fusion weight determining module 350, configured to determine, for each fusion feature matrix, a fusion weight matrix between the fusion feature matrix and each fusion feature matrix;
a matrix normalization module 360, configured to perform normalization processing on the fusion weight matrix for each fusion weight matrix corresponding to the fusion feature matrix to obtain a normalized fusion weight matrix;
a multi-modal feature determination module 370, configured to determine a multi-modal feature matrix for describing the target object based on each fused feature matrix and a plurality of normalized fusion weight matrices corresponding to each fused feature matrix.
Further, as shown in fig. 4, when the heterogeneous data includes video data, the multi-modal feature fusion apparatus 300 further includes a video slicing module 380, and the video slicing module 380 is configured to:
segmenting the video data into multi-frame picture data;
When the heterogeneous data includes picture data, the matrix extraction module 320, when being configured to extract a single-mode feature matrix of the heterogeneous data, is configured to:
extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual skip-connection mechanism.
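As an illustrative sketch of the video slicing module 380 and of picture feature extraction with a residual (skip-connection) network, the following Python code uses OpenCV to slice a video into frames and a pre-trained torchvision ResNet-50 backbone to produce a per-picture feature vector; the sampling interval, the choice of ResNet-50, and a recent torchvision version are assumptions, not requirements of this application.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def video_to_frames(path, every_n=10):
    """Slice a video file into RGB frames, keeping every n-th frame."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return frames

# A residual network backbone with its classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def picture_features(frame):
    """Map one picture (H x W x 3 uint8 array) to a feature vector."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)   # (1, 3, 224, 224)
        return backbone(x).flatten(1)        # (1, 2048) feature vector
```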
Further, when the heterogeneous data includes audio data, the matrix extraction module 320, when being configured to extract the monomodal feature matrix of the heterogeneous data, is configured to:
converting the audio data into single-channel (mono) audio data, and performing resampling processing on the single-channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
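As an illustrative sketch of the audio branch, the following Python code (using librosa) converts the audio to mono, resamples it, applies a Hanning-windowed short-time Fourier transform, maps the spectrum through a mel filter bank, takes the logarithm, and re-groups the frames into fixed-length segments; the sampling rate, window length, number of mel bands, and segment length are assumed values, not values prescribed by this application.

```python
import numpy as np
import librosa

def audio_feature_matrix(path, sr=16000, win_len=400, hop=160, n_mels=64,
                         segment_frames=96):
    # Load as mono and resample to the target rate.
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Hanning-windowed short-time Fourier transform -> magnitude spectrum.
    spec = np.abs(librosa.stft(y, n_fft=win_len, hop_length=hop,
                               window="hann"))

    # Map the audio spectrum onto the mel scale with a filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=win_len, n_mels=n_mels)
    mel_spec = mel_fb @ spec

    # Logarithm for numerical stability (the "stable" representation above).
    log_mel = np.log(mel_spec + 1e-6)

    # Re-group frames into segments of a preset length (time axis first).
    n_seg = log_mel.shape[1] // segment_frames
    return log_mel[:, :n_seg * segment_frames].T.reshape(
        n_seg, segment_frames, n_mels)
```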
Further, when the heterogeneous data includes text data, the matrix extraction module 320, when being configured to extract a monomodal feature matrix of the heterogeneous data, is configured to:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of the picture feature group corresponding to the character feature group to obtain a preliminary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
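As a minimal sketch of how the text branch described above could be realized, the following PyTorch code fuses each sub-character feature with its corresponding picture feature, scores a multi-dimensional correlation, converts it into multi-dimensional attention weights, and applies the weights to the picture features; the additive preliminary fusion, the tanh scoring, and all tensor shapes and names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def text_modal_features(char_feats, pic_feats, proj):
    """char_feats: (L, d) matrix, one row per sub-character feature
       pic_feats:  (L, d) matrix, the picture feature aligned to each character
       proj:       a learnable (d, d) projection used to score correlation
       Returns an (L, d) single-modal feature matrix for the text data."""
    # Preliminary fusion of each sub-character feature with its picture feature.
    prelim = char_feats + pic_feats                  # (L, d) sub-fusion rows

    # Multi-dimensional correlation between each fused row and its picture row.
    corr = torch.tanh(prelim @ proj) * pic_feats     # (L, d)

    # Multi-dimensional attention weights per sub-character feature.
    attn = F.softmax(corr, dim=-1)                   # (L, d)

    # Attention-weighted picture features form the text feature matrix.
    return attn * pic_feats
```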
Further, when the matrix extraction module 320 is configured to determine the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group, the matrix extraction module 320 is configured to:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, when the matrix extraction module 320 is configured to fuse the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group, the matrix extraction module 320 is configured to:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, when the matrix extraction module 320 is configured to determine a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, the matrix extraction module 320 is configured to:
and for each sub-text feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-text feature based on the sub-fusion matrix of the sub-text feature and the sub-picture matrix of the sub-picture feature corresponding to that sub-text feature.
Further, the matrix extraction module 320 is configured to obtain a feature extraction model corresponding to the text feature group through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
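As an illustrative sketch of the model distillation step, the following PyTorch code trains a smaller student feature extractor to reproduce the features of the pre-trained language model (the teacher) on one-hot character-group inputs; the feature-matching plus softened KL loss, the temperature, and the assumption that teacher and student produce features of the same width are illustrative choices, not the specific distillation procedure of this application.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, one_hot_batch, optimizer, T=2.0):
    """One training step transferring the teacher's feature behaviour to a
    smaller student; both map one-hot character-group vectors to features."""
    with torch.no_grad():
        t_feat = teacher(one_hot_batch)      # (B, d) teacher features

    s_feat = student(one_hot_batch)          # (B, d) student features
                                             # (same width assumed here)

    # Feature-matching loss plus a softened distribution-matching term.
    loss = F.mse_loss(s_feat, t_feat) + F.kl_div(
        F.log_softmax(s_feat / T, dim=-1),
        F.softmax(t_feat / T, dim=-1),
        reduction="batchmean") * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```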
The multi-modal feature fusion device provided by the embodiment of the application obtains a plurality of heterogeneous data of a target object; for each heterogeneous data, a single-modal feature matrix of the heterogeneous data is extracted; for each single-modal feature matrix, a single-modal weight matrix between that single-modal feature matrix and each single-modal feature matrix is determined; a fusion feature matrix of the single-modal feature matrix is determined based on the single-modal feature matrix and the plurality of single-modal weight matrices corresponding to it; for each fusion feature matrix, a fusion weight matrix between that fusion feature matrix and each fusion feature matrix is determined; each fusion weight matrix corresponding to the fusion feature matrix is normalized to obtain a normalized fusion weight matrix; and a multi-modal feature matrix describing the target object is determined based on each fusion feature matrix and the plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. In this way, the different modal features carried by the heterogeneous data can be fused, and the features of the target object can be expressed more finely.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the multi-modal feature fusion method in the method embodiment shown in fig. 1 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the multimodal feature fusion method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-modal feature fusion method, the multi-modal feature fusion method comprising:
acquiring a plurality of heterogeneous data of a target object;
aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data;
respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix;
determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix;
for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix;
and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix.
2. The multimodal feature fusion method of claim 1, wherein when the heterogeneous data includes audio data, the extracting the monomodal feature matrix of the heterogeneous data comprises:
converting the audio data into single-channel (mono) audio data, and performing resampling processing on the single-channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
3. The multimodal feature fusion method of claim 1, wherein when the heterogeneous data includes text data, the extracting the monomodal feature matrix of the heterogeneous data comprises:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of the picture feature group corresponding to the character feature group to obtain a preliminary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
4. The method according to claim 3, wherein the determining the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group comprises:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
5. The multi-modal feature fusion method of claim 3, wherein the fusing the text feature matrix of the text feature set with the picture feature matrix of the picture feature set corresponding to the text feature set to obtain the preliminary fusion matrix of the text feature set comprises:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
6. The method of claim 5, wherein the determining a multi-dimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature, based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, comprises:
and for each sub-text feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-text feature based on the sub-fusion matrix of the sub-text feature and the sub-picture matrix of the sub-picture feature corresponding to that sub-text feature.
7. The multimodal feature fusion method of claim 1 wherein when the heterogeneous data includes video data, the multimodal feature fusion method further comprises:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the extracting a monomodal feature matrix of the heterogeneous data includes:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual skip-connection mechanism.
8. A multimodal feature fusion apparatus, the multimodal feature fusion apparatus comprising:
the data acquisition module is used for acquiring a plurality of heterogeneous data of the target object;
the matrix extraction module is used for extracting a monomodal feature matrix of the heterogeneous data aiming at each heterogeneous data;
the single-mode weight determining module is used for respectively determining a single-mode weight matrix between each single-mode feature matrix and the single-mode feature matrix aiming at each single-mode feature matrix;
a fusion matrix determining module, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
the fusion weight determining module is used for respectively determining fusion weight matrixes between the fusion feature matrix and each fusion feature matrix aiming at each fusion feature matrix;
the matrix normalization module is used for normalizing the fusion weight matrix aiming at each fusion weight matrix corresponding to the fusion characteristic matrix to obtain a normalized fusion weight matrix;
and the multi-modal characteristic determining module is used for determining a multi-modal characteristic matrix used for describing the target object based on each fusion characteristic matrix and a plurality of normalized fusion weight matrixes corresponding to each fusion characteristic matrix.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is run, the machine-readable instructions when executed by the processor performing the steps of the multimodal feature fusion method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the multimodal feature fusion method as claimed in any one of claims 1 to 7.
CN202111626977.0A 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium Pending CN114332575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111626977.0A CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111626977.0A CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114332575A true CN114332575A (en) 2022-04-12

Family

ID=81014442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111626977.0A Pending CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114332575A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052171A (en) * 2023-03-31 2023-05-02 国网数字科技控股有限公司 Electronic evidence correlation calibration method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
KR102413067B1 (en) Method and device for updating language model and performing Speech Recognition based on language model
US20110224982A1 (en) Automatic speech recognition based upon information retrieval methods
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN112687258B (en) Speech synthesis method, apparatus and computer storage medium
JP7355865B2 (en) Video processing methods, apparatus, devices and storage media
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN113051371A (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
CN114461852A (en) Audio and video abstract extraction method, device, equipment and storage medium
CN112188306A (en) Label generation method, device, equipment and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN110738061A (en) Ancient poetry generation method, device and equipment and storage medium
CN114332575A (en) Multi-modal feature fusion method and device, electronic equipment and readable storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114881668A (en) Multi-mode-based deception detection method
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
KR102129575B1 (en) Word spelling correction system
CN113948064A (en) Speech synthesis and speech recognition
CN116541551A (en) Music classification method, music classification device, electronic device, and storage medium
WO2012134396A1 (en) A method, an apparatus and a computer-readable medium for indexing a document for document retrieval
JP2016162437A (en) Pattern classification device, pattern classification method and pattern classification program
Gedam et al. Development of automatic speech recognition of Marathi numerals-a review
CN115169368A (en) Machine reading understanding method and device based on multiple documents
Wang et al. Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination