CN114332575A - Multi-modal feature fusion method and device, electronic equipment and readable storage medium


Info

Publication number
CN114332575A
CN114332575A (application CN202111626977.0A)
Authority
CN
China
Prior art keywords
feature
matrix
fusion
sub
character
Prior art date
Legal status
Pending
Application number
CN202111626977.0A
Other languages
Chinese (zh)
Inventor
覃祥坤
Current Assignee
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202111626977.0A
Publication of CN114332575A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a multi-modal feature fusion method and device, an electronic device and a readable storage medium. The method includes: acquiring a plurality of heterogeneous data of a target object; extracting, for each heterogeneous data, a single-mode feature matrix of that heterogeneous data; determining, for each single-mode feature matrix, a fusion feature matrix based on the single-mode feature matrix and its corresponding plurality of single-mode weight matrices; determining, for each fusion feature matrix, a fusion weight matrix between that fusion feature matrix and every fusion feature matrix; normalizing each fusion weight matrix to obtain normalized fusion weight matrices; and determining a multi-modal feature matrix describing the target object based on each fusion feature matrix and the plurality of normalized fusion weight matrices corresponding to it. In this way, the different modal features carried by the heterogeneous data can be fused, so that the features of the target object are expressed in finer detail.

Description

Multi-modal feature fusion method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer processing technologies, and in particular, to a multimodal feature fusion method and apparatus, an electronic device, and a readable storage medium.
Background
A modality is a manner in which a target object occurs or exists; single modality means that a target object exists in only one manner, while multi-modality means that the same target object can occur or exist as a combination of two or more modalities at the same time. Data or information from the same data source may be referred to as a modality feature (Modality); common single-modality features include video, text, pictures and audio.
The reason for fusing the single-modal features of data from different data sources into multi-modal features is that different modal features represent different aspects of a target object; in other words, different modal features view the same target object from different angles. Complementary information exists between different modal features, and if these features can be fused together, the characteristics of the object itself can be characterized more finely. How to fuse different modal features has therefore become an urgent problem to be solved.
Disclosure of Invention
In view of the above, an object of the present application is to provide a multi-modal feature fusion method, apparatus, electronic device and readable storage medium, which can fuse different modal features carried by heterogeneous data to more finely express features of a target object.
The embodiment of the application provides a multi-modal feature fusion method, which comprises the following steps:
acquiring a plurality of heterogeneous data of a target object;
aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data;
respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix;
determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix;
for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix;
and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix.
Further, when the heterogeneous data includes audio data, the extracting a monomodal feature matrix of the heterogeneous data includes:
converting the audio data into single sound channel audio data, and performing resampling processing on the single sound channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
Further, when the heterogeneous data includes text data, the extracting a monomodal feature matrix of the heterogeneous data includes:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
Further, the determining the character feature matrix of each character feature group and the picture feature matrix of the picture feature group corresponding to each character feature group includes:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, the fusing the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group includes:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, the determining a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group includes:
and aiming at each sub-document feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-document feature based on the sub-fusion matrix of the sub-document feature and the sub-picture matrix of the sub-document feature corresponding to the sub-picture feature.
Further, a feature extraction model corresponding to the character feature group is obtained through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Further, when the heterogeneous data includes video data, the multimodal feature fusion method further includes:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the extracting a monomodal feature matrix of the heterogeneous data includes:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
The embodiment of the present application further provides a multi-modal feature fusion apparatus, where the multi-modal feature fusion apparatus includes:
the data acquisition module is used for acquiring a plurality of heterogeneous data of the target object;
the matrix extraction module is used for extracting a monomodal feature matrix of the heterogeneous data aiming at each heterogeneous data;
the single-mode weight determining module is used for respectively determining a single-mode weight matrix between each single-mode feature matrix and the single-mode feature matrix aiming at each single-mode feature matrix;
a fusion matrix determining module, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
the fusion weight determining module is used for respectively determining fusion weight matrixes between the fusion feature matrix and each fusion feature matrix aiming at each fusion feature matrix;
the matrix normalization module is used for normalizing the fusion weight matrix aiming at each fusion weight matrix corresponding to the fusion characteristic matrix to obtain a normalized fusion weight matrix;
and the multi-modal characteristic determining module is used for determining a multi-modal characteristic matrix used for describing the target object based on each fusion characteristic matrix and a plurality of normalized fusion weight matrixes corresponding to each fusion characteristic matrix.
Further, when the heterogeneous data includes audio data, the matrix extraction module, when extracting a monomodal feature matrix of the heterogeneous data, is configured to:
converting the audio data into single sound channel audio data, and performing resampling processing on the single sound channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
Further, when the heterogeneous data includes text data, the matrix extraction module, when being configured to extract a monomodal feature matrix of the heterogeneous data, is configured to:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
Further, when the matrix extraction module is configured to determine the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group, the matrix extraction module is configured to:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, when the matrix extraction module is configured to fuse the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group, the matrix extraction module is configured to:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, when the matrix extraction module is configured to determine a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, the matrix extraction module is configured to:
and aiming at each sub-document feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-document feature based on the sub-fusion matrix of the sub-document feature and the sub-picture matrix of the sub-document feature corresponding to the sub-picture feature.
Further, the matrix extraction module is configured to obtain a feature extraction model corresponding to the text feature group through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Further, when the heterogeneous data includes video data, the multimodal feature fusion apparatus further includes a video segmentation module, and the video segmentation module is configured to:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the matrix extraction module, when extracting a monomodal feature matrix of the heterogeneous data, is configured to:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the multimodal feature fusion method as described above.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the multimodal feature fusion method as described above.
The multi-modal feature fusion method, the multi-modal feature fusion device, the electronic equipment and the readable storage medium, provided by the embodiment of the application, are used for acquiring a plurality of heterogeneous data of a target object; aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data; respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix; determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix; aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix; for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix; and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. Therefore, different modal characteristics carried by heterogeneous data can be fused, and the characteristics of the target object can be more finely expressed.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of a multi-modal feature fusion method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a single-mode feature matrix extraction process of audio data according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present disclosure;
fig. 4 is a second schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
Research shows that different modal features represent different aspects of a target object; in other words, different modal features view the same target object from different angles. In essence, some overlapping information exists between different modal features (which also leads to information redundancy), but beyond this overlap, and more importantly, complementary information also exists between different modal features. If different modal features can be fused together, the features of the object itself can be characterized more finely.
Currently, there are three main common ways of multi-modal feature fusion: first, front-end fusion (early-fusion), namely data-level fusion (data-level fusion); second, post-fusion (late-fusion), i.e. decision-level fusion; and thirdly, intermediate fusion (intermediate-fusion).
Front-end fusion fuses multiple independent data sets into a single feature vector, which is then input into a machine learning classifier. Front-end fusion of multi-modal data often cannot fully exploit the complementarity among the modalities, and the raw data being fused usually contains a large amount of redundant information.
Back-end fusion fuses the outputs (scores or decisions) of classifiers trained separately on the data of each modality. The advantage is that the errors of the fused model come from different classifiers, and errors from different classifiers are usually uncorrelated, do not affect each other, and do not accumulate further. Common back-end fusion methods include max-fusion, averaged-fusion, Bayes' rule based fusion, ensemble learning, and the like. Among them, ensemble learning is a typical representative of back-end fusion and is widely applied in research fields such as communication, computer recognition and speech recognition.
The intermediate fusion means that different modal data are firstly converted into high-dimensional characteristic expression and then fused in the intermediate layer of the model. Taking a neural network as an example, the intermediate fusion firstly converts the original data into a high-dimensional feature expression by using the neural network, and then obtains the commonality of different modal data on a high-dimensional space.
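As an illustrative sketch only (not part of the claimed method), intermediate fusion in a neural network can look like the following; the module names, dimensions and two-modality setup are assumptions chosen for illustration.

```python
# Illustrative sketch of intermediate fusion: each modality is first encoded into
# a high-dimensional feature, and the features are fused in an intermediate layer.
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # fusion happens in the middle of the model, on the high-dimensional features
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        a = self.audio_encoder(audio_feat)
        t = self.text_encoder(text_feat)
        fused = torch.relu(self.fusion(torch.cat([a, t], dim=-1)))
        return self.classifier(fused)
```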
Based on this, the embodiment of the application provides a multi-modal feature fusion method, which can fuse the single-modal features of heterogeneous data from different data sources, and can express the features of a target object in more detail.
Referring to fig. 1, fig. 1 is a flowchart of a multi-modal feature fusion method according to an embodiment of the present application. As shown in fig. 1, an embodiment of the present application provides a multi-modal feature fusion method, including:
s101, acquiring a plurality of heterogeneous data of the target object.
And S102, aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data.
S103, respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix.
Taking as an example the case where N single-mode feature matrices are extracted from the plurality of heterogeneous data of the target object, each single-mode feature matrix corresponds to N single-mode weight matrices; these include the single-mode weight matrix between each single-mode feature matrix and itself.
And S104, determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix.
Here, each single-mode feature matrix can be calculated to obtain a corresponding fusion feature matrix.
And S105, aiming at each fusion feature matrix, respectively determining a fusion weight matrix between the fusion feature matrix and each fusion feature matrix.
Here, each monomodal feature matrix corresponds to one fusion feature matrix, and each fusion feature matrix can be calculated to obtain N corresponding fusion weight matrices; here, a fusion weight matrix between each fused feature matrix and itself is included.
And S106, aiming at each fusion weight matrix corresponding to the fusion characteristic matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix.
Here, steps S103 to S106 are executed for the single-mode feature matrix of each heterogeneous data, so as to realize modal feature fusion between each heterogeneous data and the other heterogeneous data, and thereby obtain a fusion feature matrix corresponding to each heterogeneous data and a plurality of normalized fusion weight matrices corresponding to that fusion feature matrix.
S107, determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of fusion weight matrixes corresponding to each fusion feature matrix.
Here, heterogeneous data refers to data with different data structures coming from different data sources; for the same target object, multiple pieces of heterogeneous data related to that object can generally be acquired. For example, suppose target object A participates in a conference and speaks there: video data, picture data, audio data and text data about target object A can be acquired at the conference. The video data and the audio data are captured by different acquisition devices and therefore come from different data sources, so the two are heterogeneous data; likewise, the video data, picture data, audio data and text data are mutually heterogeneous. That is, the heterogeneous data of the target object includes one or more of video data, picture data, audio data and text data.
In step S102, for each heterogeneous data, if a subsequent feature fusion operation is to be performed on the heterogeneous data, first, a monomodal feature matrix of the heterogeneous data needs to be extracted from the heterogeneous data; for heterogeneous data acquired by different data sources, the extraction modes of the monomodal feature matrix for extracting the heterogeneous data are naturally different.
Further, please refer to fig. 2, fig. 2 is a schematic diagram illustrating a process of extracting a single-mode feature matrix of audio data according to an embodiment of the present disclosure. As shown in fig. 2, step S102 includes:
step S201, converting the audio data into monaural audio data, and performing resampling processing on the monaural audio data to obtain resampled audio data.
In this step, the acquired audio data is converted into mono audio data, and the converted mono audio data is resampled to a preset frequency to obtain the resampled audio data.
Here, the preset frequency may be 16 kHz.
Step S202, moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data.
In this step, a Hanning (Hann) time window of a preset window length is moved over the resampled audio data according to a preset window shift; a short-time Fourier transform is applied to the resampled audio data within the Hanning window at each position, until all of the resampled audio data has been transformed, yielding the audio spectrum of the audio data.
Here, the preset window length may be 25ms, and the preset window shift may be 10 ms.
Step S203, mapping the audio frequency spectrum to an initial mel frequency cepstrum by using a filter bank.
In this step, the obtained audio frequency spectrum is mapped into a 64-order mel filter bank, and an initial mel-frequency cepstrum of the audio data is calculated through the filter.
And S204, carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum.
In this step, in order to obtain a stable Mel cepstrum, a logarithm is taken of the computed initial Mel cepstrum. To avoid taking the logarithm of a spectrum value of 0, a bias constant is added during the calculation so that no value of 0 occurs; specifically, the stable Mel cepstrum is calculated by the following formula:
mel_s = log(mel-spectrum + a);
where mel_s is the stable Mel cepstrum, log denotes the logarithm, mel-spectrum is the initial Mel cepstrum, and a is a bias constant, usually taken as 0.01.
And S205, recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode feature matrix of the audio data.
In this step, the computed stable Mel cepstrum is re-framed with a duration of 0.96 s, and feature data capable of representing the audio data are extracted; there are no overlapping frames in the reconstructed stable Mel cepstrum, each frame of feature data contains 64 Mel frequency bands, and its time length is 10 ms (that is, 96 frames of feature data are obtained within each 0.96 s segment after reconstruction). The single-mode feature matrix of the audio data is then constructed from these frames of feature data.
It should be noted that the output data format of the single-mode feature matrix corresponding to the audio data in the present application is [nums_frames, 128]; that is, the output is a 128-dimensional high-level feature vector carrying semantics and meaning for each frame, where nums_frames is the number of frames of feature data, determined by the total audio duration and the 0.96 s re-framing duration.
Here, 0.96 indicates that, when the single-mode feature matrix of the audio data is extracted, the data is re-framed with a duration of 0.96 s.
In another embodiment, when the heterogeneous data includes text data, step S102 includes:
step 1: performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form.
In the step, natural language preprocessing is performed on the acquired text data, and a plurality of character feature groups are respectively extracted from the text data, wherein the extraction can be performed according to the form of features, such as Chinese characters, phrases, sentences and the like; thus, the extracted character feature group comprises at least two of Chinese character features, phrase features and sentence features.
Here, the text data may be split character by character into a plurality of sub-character features. For example, the sentence "咬定青山不放松" ("bite firmly into the green mountain and never let go") can be split into the single characters 咬, 定, 青, 山, 不, 放 and 松, so that 咬 may be used as one sub-character feature in a Chinese-character feature group, and so on; the sub-character features in the Chinese-character feature group are these characters. The split Chinese characters are then each converted into a corresponding one-hot vector form; for example, 咬: [0, 1, 0, 0]; 定: [0, 1, 1, 0].
The text data can likewise be split into phrase-level sub-character features. For example, "咬定青山不放松" can be split into the phrases 咬定 ("bite firmly"), 青山 ("green mountain"), 不 ("not") and 放松 ("let go"), so that 咬定 can be used as one sub-character feature in a phrase feature group, and so on. The split phrases are then each converted into a corresponding one-hot vector form; for example, 咬定: [1, 1, 0, 0]; 青山: [0, 1, 1, 1].
The text data can also be split into sentence-level sub-character features; for example, the whole sentence "咬定青山不放松" can be taken as one sub-character feature in the sentence feature group, and the split sentences are converted into corresponding one-hot vector forms.
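Purely to illustrate the data layout, the sketch below builds the character-level, phrase-level and sentence-level feature groups for the example sentence and encodes each sub-character feature as a one-hot vector; the small vocabulary, the phrase segmentation and the strict one-hot encoding (a single 1 per vector) are assumptions for illustration.

```python
# Illustrative sketch: splitting text into character / phrase / sentence feature
# groups and encoding each sub-character feature as a one-hot vector.
import numpy as np

def one_hot_group(sub_features):
    vocab = {f: i for i, f in enumerate(sorted(set(sub_features)))}
    vectors = np.zeros((len(sub_features), len(vocab)))
    for row, f in enumerate(sub_features):
        vectors[row, vocab[f]] = 1.0
    return vectors

sentence = "咬定青山不放松"
char_group = list(sentence)                        # Chinese-character feature group
phrase_group = ["咬定", "青山", "不", "放松"]       # phrase feature group (assumed segmentation)
sentence_group = [sentence]                        # sentence feature group

groups = {name: one_hot_group(g)
          for name, g in [("char", char_group),
                          ("phrase", phrase_group),
                          ("sentence", sentence_group)]}
```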
The picture feature group consists of traditional-Chinese-character and/or pictographic-Chinese-character pictures corresponding to each sub-character feature in the character feature group to which the picture feature group corresponds. In the above example, the picture feature group corresponding to the Chinese-character feature group comprises the traditional and/or pictographic character picture corresponding to 咬, the traditional and/or pictographic character picture corresponding to 定, and so on; the picture feature group corresponding to the phrase feature group comprises the traditional and/or pictographic character-group picture corresponding to 咬定; and the picture feature group corresponding to the sentence feature group comprises the traditional and/or pictographic character pictures of the whole sentence "咬定青山不放松".
When extracting the character feature group, determining a picture feature group corresponding to each character feature group, wherein the character feature group comprises a plurality of sub-character features; correspondingly, the picture feature group includes sub-picture features corresponding to each sub-word feature.
Specifically, when the character feature group is a Chinese-character feature group, the picture feature group corresponding to it comprises the traditional and/or pictographic Chinese-character pictures corresponding to each sub-character feature (each Chinese character) in the Chinese-character feature group;
when the character feature group is a phrase feature group, the picture feature group corresponding to it comprises the traditional and/or pictographic Chinese-character-group picture corresponding to each sub-character feature (each phrase) in the phrase feature group;
when the character feature group is a sentence feature group, the picture feature group corresponding to it comprises sentences composed of traditional Chinese characters and/or sentences composed of pictographic Chinese characters corresponding to each sub-character feature (each sentence) in the sentence feature group.
Step 2: and determining a character feature matrix of each character feature group and a picture feature matrix of the picture feature group corresponding to each character feature group.
In this step, in order to perform feature fusion in the subsequent process, the expression form of each character feature group needs to be unified and converted into data recognizable by a computer, and the character feature matrix of each character feature group is determined respectively, that is, each sub-character feature in each character feature group is presented in the form of a matrix.
Specifically, when the character feature group is a Chinese character feature group, the sub-character feature is a Chinese character, and the Chinese character is converted into a matrix form, namely the Chinese character is represented in the matrix form; when the character characteristic group is a phrase characteristic group, the sub-character characteristic is a phrase, and the phrase is converted into a matrix form, namely the phrase is represented in the matrix form; when the text feature group is a sentence feature group, the sub-text feature is a sentence, and the sentence is converted into a matrix form, i.e., the sentence is represented in the matrix form.
In one embodiment, step 2 comprises: for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group; and determining the picture feature matrix of the picture feature group corresponding to the character feature group by using a pre-trained convolutional neural network.
In the step, the mode of determining the character characteristic matrix of the character characteristic group is different from the mode of determining the picture characteristic matrix of the picture characteristic group; for the character feature group, a character feature matrix capable of representing each sub-character feature (Chinese character, phrase and sentence) in the character feature group is determined by utilizing a pre-trained feature extraction model corresponding to the character feature group.
For the picture feature group, a picture feature matrix of the picture feature group corresponding to the character feature group is determined by using a pre-trained convolutional neural network.
In one embodiment, the feature extraction model corresponding to the text feature group is obtained by:
firstly, acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from the one-hot vector of the character feature group.
Here, the language pre-training model is obtained by training on training data; however, because a language pre-training model (e.g., the BERT model) has many parameters and a large scale, it cannot be used on a computer of ordinary performance, i.e., it has poor practicability. In this case, the scale of the language pre-training model needs to be compressed while keeping the accuracy of the original language pre-training model.
And then, carrying out model distillation processing on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
Here, the feature extraction model is obtained by compressing the language pre-training model, and the feature extraction model can overcome the limitation of the language pre-training model on the length of the sub-text feature while maintaining the better generalization performance of the language pre-training model (e.g., Bert model).
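The application does not spell out the distillation objective. The sketch below shows one common form of model distillation (temperature-scaled soft-label matching between a large teacher and a smaller student), offered only as an assumed illustration of how the language pre-training model could be compressed into the feature extraction model.

```python
# Assumed illustration of model distillation: a smaller student model is trained
# to match the soft outputs of the large language pre-training model (teacher).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both distributions with a temperature, then match them with KL divergence
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * (temperature ** 2)
```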
And step 3: and for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of a picture feature group corresponding to the character feature group to obtain a primary fusion matrix of the character feature group.
In one embodiment, step 3 comprises: for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic; and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Determining a sub-text matrix of each sub-text feature in a text feature group, and determining a sub-picture matrix of the sub-picture feature corresponding to each sub-text feature from the picture feature matrix of the picture feature group corresponding to the text feature group;
aiming at each sub-character characteristic, fusing a sub-character matrix of the sub-character characteristic and a sub-picture matrix of a corresponding sub-picture to obtain a sub-fusion matrix of the sub-character characteristic;
and aiming at each character feature group, determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature in the character feature group.
Specifically, the sub-fusion matrix of each sub-character feature is determined by the following formula:
[formula presented in the source only as an image; not reproduced]
where u_i^w is the sub-fusion matrix of the i-th sub-character feature in the w-th character feature group, h_i^w is the sub-character matrix of the i-th sub-character feature in the w-th character feature group, and v_i^w is the sub-picture matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group.
And 4, step 4: and for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the fusion feature matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group.
In one embodiment, step 4 comprises: for each sub-character feature in the character feature group, determining a multidimensional correlation matrix of the sub-character feature based on the sub-fusion matrix of the sub-character feature and the sub-picture matrix of the sub-picture feature corresponding to the sub-character feature.
In this step, for each sub-character feature in each character feature group, a multidimensional correlation matrix between the sub-character feature and the corresponding sub-picture feature is calculated based on the sub-fusion matrix of the sub-character feature and the picture feature matrix of the sub-picture feature corresponding to the sub-character feature; specifically, the multidimensional correlation matrix is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where e_i^w is the multidimensional correlation matrix of the i-th sub-character feature in the w-th character feature group, u_i^w is the sub-fusion matrix of the i-th sub-character feature in the w-th character feature group, and v_i^w is the picture feature matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group.
And 5: and determining the multidimensional attention weight of the sub-character features based on the multidimensional correlation matrix of each sub-character feature in the character feature group.
In this step, the multidimensional attention weight of each sub-character feature is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where a_i^w is the multidimensional attention weight of the i-th sub-character feature in the w-th character feature group, e_i^w is the multidimensional correlation matrix of the i-th sub-character feature in the w-th character feature group, and z is the number of sub-character features in the w-th character feature group.
Step 6: and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
In this step, a monomodal feature matrix of the text data is determined by the following formula:
[formula presented in the source only as an image; not reproduced]
where S is the single-mode feature matrix of the text data, a_i^w is the multidimensional attention weight of the i-th sub-character feature in the w-th character feature group, v_i^w is the picture feature matrix of the sub-picture feature corresponding to the i-th sub-character feature in the w-th character feature group, z is the number of sub-character features in the w-th character feature group, and l is the number of character feature groups.
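Since the formulas of steps 3 to 6 appear in the source only as images, the sketch below is an assumed reading of the overall flow: an element-wise sub-fusion, an element-wise correlation, a softmax attention weight over the z sub-character features of each group, and an attention-weighted sum of the sub-picture matrices over all l groups. The specific operations are assumptions; only the sequence of steps follows the text, and all matrices in a group are assumed to share one shape.

```python
# Assumed sketch of steps 3-6 above (exact formulas not reproduced in the source).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def text_single_mode_matrix(groups):
    """groups: list (length l) of lists of (h_i, v_i) pairs, where h_i is the
    sub-character matrix and v_i the corresponding sub-picture matrix."""
    total = None
    for group in groups:                               # w = 1..l character feature groups
        # step 3: sub-fusion matrix u_i of each sub-character feature (assumed: element-wise sum)
        u = [h + v for h, v in group]
        # step 4: multidimensional correlation of u_i with v_i (assumed: element-wise product)
        e = [u_i * v_i for u_i, (_, v_i) in zip(u, group)]
        # step 5: multidimensional attention weight of each sub-character feature
        scores = np.array([e_i.sum() for e_i in e])    # scalar score per sub-feature (assumption)
        a = softmax(scores)
        # step 6: attention-weighted sum of the sub-picture matrices
        weighted = sum(a_i * v_i for a_i, (_, v_i) in zip(a, group))
        total = weighted if total is None else total + weighted
    return total                                       # single-mode feature matrix of the text data
```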
In another embodiment, when the heterogeneous data includes video data, the multimodal feature fusion further includes: and segmenting the video data into multi-frame picture data.
In this step, if the acquired heterogeneous data is video data, the video data is divided into multi-frame picture data according to the form of processable picture data, so as to extract the subsequent single-mode feature matrix.
In another embodiment, when the received video data is sliced into multi-frame picture data or the received heterogeneous data is picture data, step S102 includes: and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual error jumping mechanism.
In the step, a deep neural network with a residual error jump mechanism is obtained through training of training set data in advance, and a monomodal feature matrix capable of representing the picture data is extracted from each picture data by using the deep neural network with the residual error jump mechanism.
Here, for video data and picture data, the present application introduces ResNet, a deep neural network with a residual skip structure; the residual network structure alleviates the gradient vanishing, gradient explosion and network degradation problems caused by increasing the number of layers of a convolutional network, thereby improving the accuracy of the single-mode feature extraction results.
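The application names ResNet but gives no architecture details; the sketch below is an assumed illustration showing how video data can be sliced into frames with OpenCV and how a minimal residual block carries the skip ("jump") connection that mitigates gradient vanishing. Layer sizes are illustrative.

```python
# Assumed sketch: slice video into frames and pass each frame through a network
# with residual (skip) connections, as in ResNet. Layer sizes are illustrative.
import cv2
import torch
import torch.nn as nn

def video_to_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # the residual "jump" connection
```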
And for each extracted single-mode feature matrix, performing feature fusion on every two single-mode feature matrices respectively to change the single-mode feature matrices into multi-mode feature matrices, thereby representing the target object more finely.
In step S103, a single-mode weight matrix between every two single-mode feature matrices is calculated by the following formula:
Q_xy = S_mn^x · (S_mn^y)^T;
where Q_xy is the single-mode weight matrix between the single-mode feature matrix of the x-th heterogeneous data and the single-mode feature matrix of the y-th heterogeneous data, S_mn^x is the single-mode feature matrix of the x-th heterogeneous data, (S_mn^y)^T is the transpose of the single-mode feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Taking as an example the case where N single-mode feature matrices S_mn^x are extracted from the plurality of heterogeneous data of the target object, each single-mode feature matrix S_mn^x corresponds to N single-mode weight matrices Q_xy; these include the single-mode weight matrix between each single-mode feature matrix and itself.
In step S104, the fusion feature matrix of each single-mode feature matrix is calculated by the following formula:
[formula presented in the source only as an image; not reproduced]
where F^x (denoted here as F^x, since the source shows this symbol only as an image) is the fusion feature matrix of the x-th heterogeneous data, S_mn^x is the single-mode feature matrix of the x-th heterogeneous data, (Q_xy)^T is the transpose of the single-mode weight matrix between the single-mode feature matrix of the x-th heterogeneous data and the single-mode feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Here, for the single-mode feature matrix S_mn^x of each heterogeneous data, a corresponding fusion feature matrix F^x can be obtained by calculation through the above formula.
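A numerical sketch of steps S103 and S104 follows. The single-mode weight matrix uses the form Q_xy = S^x · (S^y)^T recovered from the variable definitions above; the combination used for the fusion feature matrix is shown in the source only as an image, so the summation below is an assumption, and all single-mode feature matrices are assumed to share the same shape (m, n).

```python
# Sketch of steps S103-S104. Q_xy = S_x @ S_y.T follows the variable definitions
# above; the form of the fusion feature matrix F_x is an assumption.
import numpy as np

def single_mode_weights(S):
    """S: list of N single-mode feature matrices, assumed to share shape (m, n)."""
    return {(x, y): S[x] @ S[y].T for x in range(len(S)) for y in range(len(S))}

def fusion_feature_matrices(S, Q):
    # assumed reading: F_x sums, over all modalities y, the transposed weight
    # matrix Q_xy applied to the single-mode feature matrix S_x
    N = len(S)
    return [sum(Q[(x, y)].T @ S[x] for y in range(N)) for x in range(N)]
```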
In step S105, a fusion weight matrix between each pair of fusion feature matrices is calculated by the following formula:
P_xy = F^x · (F^y)^T;
where P_xy is the fusion weight matrix between the fusion feature matrix of the x-th heterogeneous data and that of the y-th heterogeneous data, F^x is the fusion feature matrix of the x-th heterogeneous data, (F^y)^T is the transpose of the fusion feature matrix of the y-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
Here, each single-mode feature matrix S_mn^x corresponds to one fusion feature matrix F^x, and for each fusion feature matrix F^x the corresponding N fusion weight matrices P_xy can be obtained by calculation through the above formula; these include the fusion weight matrix between each fusion feature matrix and itself.
In step S106, a normalized fusion weight matrix is calculated by the following formula:
P'_xy = tanh(P_xy);
where P'_xy is the normalized fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data, and P_xy is the fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data.
Here, for each fusion weight matrix P_xy, a corresponding normalized fusion weight matrix P'_xy can be calculated by the above formula.
Here, steps S103 to S106 are performed for the single-mode feature matrix of each heterogeneous data, so as to implement modal feature fusion between each heterogeneous data and the other heterogeneous data, thereby obtaining a fusion feature matrix corresponding to each heterogeneous data and a plurality of normalized fusion weight matrices corresponding to that fusion feature matrix.
In step S107, a multi-modal feature matrix describing the target object is determined by fusing the plurality of single-modal features of the target object according to the following formula:
[formula presented in the source only as an image; not reproduced]
where G_mn is the multi-modal feature matrix of the target object, P'_xy is the normalized fusion weight matrix between the x-th heterogeneous data and the y-th heterogeneous data, F^x is the fusion feature matrix of the x-th heterogeneous data, and m and n are the rows and columns of the matrix, respectively.
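Continuing the sketch for steps S105 to S107: only the tanh normalization is stated verbatim in the text; the forms of P_xy and of the final multi-modal feature matrix G are assumptions consistent with the variable definitions above, with all fusion feature matrices assumed to share one shape.

```python
# Sketch of steps S105-S107. The tanh normalization is given in the text;
# the exact forms of P_xy and of the final multi-modal matrix G are assumptions.
import numpy as np

def multi_modal_matrix(F):
    """F: list of N fusion feature matrices, assumed to share shape (m, n)."""
    N = len(F)
    P = {(x, y): F[x] @ F[y].T for x in range(N) for y in range(N)}   # P_xy (assumed form)
    P_norm = {k: np.tanh(v) for k, v in P.items()}                    # stated: P'_xy = tanh(P_xy)
    # assumed reading: sum the weighted fusion feature matrices over all modality pairs
    return sum(P_norm[(x, y)] @ F[x] for x in range(N) for y in range(N))
```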
According to the multi-modal feature fusion method, a plurality of heterogeneous data of a target object are obtained; aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data; respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix; determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix; aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix; for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix; and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. Therefore, different modal characteristics carried by heterogeneous data can be fused, and the characteristics of the target object can be more finely expressed.
Referring to fig. 3 and 4, fig. 3 is a schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application, and fig. 4 is a second schematic structural diagram of a multi-modal feature fusion apparatus according to an embodiment of the present application. As shown in fig. 3, the multi-modal feature fusion apparatus 300 includes:
a data obtaining module 310, configured to obtain multiple heterogeneous data of a target object;
a matrix extraction module 320, configured to extract, for each heterogeneous data, a monomodal feature matrix of the heterogeneous data;
a single-modal weight determining module 330, configured to determine, for each single-modal feature matrix, a single-modal weight matrix between the single-modal feature matrix and each single-modal feature matrix;
a fusion matrix determining module 340, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
a fusion weight determining module 350, configured to determine, for each fusion feature matrix, a fusion weight matrix between the fusion feature matrix and each fusion feature matrix;
a matrix normalization module 360, configured to perform normalization processing on the fusion weight matrix for each fusion weight matrix corresponding to the fusion feature matrix to obtain a normalized fusion weight matrix;
a multi-modal feature determination module 370, configured to determine a multi-modal feature matrix for describing the target object based on each fused feature matrix and a plurality of normalized fusion weight matrices corresponding to each fused feature matrix.
Further, as shown in fig. 4, when the heterogeneous data includes video data, the multi-modal feature fusion apparatus 300 further includes a video slicing module 380, and the video slicing module 380 is configured to:
segmenting the video data into multi-frame picture data;
When the heterogeneous data includes picture data, the matrix extraction module 320, when being configured to extract a single-mode feature matrix of the heterogeneous data, is configured to:
extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual skip-connection mechanism.
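As an illustrative sketch of the video slicing module 380 and of picture feature extraction with a residual (skip-connection) network, the following Python code uses OpenCV to slice a video into frames and a pre-trained torchvision ResNet-50 backbone to produce a per-picture feature vector; the sampling interval, the choice of ResNet-50, and a recent torchvision version are assumptions, not requirements of this application.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def video_to_frames(path, every_n=10):
    """Slice a video file into RGB frames, keeping every n-th frame."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return frames

# A residual network backbone with its classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def picture_features(frame):
    """Map one picture (H x W x 3 uint8 array) to a feature vector."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)   # (1, 3, 224, 224)
        return backbone(x).flatten(1)        # (1, 2048) feature vector
```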
Further, when the heterogeneous data includes audio data, the matrix extraction module 320, when being configured to extract the monomodal feature matrix of the heterogeneous data, is configured to:
converting the audio data into single-channel (mono) audio data, and performing resampling processing on the single-channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
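As an illustrative sketch of the audio branch, the following Python code (using librosa) converts the audio to mono, resamples it, applies a Hanning-windowed short-time Fourier transform, maps the spectrum through a mel filter bank, takes the logarithm, and re-groups the frames into fixed-length segments; the sampling rate, window length, number of mel bands, and segment length are assumed values, not values prescribed by this application.

```python
import numpy as np
import librosa

def audio_feature_matrix(path, sr=16000, win_len=400, hop=160, n_mels=64,
                         segment_frames=96):
    # Load as mono and resample to the target rate.
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Hanning-windowed short-time Fourier transform -> magnitude spectrum.
    spec = np.abs(librosa.stft(y, n_fft=win_len, hop_length=hop,
                               window="hann"))

    # Map the audio spectrum onto the mel scale with a filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=win_len, n_mels=n_mels)
    mel_spec = mel_fb @ spec

    # Logarithm for numerical stability (the "stable" representation above).
    log_mel = np.log(mel_spec + 1e-6)

    # Re-group frames into segments of a preset length (time axis first).
    n_seg = log_mel.shape[1] // segment_frames
    return log_mel[:, :n_seg * segment_frames].T.reshape(
        n_seg, segment_frames, n_mels)
```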
Further, when the heterogeneous data includes text data, the matrix extraction module 320, when being configured to extract a monomodal feature matrix of the heterogeneous data, is configured to:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of the picture feature group corresponding to the character feature group to obtain a preliminary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
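As a minimal sketch of how the text branch described above could be realized, the following PyTorch code fuses each sub-character feature with its corresponding picture feature, scores a multi-dimensional correlation, converts it into multi-dimensional attention weights, and applies the weights to the picture features; the additive preliminary fusion, the tanh scoring, and all tensor shapes and names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def text_modal_features(char_feats, pic_feats, proj):
    """char_feats: (L, d) matrix, one row per sub-character feature
       pic_feats:  (L, d) matrix, the picture feature aligned to each character
       proj:       a learnable (d, d) projection used to score correlation
       Returns an (L, d) single-modal feature matrix for the text data."""
    # Preliminary fusion of each sub-character feature with its picture feature.
    prelim = char_feats + pic_feats                  # (L, d) sub-fusion rows

    # Multi-dimensional correlation between each fused row and its picture row.
    corr = torch.tanh(prelim @ proj) * pic_feats     # (L, d)

    # Multi-dimensional attention weights per sub-character feature.
    attn = F.softmax(corr, dim=-1)                   # (L, d)

    # Attention-weighted picture features form the text feature matrix.
    return attn * pic_feats
```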
Further, when the matrix extraction module 320 is configured to determine the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group, the matrix extraction module 320 is configured to:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
Further, when the matrix extraction module 320 is configured to fuse the text feature matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group to obtain a preliminary fusion matrix of the text feature group, the matrix extraction module 320 is configured to:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
Further, when the matrix extraction module 320 is configured to determine a multidimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, the matrix extraction module 320 is configured to:
and for each sub-text feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-text feature based on the sub-fusion matrix of the sub-text feature and the sub-picture matrix of the sub-picture feature corresponding to that sub-text feature.
Further, the matrix extraction module 320 is configured to obtain a feature extraction model corresponding to the text feature group through the following steps:
acquiring a pre-trained language pre-training model; the language pre-training model is used for extracting a character feature matrix representing each sub-character feature of the character feature group from one-hot vectors of the character feature group;
and carrying out model distillation treatment on the language pre-training model, and compressing parameters in the language pre-training model to obtain the feature extraction model.
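As an illustrative sketch of the model distillation step, the following PyTorch code trains a smaller student feature extractor to reproduce the features of the pre-trained language model (the teacher) on one-hot character-group inputs; the feature-matching plus softened KL loss, the temperature, and the assumption that teacher and student produce features of the same width are illustrative choices, not the specific distillation procedure of this application.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, one_hot_batch, optimizer, T=2.0):
    """One training step transferring the teacher's feature behaviour to a
    smaller student; both map one-hot character-group vectors to features."""
    with torch.no_grad():
        t_feat = teacher(one_hot_batch)      # (B, d) teacher features

    s_feat = student(one_hot_batch)          # (B, d) student features
                                             # (same width assumed here)

    # Feature-matching loss plus a softened distribution-matching term.
    loss = F.mse_loss(s_feat, t_feat) + F.kl_div(
        F.log_softmax(s_feat / T, dim=-1),
        F.softmax(t_feat / T, dim=-1),
        reduction="batchmean") * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```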
The multi-modal feature fusion device provided by the embodiment of the application obtains a plurality of heterogeneous data of a target object; for each heterogeneous data, a single-modal feature matrix of the heterogeneous data is extracted; for each single-modal feature matrix, a single-modal weight matrix between that single-modal feature matrix and each single-modal feature matrix is determined; a fusion feature matrix of the single-modal feature matrix is determined based on the single-modal feature matrix and the plurality of single-modal weight matrices corresponding to it; for each fusion feature matrix, a fusion weight matrix between that fusion feature matrix and each fusion feature matrix is determined; each fusion weight matrix corresponding to the fusion feature matrix is normalized to obtain a normalized fusion weight matrix; and a multi-modal feature matrix describing the target object is determined based on each fusion feature matrix and the plurality of normalized fusion weight matrices corresponding to each fusion feature matrix. In this way, the different modal features carried by the heterogeneous data can be fused, and the features of the target object can be expressed more finely.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the multi-modal feature fusion method in the method embodiment shown in fig. 1 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the multimodal feature fusion method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-modal feature fusion method, the multi-modal feature fusion method comprising:
acquiring a plurality of heterogeneous data of a target object;
aiming at each heterogeneous data, extracting a monomodal feature matrix of the heterogeneous data;
respectively determining a single-mode weight matrix between the single-mode feature matrix and each single-mode feature matrix aiming at each single-mode feature matrix;
determining a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
aiming at each fusion characteristic matrix, respectively determining a fusion weight matrix between the fusion characteristic matrix and each fusion characteristic matrix;
for each fusion weight matrix corresponding to the fusion feature matrix, carrying out normalization processing on the fusion weight matrix to obtain a normalized fusion weight matrix;
and determining a multi-modal feature matrix for describing the target object based on each fusion feature matrix and a plurality of normalized fusion weight matrices corresponding to each fusion feature matrix.
2. The multimodal feature fusion method of claim 1, wherein when the heterogeneous data includes audio data, the extracting the monomodal feature matrix of the heterogeneous data comprises:
converting the audio data into single-channel (mono) audio data, and performing resampling processing on the single-channel audio data to obtain resampled audio data;
moving a Hanning time window with a preset window length on the resampled audio data, and performing Fourier transform on the resampled audio data to obtain an audio frequency spectrum of the audio data;
mapping the audio spectrum to an initial mel-frequency cepstrum using a filter bank;
carrying out logarithmic calculation on the initial Mel cepstrum to obtain a stable Mel cepstrum;
and recombining the stable Mel cepstrum according to a preset time length to obtain a single-mode characteristic matrix of the audio data.
3. The multimodal feature fusion method of claim 1, wherein when the heterogeneous data includes text data, the extracting the monomodal feature matrix of the heterogeneous data comprises:
performing natural language preprocessing on the text data, and extracting a plurality of character feature groups and a picture feature group corresponding to each character feature group from the text data; the character feature group comprises at least two of Chinese character features, phrase features and sentence features; the expression form of the character feature group is a one-hot vector form;
determining a character feature matrix of each character feature group and a picture feature matrix of a picture feature group corresponding to each character feature group;
for each character feature group, fusing a character feature matrix of the character feature group and a picture feature matrix of the picture feature group corresponding to the character feature group to obtain a preliminary fusion matrix of the character feature group;
for each character feature group, determining a multidimensional correlation matrix between each sub-character feature in the character feature group and the sub-picture feature corresponding to the sub-character feature based on the preliminary fusion matrix of the character feature group and the picture feature matrix of the picture feature group corresponding to the character feature group;
determining a multidimensional attention weight of each sub-character feature based on a multidimensional correlation matrix of each sub-character feature in the character feature group;
and determining a monomodal feature matrix of the text data based on the multidimensional attention weight of each sub-character feature in each character feature group and the picture feature matrix of the sub-picture feature corresponding to each sub-character feature.
4. The method according to claim 3, wherein the determining the text feature matrix of each text feature group and the picture feature matrix of the picture feature group corresponding to each text feature group comprises:
for each character feature group, determining a character feature matrix capable of representing each sub-character feature in the character feature group by using a pre-trained feature extraction model corresponding to the character feature group;
and determining the picture characteristic matrix of the picture characteristic group corresponding to the character characteristic group by using a pre-trained convolutional neural network.
5. The multi-modal feature fusion method of claim 3, wherein the fusing the text feature matrix of the text feature set with the picture feature matrix of the picture feature set corresponding to the text feature set to obtain the preliminary fusion matrix of the text feature set comprises:
for each sub-text characteristic in the text characteristic group, fusing a sub-text matrix of the sub-text characteristic and a sub-picture matrix of the sub-picture characteristic corresponding to the sub-text characteristic to obtain a sub-fusion matrix of each sub-text characteristic;
and determining a preliminary fusion matrix of the character feature group based on the sub-fusion matrix of each sub-character feature.
6. The method of claim 5, wherein the determining a multi-dimensional correlation matrix between each sub-text feature in the text feature group and the sub-picture feature corresponding to the sub-text feature, based on the preliminary fusion matrix of the text feature group and the picture feature matrix of the picture feature group corresponding to the text feature group, comprises:
and for each sub-text feature in the text feature group, determining a multi-dimensional correlation matrix of the sub-text feature based on the sub-fusion matrix of the sub-text feature and the sub-picture matrix of the sub-picture feature corresponding to that sub-text feature.
7. The multimodal feature fusion method of claim 1 wherein when the heterogeneous data includes video data, the multimodal feature fusion method further comprises:
segmenting the video data into multi-frame picture data;
when the heterogeneous data includes picture data, the extracting a monomodal feature matrix of the heterogeneous data includes:
and extracting a monomodal feature matrix of the picture data by using a deep neural network with a residual skip-connection mechanism.
8. A multimodal feature fusion apparatus, the multimodal feature fusion apparatus comprising:
the data acquisition module is used for acquiring a plurality of heterogeneous data of the target object;
the matrix extraction module is used for extracting a monomodal feature matrix of the heterogeneous data aiming at each heterogeneous data;
the single-mode weight determining module is used for respectively determining a single-mode weight matrix between each single-mode feature matrix and the single-mode feature matrix aiming at each single-mode feature matrix;
a fusion matrix determining module, configured to determine a fusion feature matrix of the single-mode feature matrix based on the single-mode feature matrix and a plurality of single-mode weight matrices corresponding to the single-mode feature matrix;
the fusion weight determining module is used for respectively determining fusion weight matrixes between the fusion feature matrix and each fusion feature matrix aiming at each fusion feature matrix;
the matrix normalization module is used for normalizing the fusion weight matrix aiming at each fusion weight matrix corresponding to the fusion characteristic matrix to obtain a normalized fusion weight matrix;
and the multi-modal characteristic determining module is used for determining a multi-modal characteristic matrix used for describing the target object based on each fusion characteristic matrix and a plurality of normalized fusion weight matrixes corresponding to each fusion characteristic matrix.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is run, the machine-readable instructions when executed by the processor performing the steps of the multimodal feature fusion method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the multimodal feature fusion method as claimed in any one of claims 1 to 7.
CN202111626977.0A 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium Pending CN114332575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111626977.0A CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111626977.0A CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114332575A true CN114332575A (en) 2022-04-12

Family

ID=81014442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111626977.0A Pending CN114332575A (en) 2021-12-28 2021-12-28 Multi-modal feature fusion method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114332575A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052171A (en) * 2023-03-31 2023-05-02 国网数字科技控股有限公司 Electronic evidence correlation calibration method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
KR102413067B1 (en) Method and device for updating language model and performing Speech Recognition based on language model
US20110224982A1 (en) Automatic speech recognition based upon information retrieval methods
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN112687258B (en) Speech synthesis method, apparatus and computer storage medium
JP7355865B2 (en) Video processing methods, apparatus, devices and storage media
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN113051371A (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
CN114461852A (en) Audio and video abstract extraction method, device, equipment and storage medium
CN112188306A (en) Label generation method, device, equipment and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN110738061A (en) Ancient poetry generation method, device and equipment and storage medium
CN114332575A (en) Multi-modal feature fusion method and device, electronic equipment and readable storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114881668A (en) Multi-mode-based deception detection method
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
KR102129575B1 (en) Word spelling correction system
CN113948064A (en) Speech synthesis and speech recognition
CN116541551A (en) Music classification method, music classification device, electronic device, and storage medium
WO2012134396A1 (en) A method, an apparatus and a computer-readable medium for indexing a document for document retrieval
JP2016162437A (en) Pattern classification device, pattern classification method and pattern classification program
Gedam et al. Development of automatic speech recognition of Marathi numerals-a review
CN115169368A (en) Machine reading understanding method and device based on multiple documents
Wang et al. Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination