CN116204848A - Multi-modal fusion method, device, equipment and medium - Google Patents
- Publication number: CN116204848A (application number CN202310145041.9A)
- Authority
- CN
- China
- Prior art keywords
- vectors
- vector
- modal
- target feature
- mode
- Prior art date: 2023-02-17
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of artificial intelligence and provides a multi-modal fusion method, device, equipment and medium. The method comprises the following steps: encoding a plurality of modalities to obtain a feature vector for each modality; preprocessing each feature vector to obtain a corresponding target feature vector; setting a plurality of weight matrices for each target feature vector according to its dimension and multiplying to obtain temporary vectors; adding the temporary vectors element by element to obtain a modal vector for each modality; and fusing the modal vectors to obtain a total vector corresponding to the plurality of modalities. The method can be applied to electronic commerce and is realized through a neural network. The invention has the beneficial effects that more modal information is retained, so the fusion effect of the final total vector is better.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-modal fusion method, device, equipment and medium.
Background
Multi-modal learning has become a sustained research hot spot in recent years. A modality is a source or form of information: the same information may be represented as video, speech, image, text, and so on, and each form of representation is one modality of that information. In the field of electronic commerce, multi-modal feature fusion is generally performed by splicing the feature vectors of multiple modalities together; however, this approach loses much modal information and yields an unsatisfactory fusion effect.
Disclosure of Invention
The invention mainly aims to provide a multi-modal fusion method, device, equipment and medium, so as to solve the problem that existing multi-modal feature fusion loses much modal information and therefore produces an unsatisfactory fusion effect.
The invention provides a multi-modal fusion method, which comprises the following steps:
acquiring a plurality of modalities to be fused;
encoding each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
preprocessing each feature vector to obtain a target feature vector of that feature vector, where the preprocessing increases or decreases the dimension of the feature vector;
setting a plurality of weight matrices for each target feature vector according to the dimension corresponding to that target feature vector, where the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
multiplying each target feature vector by its corresponding weight matrices to obtain, for each target feature vector, a plurality of temporary vectors equal in number to the weight matrices;
adding the temporary vectors corresponding to each target feature vector element by element to obtain, for each target feature vector, a modal vector whose number of columns is the preset value;
and performing a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
Further, the step of preprocessing each feature vector to obtain its target feature vector includes:
appending a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
Further, the step of fusing the modal vectors to obtain a total vector corresponding to the plurality of modalities includes:
performing a vector inner product operation (element-wise multiplication) on the modal vectors to obtain the total vector corresponding to the plurality of modalities.
Further, the step of fusing the modal vectors to obtain a total vector corresponding to the plurality of modalities includes:
splicing the modal vectors to obtain a spliced vector;
inputting the spliced vector into a fully connected layer and multiplying it by the layer's weight matrix of size (n·m)×m to obtain the total vector, where n is the number of modal vectors and m is the preset value.
Further, the step of encoding each modality to be fused through an encoder to obtain the feature vector corresponding to each modality to be fused includes:
acquiring the expression form of each modality, where the expression forms include at least text, image, and speech;
setting a corresponding encoder according to the expression form of each modality;
and encoding each modality with its corresponding encoder to obtain the feature vector corresponding to that modality.
Further, after the step of fusing the modal vectors to obtain the total vector corresponding to the plurality of modalities, the method further includes:
acquiring multi-modal data samples, wherein the multi-modal data samples comprise a plurality of total vectors and corresponding actual recognition results;
inputting each total vector into a preset neural network model for recognition to obtain a predicted recognition result;
calculating a loss function for each multi-modal data sample from the actual recognition result and the predicted recognition result;
and updating parameters of the neural network model and/or updating the generated weight matrices using the loss function of each multi-modal data sample through a preset parameter adjustment strategy.
The invention also provides a multi-modal fusion device, which comprises:
an acquisition module, configured to acquire a plurality of modalities to be fused;
an encoding module, configured to encode each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
a preprocessing module, configured to preprocess each feature vector to obtain its target feature vector, where the preprocessing increases or decreases the dimension of the feature vector;
a setting module, configured to set a plurality of weight matrices for each target feature vector according to its dimension, where the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
a multiplication module, configured to multiply each target feature vector by its corresponding weight matrices to obtain a plurality of temporary vectors equal in number to the weight matrices;
an addition module, configured to add the temporary vectors corresponding to each target feature vector element by element to obtain a modal vector whose number of columns is the preset value;
and a fusion module, configured to perform a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
Further, the preprocessing module includes:
a preprocessing sub-module, configured to append a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
The invention also provides a computer device comprising a memory storing a computer program and a processor that implements the steps of any of the methods described above when executing the computer program.
The invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the methods described above.
The invention has the beneficial effects that: a plurality of modalities are encoded to obtain their feature vectors; the feature vectors are preprocessed to obtain the corresponding target feature vectors; a plurality of weight matrices are set for each target feature vector according to its dimension and multiplied with it to obtain temporary vectors; the temporary vectors are added element by element to obtain the modal vectors; and a fusion operation yields the total vector corresponding to the plurality of modalities. In this way more modal information is retained, and the fusion effect of the final total vector is better.
Drawings
FIG. 1 is a flow chart of a multi-modal fusion method according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a multi-modal fusion device according to one embodiment of the invention;
FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
The objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
It should be noted that, in the embodiments of the present invention, all directional indicators (such as up, down, left, right, front, and back) are used only to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if that posture changes, the directional indication changes accordingly. A connection may be a direct connection or an indirect connection.
The term "and/or" herein merely describes an association between objects, indicating that three relationships are possible; for example, "A and/or B" may mean: A alone, A and B together, or B alone.
Furthermore, descriptions referring to "first", "second", and the like are for descriptive purposes only and are not to be construed as indicating or implying relative importance or an order among the indicated technical features; a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
Referring to FIG. 1, the present invention proposes a multi-modal fusion method, comprising:
S1: acquiring a plurality of modalities to be fused;
S2: encoding each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
S3: preprocessing each feature vector to obtain its target feature vector, where the preprocessing increases or decreases the dimension of the feature vector;
S4: setting a plurality of weight matrices for each target feature vector according to its dimension, where the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
S5: multiplying each target feature vector by its corresponding weight matrices to obtain a plurality of temporary vectors equal in number to the weight matrices;
S6: adding the temporary vectors corresponding to each target feature vector element by element to obtain a modal vector whose number of columns is the preset value;
S7: performing a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
As described in step S1, a plurality of modalities to be fused are acquired. A modality is a source or form of information; for example, one piece of information may be represented in forms such as video, speech, image, and text. The modalities here may take the same form or different forms, and they may be acquired by crawler technology or by manual input.
As described in step S2, each modality to be fused is encoded by an encoder to obtain the feature vector corresponding to that modality, and different modalities to be fused use different encoders. For example, if a multi-modal data sample has three modalities to be fused, namely a text modality, an image modality, and a speech modality, the three modalities of the sample can be encoded by a text encoder, an image encoder, and a speech encoder respectively to obtain the feature vector of each modality. In the field of electronic commerce, data are distributed across various departments, so fusion of the data is required.
As described in step S3, each feature vector is preprocessed to obtain its target feature vector, the preprocessing being to increase or decrease the dimension of the feature vector. In a specific embodiment, the added or removed dimension acts as a bias on the feature vector, which improves the subsequent fusion and retains as much data information as possible.
As described in step S4, a plurality of weight matrices are set for each target feature vector according to the dimension of that vector. It should be noted that these weight matrices are distinct from one another: the values of their elements may differ.
As described in step S5, each target feature vector is multiplied by its corresponding weight matrices to obtain a plurality of temporary vectors. Multiplying the target feature vector by several weight matrices lets the temporary vectors preserve as much data information as possible. Since the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is the preset value, each resulting temporary vector has the preset number of columns.
As described in step S6, the temporary vectors corresponding to each target feature vector are added element by element, that is, the values at the same positions are summed, to obtain the modal vector of that target feature vector, whose number of columns is the preset value. The target feature vector's information is thus fused while more data information is retained.
As described in step S7, a fusion operation is performed on the modal vectors to obtain the total vector corresponding to the plurality of modalities. The fusion operation may be a vector inner product operation, i.e. element-wise multiplication: multiplying the modal vectors element by element yields a vector whose number of columns is the preset value and which represents the total vector fused from the plurality of modalities.
In a specific embodiment, modality A passes through encoder A to obtain the feature vector of modality A, for example a 1×512 vector;
modality B passes through encoder B to obtain the feature vector of modality B, for example a 1×256 vector;
modality C passes through encoder C to obtain the feature vector of modality C, for example a 1×64 vector.
Then a dimension with scalar value 1 is appended at the last position of each feature vector, namely:
the target feature vector of modality A is 1×513;
the target feature vector of modality B is 1×257;
the target feature vector of modality C is 1×65.
The low rank r=4 is set manually, and the desired feature dimension after multi-modal fusion is h=128.
For modality A, modality B, and modality C respectively, r (r=4) learnable weight matrices are randomly initialized: each Wa is 513×128, each Wb is 257×128, and each Wc is 65×128. That is:
Wa is 4 matrices of 513×128;
Wb is 4 matrices of 257×128;
Wc is 4 matrices of 65×128.
The target feature vector of modality A is multiplied by the 4 Wa matrices, that of modality B by the 4 Wb matrices, and that of modality C by the 4 Wc matrices, namely:
fa = (target feature vector of modality A) × Wa, giving 4 vectors of 1×128;
fb = (target feature vector of modality B) × Wb, giving 4 vectors of 1×128;
fc = (target feature vector of modality C) × Wc, giving 4 vectors of 1×128.
For each modality the 4 vectors are added element by element, so that fa, fb, and fc each become a single 1×128 vector, and then f_total = fa × fb × fc, where × denotes the vector inner product operation, i.e. element-wise multiplication. The resulting f_total is a 1×128 vector representing the total vector fused from the three modalities.
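The following is a minimal NumPy sketch of this embodiment. It assumes randomly initialized weight matrices in place of learned ones, and the names (features, modal_vectors, f_total) are illustrative rather than taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoded feature vectors for the three modalities, with the dimensions
# used in the example above (1x512, 1x256, 1x64); random values stand in
# for real encoder outputs.
features = [rng.normal(size=(1, d)) for d in (512, 256, 64)]

r = 4    # manually set low rank
h = 128  # desired feature dimension after multi-modal fusion

modal_vectors = []
for x in features:
    # Preprocess: append a dimension with scalar 1 (1x512 -> 1x513, etc.).
    x1 = np.concatenate([x, np.ones((1, 1))], axis=1)
    # r weight matrices, each (d+1) x h; random initialization stands in
    # for training.
    Ws = [rng.normal(scale=0.1, size=(x1.shape[1], h)) for _ in range(r)]
    # Multiply by each weight matrix (r temporary vectors of shape 1 x h),
    # then add them element by element to get the modal vector.
    modal_vectors.append(sum(x1 @ W for W in Ws))

# Fusion: element-wise product of the modal vectors gives the total vector.
f_total = modal_vectors[0] * modal_vectors[1] * modal_vectors[2]
print(f_total.shape)  # (1, 128)
```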
A plurality of modalities are thus encoded to obtain their feature vectors, the feature vectors are preprocessed to obtain the corresponding target feature vectors, a plurality of weight matrices are set for each target feature vector according to its dimension and multiplied with it to obtain temporary vectors, the temporary vectors are added element by element to obtain the modal vectors, and a fusion operation yields the total vector corresponding to the plurality of modalities; in this way more modal information is retained and the fusion effect of the final total vector is better.
In one embodiment, step S3 of preprocessing each feature vector to obtain its target feature vector includes:
S301: appending a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
As described in step S301, appending a dimension with scalar value 1 at the last position of the feature vector widens it, so that more of the original feature vector's data information is retained when it is multiplied with the weight matrices.
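One way to see why the appended 1 helps, as a small sketch under the assumption of a single (d+1)×h weight matrix: the extra dimension makes the last row of the weight matrix act as a learnable bias term added to the product.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(1, 4))  # original feature vector
W = rng.normal(size=(5, 3))  # (d+1) x h weight matrix

x1 = np.concatenate([x, np.ones((1, 1))], axis=1)  # append scalar 1
# The appended dimension turns the last row of W into a bias term:
assert np.allclose(x1 @ W, x @ W[:-1] + W[-1])
```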
In one embodiment, step S7 of performing the fusion operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities includes:
S701: performing a vector inner product operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities.
As described in step S701, the vector inner product operation is element-wise multiplication, and the resulting vector represents the total vector fused from the plurality of modalities. The total vector obtained after multiplication has the same dimension as each modal vector, the modal vectors themselves all having the same dimension.
In one embodiment, step S7 of performing the fusion operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities includes:
S711: splicing the modal vectors to obtain a spliced vector;
S712: inputting the spliced vector into a fully connected layer and multiplying it by the layer's weight matrix of size (n·m)×m to obtain the total vector, where n is the number of modal vectors and m is the preset value.
As described in steps S711-S712, the modal vectors are spliced into one vector, which is input into a fully connected layer and multiplied by the (n·m)×m weight matrix to obtain the total vector (see the sketch below). The generated modal vectors already carry complete data information, so they can be spliced and then multiplied by the weight matrix to obtain the total vector. This provides an alternative fusion method; in subsequent use, the user can select the better of the two fusion methods based on their loss values. It should be noted that the modal vectors may specifically come from data such as video and audio generated in the electronic commerce field; such data are distributed across departments, which is why they need to be fused.
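A minimal NumPy sketch of this alternative fusion. The (n·m)×m weight shape is inferred from the stated dimensions, and the random weights are placeholders for the learned fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 128  # n modal vectors, preset value m

# Modal vectors as produced by the element-wise additions above.
modal_vectors = [rng.normal(size=(1, m)) for _ in range(n)]

spliced = np.concatenate(modal_vectors, axis=1)  # 1 x (n*m) spliced vector
W_fc = rng.normal(scale=0.1, size=(n * m, m))    # fully connected weight
f_total = spliced @ W_fc                         # 1 x m total vector
print(f_total.shape)  # (1, 128)
```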
In one embodiment, step S2 of encoding each modality to be fused through an encoder to obtain the feature vector corresponding to each modality includes:
S201: acquiring the expression form of each modality, where the expression forms include at least text, image, and speech;
S202: setting a corresponding encoder according to the expression form of each modality;
S203: encoding each modality with its corresponding encoder to obtain the feature vector corresponding to that modality.
As described in steps S201-S203, a plurality of multi-modal data samples are acquired. For each acquired multi-modal data sample, the computer device can adopt a suitable encoding scheme for each modality in the sample and convert each modality into its corresponding feature vector. For example, if a multi-modal data sample has three modalities, namely a text modality, an image modality, and a speech modality, the three modalities can be encoded by a text encoder, an image encoder, and a speech encoder respectively to obtain the feature vectors of the sample's three modalities. In electronic commerce, data such as video and audio generated in each business area are distributed across departments, so the data need to be fused and can be vectorized, that is, encoded, separately.
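A sketch of encoder selection by expression form. The encoder functions are placeholders (the patent does not specify particular encoder architectures); their output dimensions follow the earlier example.

```python
from typing import Callable, Dict

import numpy as np

# Placeholder encoders; real deployments would use trained text, image,
# and speech networks here.
def encode_text(raw) -> np.ndarray:
    return np.zeros((1, 512))

def encode_image(raw) -> np.ndarray:
    return np.zeros((1, 256))

def encode_speech(raw) -> np.ndarray:
    return np.zeros((1, 64))

ENCODERS: Dict[str, Callable] = {
    "text": encode_text,
    "image": encode_image,
    "speech": encode_speech,
}

def encode_modality(expression_form: str, raw) -> np.ndarray:
    # Select the encoder matching the modality's expression form (S202-S203).
    return ENCODERS[expression_form](raw)

feature_vector = encode_modality("text", "some product description")
print(feature_vector.shape)  # (1, 512)
```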
In one embodiment, after step S7 of performing the fusion operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities, the method further includes:
S801: acquiring multi-modal data samples, where the multi-modal data samples comprise a plurality of total vectors and the corresponding actual recognition results;
S802: inputting each total vector into a preset neural network model for recognition to obtain a predicted recognition result;
S803: calculating a loss function for each multi-modal data sample from the actual recognition result and the predicted recognition result;
S804: updating parameters of the neural network model and/or updating the generated weight matrices using the loss function of each multi-modal data sample through a preset parameter adjustment strategy.
As described in steps S801-S802, multi-modal data samples comprising a plurality of total vectors and the corresponding actual recognition results are acquired, and each total vector is input into the preset neural network model for recognition, yielding the model's predicted recognition result.
As described in steps S803-S804, the loss function of each multi-modal data sample is calculated from the actual recognition result and the predicted recognition result, and the loss is used, via a preset parameter adjustment strategy, to update the parameters of the neural network model and/or the generated weight matrices. Any loss function may be used; this application does not limit the choice. Updating in this way improves the generation of the total vector and the recognition effect of the model. Data such as video and audio generated in the electronic commerce field are distributed across departments, so they need to be fused; the fused whole forms the total vector for the e-commerce data, and the neural network model is then updated, which makes it convenient for the model to predict results from the data of each department.
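A hedged PyTorch sketch of steps S801-S804, assuming a single-sample classification step: the linear recognition model, cross-entropy loss, Adam as the parameter adjustment strategy, and the dimensions are illustrative choices, not fixed by the patent.

```python
import torch
import torch.nn as nn

r, h, num_classes = 4, 128, 10
dims = [513, 257, 65]  # target-feature-vector widths from the example

# Learnable fusion weights: for each modality, r matrices of (d+1) x h,
# stored as one (r, d+1, h) parameter per modality.
fusion_weights = nn.ParameterList(
    [nn.Parameter(0.1 * torch.randn(r, d, h)) for d in dims]
)
model = nn.Linear(h, num_classes)  # stand-in for the preset recognition model

# The optimizer updates both the model parameters and the weight matrices.
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(fusion_weights.parameters()), lr=1e-3
)

def fuse(target_vectors):
    # Multiply each 1 x (d+1) target feature vector by its r weight matrices,
    # sum the r temporary vectors element by element, then fuse the modal
    # vectors by element-wise multiplication.
    modal = [(x @ W).sum(dim=0) for x, W in zip(target_vectors, fusion_weights)]
    total = modal[0]
    for v in modal[1:]:
        total = total * v
    return total  # 1 x h total vector

# One training step on a dummy sample with ground-truth label y.
xs = [torch.randn(1, d) for d in dims]
y = torch.tensor([3])
loss = nn.functional.cross_entropy(model(fuse(xs)), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```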
Referring to FIG. 2, the present invention provides a multi-modal fusion device, comprising:
an acquisition module 10, configured to acquire a plurality of modalities to be fused;
an encoding module 20, configured to encode each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
a preprocessing module 30, configured to preprocess each feature vector to obtain its target feature vector, where the preprocessing increases or decreases the dimension of the feature vector;
a setting module 40, configured to set a plurality of weight matrices for each target feature vector according to its dimension, where the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
a multiplication module 50, configured to multiply each target feature vector by its corresponding weight matrices to obtain a plurality of temporary vectors equal in number to the weight matrices;
an addition module 60, configured to add the temporary vectors corresponding to each target feature vector element by element to obtain a modal vector whose number of columns is the preset value;
and a fusion module 70, configured to perform a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
In one embodiment, the preprocessing module 30 includes:
a preprocessing sub-module, configured to append a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
In one embodiment, the fusion module 70 includes:
a first fusion sub-module, configured to perform a vector inner product operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities.
In one embodiment, the fusion module 70 includes:
a splicing sub-module, configured to splice the modal vectors to obtain a spliced vector;
and an input sub-module, configured to input the spliced vector into the fully connected layer and multiply it by the (n·m)×m weight matrix to obtain the total vector, where n is the number of modal vectors and m is the preset value.
In one embodiment, the encoding module 20 includes:
an expression form acquisition sub-module, configured to acquire the expression form of each modality, where the expression forms include at least text, image, and speech;
an encoder setting sub-module, configured to set a corresponding encoder according to the expression form of each modality;
and an encoding sub-module, configured to encode each modality with its corresponding encoder to obtain the feature vector corresponding to that modality.
Referring to FIG. 3, an embodiment of the present application further provides a computer device, which may be a server; its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, with the processor providing computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program. The database of the computer device is used to store various feature vectors and the like, and the network interface is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements the multi-modal fusion method of any of the embodiments described above.
Those skilled in the art will appreciate that the architecture shown in FIG. 3 is merely a block diagram of the portion of the architecture relevant to the present application and does not limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-modal fusion method of any of the embodiments described above.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by instructing the relevant hardware through a computer program stored on a non-volatile computer-readable storage medium; when executed, the program may include the flows of the method embodiments described above. Any reference to memory, storage, database, or other media provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, as used herein, the terms "comprises", "comprising", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to it. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises that element.
The embodiments of the present application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall be included in the scope of the claims of the present invention.
Claims (10)
1. A method of multi-modal fusion, comprising:
acquiring a plurality of modalities to be fused;
encoding each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
preprocessing each feature vector to obtain a target feature vector of that feature vector, wherein the preprocessing increases or decreases the dimension of the feature vector;
setting a plurality of weight matrices for each target feature vector according to the dimension corresponding to that target feature vector, wherein the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
multiplying each target feature vector by its corresponding weight matrices to obtain, for each target feature vector, a plurality of temporary vectors equal in number to the weight matrices;
adding the temporary vectors corresponding to each target feature vector element by element to obtain, for each target feature vector, a modal vector whose number of columns is the preset value;
and performing a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
2. The multi-modal fusion method according to claim 1, wherein the step of preprocessing each feature vector to obtain its target feature vector comprises:
appending a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
3. The multi-modal fusion method according to claim 1, wherein the step of fusing the modal vectors to obtain a total vector corresponding to the plurality of modalities comprises:
performing a vector inner product operation on the modal vectors to obtain the total vector corresponding to the plurality of modalities.
4. The multi-modal fusion method according to claim 1, wherein the step of fusing the modal vectors to obtain a total vector corresponding to the plurality of modalities comprises:
splicing the modal vectors to obtain a spliced vector;
and inputting the spliced vector into a fully connected layer and multiplying it by the layer's weight matrix of size (n·m)×m to obtain the total vector, wherein n is the number of modal vectors and m is the preset value.
5. The multi-modal fusion method according to claim 1, wherein the step of encoding each modality to be fused through an encoder to obtain the feature vector corresponding to each modality to be fused comprises:
acquiring the expression form of each modality, wherein the expression forms include at least text, image, and speech;
setting a corresponding encoder according to the expression form of each modality;
and encoding each modality with its corresponding encoder to obtain the feature vector corresponding to that modality.
6. The multi-modal fusion method according to claim 1, further comprising, after the step of fusing the modal vectors to obtain the total vector corresponding to the plurality of modalities:
acquiring multi-modal data samples, wherein the multi-modal data samples comprise a plurality of total vectors and corresponding actual recognition results;
inputting each total vector into a preset neural network model for recognition to obtain a predicted recognition result;
calculating a loss function for each multi-modal data sample from the actual recognition result and the predicted recognition result;
and updating parameters of the neural network model and/or updating the generated weight matrices using the loss function of each multi-modal data sample through a preset parameter adjustment strategy.
7. A multi-modal fusion device, comprising:
an acquisition module, configured to acquire a plurality of modalities to be fused;
an encoding module, configured to encode each modality to be fused through an encoder to obtain a feature vector corresponding to each modality to be fused;
a preprocessing module, configured to preprocess each feature vector to obtain its target feature vector, wherein the preprocessing increases or decreases the dimension of the feature vector;
a setting module, configured to set a plurality of weight matrices for each target feature vector according to the dimension corresponding to that target feature vector, wherein the number of rows of each weight matrix equals the number of columns of the target feature vector and the number of columns of each weight matrix is a preset value;
a multiplication module, configured to multiply each target feature vector by its corresponding weight matrices to obtain a plurality of temporary vectors equal in number to the weight matrices;
an addition module, configured to add the temporary vectors corresponding to each target feature vector element by element to obtain a modal vector whose number of columns is the preset value;
and a fusion module, configured to perform a fusion operation on the modal vectors to obtain a total vector corresponding to the plurality of modalities.
8. The multi-modal fusion device according to claim 7, wherein the preprocessing module comprises:
a preprocessing sub-module, configured to append a dimension with scalar value 1 at the last position of the feature vector to obtain the target feature vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310145041.9A CN116204848A (en) | 2023-02-17 | 2023-02-17 | Multi-mode fusion method, device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310145041.9A CN116204848A (en) | 2023-02-17 | 2023-02-17 | Multi-mode fusion method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116204848A (en) | 2023-06-02
Family
ID=86510798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310145041.9A Pending CN116204848A (en) | 2023-02-17 | 2023-02-17 | Multi-mode fusion method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116204848A (en) |
- 2023-02-17: Application CN202310145041.9A filed in China; publication CN116204848A active, status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |