CN117113270A - Knowledge fusion multi-modal interaction method and device based on an improved alignment method

Knowledge fusion multi-modal interaction method and device based on an improved alignment method

Info

Publication number
CN117113270A
Authority
CN
China
Prior art keywords
visual
text
model
transformer
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310977144.1A
Other languages
Chinese (zh)
Inventor
胡建国
黄文俊
吴劲
林冰胜
李启文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Original Assignee
Development Research Institute Of Guangzhou Smart City
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Development Research Institute Of Guangzhou Smart City, Sun Yat Sen University filed Critical Development Research Institute Of Guangzhou Smart City
Priority to CN202310977144.1A
Publication of CN117113270A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G06F18/253 Fusion techniques of extracted features
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a knowledge fusion multi-modal interaction method and device based on an improved alignment method. The method comprises: acquiring multi-modal data; constructing a query transformation model; improving the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer; training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer; freezing the target visual transformer, and performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model; connecting the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model; and inputting the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result. Because the knowledge fusion multi-modal interaction model aligns the visual representation with the text representation, the mutual information between them can be maximized, and the model can be widely applied in the technical field of artificial intelligence.

Description

Knowledge fusion multi-modal interaction method and device based on an improved alignment method
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a knowledge fusion multi-modal interaction method and device based on an improved alignment method.
Background
In recent years, the fields of computer vision and natural language processing have each developed rapidly. Many practical problems are inherently multi-modal, that is, they involve several different forms of data at the same time, such as images and text. In the related art, most human-computer interaction methods interact only through text or speech; they cannot process images in combination with multi-modal data such as text, and their interaction efficiency is poor. In view of this, the technical problems in the related art need to be solved.
Disclosure of Invention
In view of this, the embodiments of the invention provide a knowledge fusion multi-modal interaction method and device based on an improved alignment method, so as to improve human-computer interaction efficiency.
In one aspect, the present invention provides a knowledge fusion multi-modal interaction method based on an improved alignment method, the multi-modal interaction method comprising:
acquiring multi-modal data;
constructing a query transformation model, wherein the query transformation model comprises a visual encoder, a visual transformer and a text transformer;
improving the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
freezing the target visual transformer, and performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
connecting the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
and inputting the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
Optionally, the constructing a query transformation model comprises:
freezing the visual encoder, and connecting the frozen visual encoder with the visual transformer;
and interacting the visual transformer with the text transformer through a shared self-attention layer to construct the query transformation model.
Optionally, the improving the visual transformer according to the multi-stream feature extraction method to obtain a multi-scale visual transformer comprises:
obtaining a plurality of video streams from the output of the visual encoder;
and adding a residual network to the visual transformer to obtain the multi-scale visual transformer, wherein the residual network is used for performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features, and comprises a residual block, a spatial pyramid pooling layer and a fully connected layer.
Optionally, the performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features comprises:
performing convolution and identity mapping on each of the plurality of video streams through the residual block to obtain a set of mapping vectors;
extracting local information from the set of mapping vectors at different scales through the spatial pyramid pooling layer to obtain a set of video stream feature vectors;
concatenating the set of video stream feature vectors to obtain a concatenated data set;
and mapping the concatenated data set through the fully connected layer to obtain the multi-scale features.
Optionally, the training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer comprises:
acquiring training data, wherein the training data comprises the multi-scale features and text features;
concatenating and fusing the training data to obtain an input feature matrix;
performing multi-head self-attention on the input feature matrix through a self-attention layer to obtain an output feature matrix;
performing feature extraction on the output feature matrix through a feedforward neural network to obtain spliced features;
performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features;
and updating the fusion features as training data, and returning to the step of concatenating and fusing the training data to obtain an input feature matrix, until the number of iterations reaches a preset threshold, to obtain the target visual transformer.
Optionally, the performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features comprises:
performing word segmentation on the training data through the word segmentation model to obtain word segmentation features;
projecting the word segmentation features through a convolutional neural network to obtain feature tokens;
modeling the relations between the feature tokens through a self-attention mechanism to obtain feature representations;
and splicing the feature representations with the spliced features to obtain the fusion features.
Optionally, the performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model comprises:
obtaining sample data, the sample data comprising a visual representation and a text representation;
extracting the sample data through the target visual transformer to obtain a query representation;
aligning the query representation with the text representation to obtain a text similarity;
and updating parameters of the text transformer according to the text similarity to obtain the target query transformation model.
In another aspect, an embodiment of the present invention further provides a multi-modal interaction device based on an improved alignment method, where the device comprises:
a first module, configured to acquire multi-modal data;
a second module, configured to construct a query transformation model, the query transformation model comprising a visual encoder, a visual transformer and a text transformer;
a third module, configured to improve the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
a fourth module, configured to train the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
a fifth module, configured to freeze the target visual transformer, and to perform visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
a sixth module, configured to connect the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
and a seventh module, configured to input the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
In another aspect, an embodiment of the invention further discloses an electronic device, comprising a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
In another aspect, embodiments of the present invention also disclose a computer readable storage medium storing a program for execution by a processor to implement a method as described above.
In another aspect, embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
Compared with the prior art, the technical solution provided by the present application has the following technical effects: according to the embodiments of the present application, the visual transformer is improved according to the multi-stream feature extraction method to obtain the multi-scale visual transformer, so that different types of features can be extracted from video frames; in addition, the target visual transformer is used to perform visual-text alignment training on the query transformation model, so that the multi-modal alignment effect on video can be improved; furthermore, the knowledge fusion multi-modal interaction model is obtained by connecting the target query transformation model with the language model, so that the multi-modal interaction capability on video can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a knowledge fusion multimodal interaction method based on an improved alignment method provided by an embodiment of the application;
FIG. 2 is a schematic diagram of a query transformation model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a multi-scale visual transformer according to an embodiment of the present application;
FIG. 4 is a training schematic diagram of a multi-scale visual transformer according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a knowledge fusion multi-modal interaction device based on an improved alignment method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a knowledge fusion multi-modal interaction method, which can be applied to a terminal, a server, or software running in the terminal or the server. The terminal may be, but is not limited to, a tablet computer, a notebook computer, a desktop computer, or the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to FIG. 1, an embodiment of the present invention provides a knowledge fusion multi-modal interaction method based on an improved alignment method, where the multi-modal interaction method includes:
S101, acquiring multi-modal data;
S102, constructing a query transformation model, wherein the query transformation model comprises a visual encoder, a visual transformer and a text transformer;
S103, improving the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
S104, training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
S105, freezing the target visual transformer, and performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
S106, connecting the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
S107, inputting the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
In this embodiment of the invention, the multi-modal interaction method can be applied to a terminal for human-computer interaction. For example, a target object inputs multi-modal data on a computer, the terminal performs the corresponding processing through the multi-modal interaction method of this embodiment, and a corresponding interaction result is output; the interaction result may be a corresponding answer, suggestion, or the like. The embodiment first acquires multi-modal data, which includes images, text, video and other data uploaded or input by the target object. The embodiment mainly analyzes and processes the multi-modal data uploaded by the target object through a knowledge fusion multi-modal interaction model. Specifically, a query transformation model is constructed, which comprises a visual encoder, a visual transformer and a text transformer. In order to process multi-stream data, the visual transformer is improved by a multi-stream feature extraction method to obtain a multi-scale visual transformer, which can simultaneously consider several different temporal and spatial scales of the video input. The multi-scale visual transformer is then trained according to a joint text method to obtain the target visual transformer. Next, the target visual transformer is frozen, and visual-text alignment training is performed on the query transformation model according to the target visual transformer to obtain the target query transformation model. In the visual-text alignment learning task, the visual and text representations need to be aligned so as to maximize the mutual information between them. The target visual transformer is fixed and not trained; compared with training the visual transformer and the text transformer simultaneously in an end-to-end model, this reduces model parameters and computation and makes more efficient use of the already trained visual transformer. Finally, the target query transformation model is connected with a language model to obtain the knowledge fusion multi-modal interaction model, and the multi-modal data are input into the knowledge fusion multi-modal interaction model to obtain the interaction result. For example, the target object uploads a video segment to the terminal and inputs corresponding text content; the knowledge fusion multi-modal interaction model of this embodiment analyzes the video and text content and outputs a corresponding interaction result.
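For illustration, the freezing strategy just described (step S105) can be shown with a short PyTorch sketch. The two modules below are generic stand-ins rather than the patent's actual networks, and the optimizer settings are assumptions; the point is simply that the frozen visual transformer contributes no trainable parameters, while the text branch does.

```python
import torch
import torch.nn as nn

# Generic stand-ins for the target visual transformer and the text transformer.
visual_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
text_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)

# Step S105: freeze the target visual transformer so it is not updated.
for p in visual_transformer.parameters():
    p.requires_grad = False

# Only parameters that still require gradients (here, the text branch) are optimized.
optimizer = torch.optim.AdamW(
    (p for p in text_transformer.parameters() if p.requires_grad), lr=1e-4)
```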
It should be noted that, in the specific embodiments of the present application, whenever processing needs to be performed on data related to the identity or characteristics of the target object, such as information about the target object, behavior data of the target object, historical data of the target object and position information of the target object, the permission or consent of the target object is obtained first, and the collection, use and processing of such data comply with the relevant laws, regulations and standards. In addition, when an embodiment of the present application needs to acquire sensitive information of the target object, the separate permission or separate consent of the target object is obtained through a pop-up window, a jump to a confirmation page, or the like; only after the separate permission or separate consent of the target object is explicitly obtained are the target-object-related data necessary for the normal operation of the embodiment acquired.
Further, as an alternative embodiment, the constructing a query transformation model includes:
freezing the visual encoder, and connecting the frozen visual encoder with the visual transformer;
and interacting the visual transformer with the text transformer through a shared self-attention layer to construct the query transformation model.
In this embodiment of the invention, referring to FIG. 2, the query transformation model includes a visual encoder and a Q-Former model. The Q-Former is a trainable module that bridges the gap between a frozen visual encoder and a frozen language model (LLM), and it extracts a fixed number of output features from the visual encoder, independent of the input image resolution. The Q-Former consists of two transformer sub-modules, namely a visual transformer and a text transformer, which share the same self-attention layer. The image transformer, which interacts with the frozen visual encoder, is used for visual feature extraction and comprises a self-attention layer, a cross-attention layer and a feedforward layer. The visual encoder is frozen, and the frozen visual encoder is connected with the visual transformer. The other sub-module is a text transformer, which can serve as both a text encoder and a text decoder and comprises a self-attention layer and a feedforward layer. The visual transformer interacts with the text transformer through the shared self-attention layer, thereby constructing the query transformation model. The embodiment creates a set of learnable embeddings as the input query embeddings of the visual transformer. The queries interact with each other through the self-attention layer and interact with the frozen image features through the cross-attention layer (inserted into each transformer block). Depending on the pre-training task, different self-attention masks are used to control the query-text interaction. This bottleneck architecture, together with the pre-training objectives of this embodiment, forces the queries to extract the visual information most relevant to the text.
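The structure just described (learnable queries, a self-attention layer shared between the visual and text branches, and cross-attention into the frozen image features) can be sketched as a single illustrative block in PyTorch. The dimensions, the number of queries, and the omission of residual connections and layer normalization are simplifying assumptions; this is not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One illustrative block: queries and text share self-attention; only the
    query (visual) branch cross-attends to the frozen image features.
    Residual connections and LayerNorm are omitted for brevity."""
    def __init__(self, dim=768, heads=12, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.shared_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_visual = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frozen_image_feats, text_embeds, attn_mask=None):
        b = frozen_image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)          # (B, Nq, D)
        x = torch.cat([q, text_embeds], dim=1)                   # queries and text in one sequence
        x, _ = self.shared_self_attn(x, x, x, attn_mask=attn_mask)
        nq = q.size(1)
        q_out, t_out = x[:, :nq], x[:, nq:]
        q_out, _ = self.cross_attn(q_out, frozen_image_feats, frozen_image_feats)
        return self.ffn_visual(q_out), self.ffn_text(t_out)

# usage: patch features from a frozen image encoder and token embeddings of a caption
img = torch.randn(2, 196, 768)
txt = torch.randn(2, 20, 768)
query_states, text_states = QFormerBlock()(img, txt)             # (2, 32, 768), (2, 20, 768)
```

An attention mask passed as attn_mask would play the role of the self-attention masks mentioned above, controlling which query and text positions may attend to each other for a given pre-training task.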
Further, as an optional implementation, the improving the visual transformer according to the multi-stream feature extraction method to obtain a multi-scale visual transformer includes:
obtaining a plurality of video streams from the output of the visual encoder;
and adding a residual network to the visual transformer to obtain the multi-scale visual transformer, wherein the residual network is used for performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features, and comprises a residual block, a spatial pyramid pooling layer and a fully connected layer.
In this embodiment of the invention, the visual transformer is improved, and its construction follows the idea of multi-scale feature extraction, so that several different temporal and spatial scales of the video input can be considered simultaneously. Specifically, the embodiment uses a multi-stream approach, where each stream performs feature extraction at a different spatial and temporal scale; the features are then concatenated and processed together with the text/question input. The multi-scale visual transformer is obtained mainly by adding a residual network to the visual transformer, where the residual network is used to perform multi-stream feature extraction on the video streams to obtain multi-scale features and comprises a residual block, a spatial pyramid pooling layer and a fully connected layer. In particular, the embodiment uses an X3D network as the base model, which has been shown to perform excellently on many video question answering (VideoQA) tasks. Referring to FIG. 3, the embodiment uses a residual network (ResNet) to extract features for each of three video streams of different scales, so as to obtain feature representations of multiple visual modalities.
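As a purely illustrative sketch of the multi-stream idea, the snippet below derives three spatio-temporal scales from a single clip tensor. The concrete scale factors are assumptions, and in the description above the streams are obtained from the output of the visual encoder rather than from the raw clip.

```python
import torch
import torch.nn.functional as F

def make_streams(video):
    """video: (B, C, T, H, W); return three streams at full, half and quarter scale."""
    half = F.interpolate(video, scale_factor=(0.5, 0.5, 0.5),
                         mode="trilinear", align_corners=False)
    coarse = F.interpolate(video, scale_factor=(0.25, 0.25, 0.25),
                           mode="trilinear", align_corners=False)
    return [video, half, coarse]

streams = make_streams(torch.randn(2, 3, 16, 224, 224))
print([tuple(s.shape) for s in streams])   # (2,3,16,224,224), (2,3,8,112,112), (2,3,4,56,56)
```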
Further, as an optional implementation, the performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features includes:
performing convolution and identity mapping on each of the plurality of video streams through the residual block to obtain a set of mapping vectors;
extracting local information from the set of mapping vectors at different scales through the spatial pyramid pooling layer to obtain a set of video stream feature vectors;
concatenating the set of video stream feature vectors to obtain a concatenated data set;
and mapping the concatenated data set through the fully connected layer to obtain the multi-scale features.
In this embodiment of the invention, a set of residual blocks is used, where each block comprises a 3x3 convolution kernel and an identity mapping. The residual block performs convolution and identity mapping on each video stream to obtain a set of mapping vectors. This process can be expressed by the following formula:
f_{vi} = ResNet(x_{vi})
where x_{vi} denotes the input of the i-th video stream, i.e., the multi-frame video in that stream. After each residual block, a spatial pyramid pooling layer is added; local information is extracted from the set of mapping vectors at different scales through this layer, so that a set of video stream feature vectors is obtained and local information is captured at different scales. Finally, the features generated by the individual streams are concatenated to obtain a concatenated data set, which is mapped to a fixed-length vector f_v by a fully connected layer:
f_v = FCN(concat(f_{v1}, f_{v2}, f_{v3}))
where concat denotes concatenating the three feature sets together and FCN denotes the fully connected layer used. The concatenated data set is mapped through the fully connected layer to obtain the multi-scale features, where f_v is the multi-scale feature of the entire video segment.
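The per-stream pipeline formalized above (residual block, spatial pyramid pooling at several scales, concatenation across streams, and a fully connected projection to the fixed-length feature f_v) can be sketched as follows. Channel sizes, pyramid levels, the output dimension, and the reduction of each stream to 2-D feature maps are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                        # identity mapping + 3x3 convolutions
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

def spatial_pyramid_pool(x, levels=(1, 2, 4)):   # x: (B, C, H, W)
    pooled = [F.adaptive_avg_pool2d(x, l).flatten(1) for l in levels]
    return torch.cat(pooled, dim=1)              # (B, C * sum(l*l))

class MultiStreamExtractor(nn.Module):
    def __init__(self, channels=64, levels=(1, 2, 4), num_streams=3, out_dim=768):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(channels) for _ in range(num_streams))
        self.levels = levels
        spp_dim = channels * sum(l * l for l in levels)
        self.fc = nn.Linear(num_streams * spp_dim, out_dim)      # the FCN in the formula

    def forward(self, streams):                   # list of (B, C, H_i, W_i) tensors
        feats = [spatial_pyramid_pool(blk(s), self.levels)
                 for blk, s in zip(self.blocks, streams)]
        return self.fc(torch.cat(feats, dim=1))   # f_v = FCN(concat(f_v1, f_v2, f_v3))

# usage: three streams at different spatial resolutions (frames already pooled over time)
streams = [torch.randn(2, 64, 56, 56), torch.randn(2, 64, 28, 28), torch.randn(2, 64, 14, 14)]
f_v = MultiStreamExtractor()(streams)             # (2, 768)
```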
Further, as an optional implementation, the training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer includes:
acquiring training data, wherein the training data comprises the multi-scale features and text features;
concatenating and fusing the training data to obtain an input feature matrix;
performing multi-head self-attention on the input feature matrix through a self-attention layer to obtain an output feature matrix;
performing feature extraction on the output feature matrix through a feedforward neural network to obtain spliced features;
performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features;
and updating the fusion features as training data, and returning to the step of concatenating and fusing the training data to obtain an input feature matrix, until the number of iterations reaches a preset threshold, to obtain the target visual transformer.
In this embodiment of the invention, referring to FIG. 4, the multi-scale visual transformer is trained and further optimized according to the joint text method. First, training data are acquired, where the training data comprise the multi-scale features and the text features. The training data are then concatenated and fused to obtain an input feature matrix: specifically, the feature f_v generated by the video encoder and the embedding f_t of the text/question features are fused through a concatenation operation to obtain an input feature matrix X_0 of dimension (M+N) x D:
X_0 = concat(f_v, f_t)
where M denotes the number of video features, N denotes the number of text features, and D denotes the feature dimension.
Next, the embodiment uses a multi-layer transformer model to further process the feature matrix X_0. In the i-th transformer layer, a self-attention mechanism is applied to the input feature matrix X_{i-1} to capture the relations between different features. Multi-head self-attention is performed on the input feature matrix through the self-attention layer to obtain the output feature matrix Z_i, which can be expressed as:
Z_i = MultiHead(X_{i-1})
where MultiHead denotes the multi-head attention mechanism, which divides the input matrix into several sub-matrices and applies an attention mechanism to each sub-matrix, resulting in a plurality of feature maps. If the input matrix is divided into K sub-matrices, MultiHead can be expressed as:
MultiHead(X) = concat(head_1, ..., head_K)
where head_k denotes the feature map obtained after applying the attention mechanism to the k-th sub-matrix. Next, feature extraction is performed on the output feature matrix through a feedforward neural network to obtain the spliced features; specifically, the output feature matrix Z_i is further processed by a feedforward network (FFN) to extract more useful information. Denoting the output of this feedforward network by H_i, it can be expressed as:
H_i = FFN(Z_i)
where FFN denotes the feedforward network, which consists of two fully connected layers with an activation function and regularization operations between them. Finally, the embodiment performs multi-modal fusion on the spliced features in combination with a word segmentation model, generating informative tokens with a token learner (TokenLearner) in each transformer layer. In the i-th transformer layer, the input feature matrix X_{i-1} is first fed to the video part and the text/question part respectively, yielding two sets of informative tokens, denoted here T_i^v and T_i^t. These tokens are then spliced together and further processed through a fully connected layer to yield the informative token T_i of dimension D:
T_i = FCN_2(concat(T_i^v, T_i^t))
where FCN_2 denotes a network comprising two fully connected layers, which maps the spliced tokens to a fixed-length vector. Finally, in each transformer layer, the informative token T_i is spliced with the feature matrix H_i output by the preceding layer to obtain the fusion features. The splicing operation can be expressed as:
X_i = concat(H_i, T_i)
The embodiment updates the fusion features as training data and feeds them into the next transformer layer for processing, i.e., returns to the step of concatenating and fusing the training data to obtain an input feature matrix, until the number of iterations reaches a preset threshold, thereby obtaining the target visual transformer. Typically, the number of fine-tuning rounds, i.e., the preset threshold, may lie between several hundred and several thousand.
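A hedged sketch of the layer-wise computation formalized above: X_0 is the concatenation of video and text features, each layer applies multi-head self-attention (Z_i) and a feed-forward network (H_i), and informative tokens T_i are appended to form X_i. The token selection below is a crude stand-in for the token learner, and the dimensions and layer count are assumptions.

```python
import torch
import torch.nn as nn

class JointLayer(nn.Module):
    def __init__(self, dim=768, heads=12, num_tokens=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # stand-in for FCN_2: two fully connected layers mapping learned tokens to dimension D
        self.token_fcn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.num_tokens = num_tokens

    def forward(self, x):                          # x: (B, M+N, D) = X_{i-1}
        z, _ = self.attn(x, x, x)                  # Z_i = MultiHead(X_{i-1})
        h = self.ffn(z)                            # H_i = FFN(Z_i)
        # crude placeholder for the token learner: keep a few summary positions per layer
        t = self.token_fcn(h[:, : self.num_tokens])        # T_i, shape (B, num_tokens, D)
        return torch.cat([h, t], dim=1)            # X_i = concat(H_i, T_i)

f_v = torch.randn(2, 16, 768)                      # M video features
f_t = torch.randn(2, 20, 768)                      # N text/question features
x = torch.cat([f_v, f_t], dim=1)                   # X_0, shape (B, M+N, D)
for layer in [JointLayer() for _ in range(2)]:     # a couple of layers for illustration
    x = layer(x)
```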
Further, as an optional implementation, the performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features includes:
performing word segmentation on the training data through the word segmentation model to obtain word segmentation features;
projecting the word segmentation features through a convolutional neural network to obtain feature tokens;
modeling the relations between the feature tokens through a self-attention mechanism to obtain feature representations;
and splicing the feature representations with the spliced features to obtain the fusion features.
In this embodiment of the invention, the feature representation of each modality is compressed by learned tokens and converted into a set of meaningful token representations for better multi-modal fusion. Specifically, a TokenLearner model (word segmentation module) is adopted to learn the tokens of each modality and convert them into token representations of the same dimension. The advantage is that the number of feature representations can be significantly reduced, thereby improving the efficiency and accuracy of the model. In addition, since TokenLearner is an adaptive mechanism, it can automatically select the most important information according to the input data, further improving the effect of the model. For the video input, the multi-scale, multi-stream features f_{vi} are adopted, where i is the index of the feature stream. First, word segmentation is performed on the training data through the word segmentation model to obtain word segmentation features; specifically, the segmented text features and the video features are added to obtain an integrated feature f = f_{vi} + r. Then, a convolutional neural network (CNN) is used to project into an N-dimensional token representation space, a self-attention mechanism is used to model the relations between different tokens to obtain feature representations, and the feature representations are spliced with the spliced features to obtain the fusion features. Specifically, the input feature r is first transformed by a linear layer Φ(r) into a tensor of shape L x (T·H·W), where T, H and W denote the time axis, height dimension and width dimension of the video features, respectively. It is then converted into a tensor of shape C x (T·H·W) through a linear layer and added to the original video features to obtain a new feature representation f. Next, a convolutional neural network (CNN) converts it into a feature representation of shape T x H x W x N, and a softmax function is used to select the most important information in each token representation. Finally, a self-attention mechanism models the relations between the different tokens, and this feature representation is fused with the original video features. The visual transformer trained in this way can be better adapted to video-text alignment tasks.
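A minimal TokenLearner-style sketch matching the description above: a 1x1 convolution predicts N spatial attention maps, a softmax over positions picks out the most informative content, and each map produces one token; a self-attention layer can then model the relations between the resulting tokens. The shapes and the number of tokens are assumptions, and this is only one common way to realize such a module.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Predict N spatial attention maps and pool the feature map into N tokens."""
    def __init__(self, channels, num_tokens=8):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x):                           # x: (B, C, H, W) per-frame features
        attn = self.attn_conv(x).flatten(2).softmax(dim=-1)     # (B, N, H*W)
        feats = x.flatten(2)                                     # (B, C, H*W)
        return torch.einsum("bnl,bcl->bnc", attn, feats)         # (B, N, C) tokens

tokens = TokenLearner(256)(torch.randn(2, 256, 14, 14))          # (2, 8, 256)
# relation modeling between tokens via self-attention, as described in the text
related, _ = nn.MultiheadAttention(256, 8, batch_first=True)(tokens, tokens, tokens)
```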
Further, as an optional implementation, the performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model includes:
obtaining sample data, the sample data comprising a visual representation and a text representation;
extracting the sample data through the target visual transformer to obtain a query representation;
aligning the query representation with the text representation to obtain a text similarity;
and updating parameters of the text transformer according to the text similarity to obtain the target query transformation model.
In this embodiment of the invention, the visual representation and the text representation need to be aligned so as to maximize the mutual information between them. Specifically, the embodiment accomplishes this task by learning the visual-text similarity of positive sample pairs in contrast to negative sample pairs. Sample data are first obtained, comprising a visual representation and a text representation: the visual representation is a feature extracted from the video images, which can be done with the previously constructed visual encoder, and the text representation is typically obtained by feeding natural-language text into the text transformer. Using the queries as a bridge between the visual representation and the text representation, the query representation Z from the visual encoder is aligned with the output embedding t of the text transformer. Specifically, the sample data are extracted through the target visual transformer to obtain the query representation, and the query representation is then aligned with the text representation to obtain the text similarity. Since the query representation Z contains multiple output embeddings, the pairwise similarity between each query output and t is computed first, and the highest similarity is selected as the visual-text similarity. To avoid information leakage, a uni-modal self-attention mask is used, so that the queries and the text are invisible to each other. Finally, the parameters of the text transformer are updated according to the text similarity to obtain the target query transformation model. In this process, a fixed visual transformer can be chosen and only the text transformer is trained; the advantage is that model parameters and computation can be reduced, and the already trained visual transformer can be utilized more fully. Specifically, with a fixed visual transformer, the visual encoder and the Q-Former model are combined for visual-text alignment training: the already trained visual transformer is used to extract the query representation Z, which is then aligned with the output embedding t of the text transformer. The visual transformer is fixed here and not trained. Compared with training both the visual and the text transformer in an end-to-end model, this approach reduces model parameters and computation and makes fuller use of the already trained transformer encoder.
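The similarity computation described above (pairwise similarity between every query output and the text embedding, with the maximum taken as the visual-text similarity) can be sketched as an in-batch contrastive objective. The temperature, the use of cosine similarity, and the symmetric cross-entropy loss are assumptions; the text above only fixes the max-over-queries rule and the positive/negative pair setup.

```python
import torch
import torch.nn.functional as F

def visual_text_similarity(Z, t):
    # Z: (B, Nq, D) query outputs from the frozen target visual transformer
    # t: (B, D)     output embedding of the text transformer
    Z = F.normalize(Z, dim=-1)
    t = F.normalize(t, dim=-1)
    sim = torch.einsum("bqd,kd->bkq", Z, t)        # pairwise query-text similarities
    return sim.max(dim=-1).values                  # (B, B): max over the Nq queries

def contrastive_alignment_loss(Z, t, temperature=0.07):
    logits = visual_text_similarity(Z, t) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # matched video-text pairs on the diagonal are positives; the rest are negatives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(4, 32, 768), torch.randn(4, 768))
```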
In one possible embodiment, the query transformation model is connected to the language model through a fully connected layer and trained so that the multi-modal interaction model generates question-answer output. In the generative pre-training phase, the Q-Former (with the frozen image encoder) is connected to a frozen LLM to take advantage of the LLM's generative language capability. The query embedding Z is linearly projected by a fully connected (FC) layer to the same dimension as the text embedding of the LLM. The projected query embeddings are then spliced with the input text embeddings, conditioning the LLM as a soft visual prompt and providing it with the visual representation extracted by the Q-Former. Because the Q-Former has been pre-trained to extract language-related visual representations, it can effectively act as an information bottleneck, providing the LLM with the most useful information and removing irrelevant visual information. This relieves the LLM of learning visual-language alignment and thereby alleviates the forgetting problem in the model.
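As a sketch of this generative stage, the fully connected layer below projects the query embeddings to the language model's embedding width and prepends them to the text embeddings as a soft visual prompt. The frozen module standing in for the LLM is just a small Transformer encoder, not a real language model, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SoftPromptConnector(nn.Module):
    """Project Q-Former query embeddings and prepend them to the LLM text embeddings."""
    def __init__(self, q_dim=768, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(q_dim, llm_dim)      # the FC layer described in the text

    def forward(self, query_embeds, text_embeds):
        # query_embeds: (B, Nq, q_dim); text_embeds: (B, L, llm_dim)
        visual_prompt = self.proj(query_embeds)             # (B, Nq, llm_dim)
        return torch.cat([visual_prompt, text_embeds], dim=1)

# A frozen stand-in for the language model; its parameters receive no updates.
frozen_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True), num_layers=2)
for p in frozen_lm.parameters():
    p.requires_grad = False

inputs = SoftPromptConnector()(torch.randn(2, 32, 768), torch.randn(2, 16, 2048))
hidden = frozen_lm(inputs)                                  # (2, 48, 2048)
```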
In another aspect, referring to FIG. 5, an embodiment of the present invention further provides a multi-modal interaction device based on an improved alignment method, where the device includes:
a first module 501, configured to acquire multi-modal data;
a second module 502, configured to construct a query transformation model, the query transformation model comprising a visual encoder, a visual transformer and a text transformer;
a third module 503, configured to improve the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
a fourth module 504, configured to train the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
a fifth module 505, configured to freeze the target visual transformer, and to perform visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
a sixth module 506, configured to connect the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
and a seventh module 507, configured to input the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
Referring to FIG. 6, an embodiment of the present invention further provides an electronic device, including a processor 601 and a memory 602; the memory is used for storing programs; the processor executes the program to implement the method as described above.
Corresponding to the method of FIG. 1, an embodiment of the present invention also provides a computer-readable storage medium storing a program to be executed by a processor to implement the method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions may be read from the computer-readable storage medium by a processor of a computer device and executed by the processor, causing the computer device to perform the method shown in FIG. 1.
In summary, the embodiments of the invention have the following advantages: the multi-modal interaction model provided by the embodiments can process different types of information in video frames (such as RGB, optical flow, etc.) by means of the multi-stream video encoder and feed it into the query transformation model for cross-modal interaction. In addition, a multi-level alignment mechanism is introduced, which can effectively improve the multi-modal alignment effect on video. Finally, the language model is used for reasoning and answering, thereby improving the multi-modal interaction capability on video.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to give a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A knowledge fusion multi-modal interaction method based on an improved alignment method, characterized by comprising:
acquiring multi-modal data;
constructing a query transformation model, wherein the query transformation model comprises a visual encoder, a visual transformer and a text transformer;
improving the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
training the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
freezing the target visual transformer, and performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
connecting the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
and inputting the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
2. The multi-modal interaction method according to claim 1, wherein the constructing a query transformation model comprises:
freezing the visual encoder, and connecting the frozen visual encoder with the visual transformer;
and interacting the visual transformer with the text transformer through a shared self-attention layer to construct the query transformation model.
3. The multi-modal interaction method according to claim 1, wherein the improving the visual transformer according to the multi-stream feature extraction method to obtain a multi-scale visual transformer comprises:
obtaining a plurality of video streams from the output of the visual encoder;
and adding a residual network to the visual transformer to obtain the multi-scale visual transformer, wherein the residual network is used for performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features, and comprises a residual block, a spatial pyramid pooling layer and a fully connected layer.
4. The multi-modal interaction method according to claim 3, wherein the performing multi-stream feature extraction on the plurality of video streams to obtain multi-scale features comprises:
performing convolution and identity mapping on each of the plurality of video streams through the residual block to obtain a set of mapping vectors;
extracting local information from the set of mapping vectors at different scales through the spatial pyramid pooling layer to obtain a set of video stream feature vectors;
concatenating the set of video stream feature vectors to obtain a concatenated data set;
and mapping the concatenated data set through the fully connected layer to obtain the multi-scale features.
5. The multi-modal interaction method according to claim 4, wherein the training the multi-scale visual transformer according to the joint text method to obtain a target visual transformer comprises:
acquiring training data, wherein the training data comprises the multi-scale features and text features;
concatenating and fusing the training data to obtain an input feature matrix;
performing multi-head self-attention on the input feature matrix through a self-attention layer to obtain an output feature matrix;
performing feature extraction on the output feature matrix through a feedforward neural network to obtain spliced features;
performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features;
and updating the fusion features as training data, and returning to the step of concatenating and fusing the training data to obtain an input feature matrix, until the number of iterations reaches a preset threshold, to obtain the target visual transformer.
6. The multi-modal interaction method according to claim 5, wherein the performing multi-modal fusion on the spliced features in combination with a word segmentation model to obtain fusion features comprises:
performing word segmentation on the training data through the word segmentation model to obtain word segmentation features;
projecting the word segmentation features through a convolutional neural network to obtain feature tokens;
modeling the relations between the feature tokens through a self-attention mechanism to obtain feature representations;
and splicing the feature representations with the spliced features to obtain the fusion features.
7. The multi-modal interaction method according to claim 1, wherein the performing visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model comprises:
obtaining sample data, the sample data comprising a visual representation and a text representation;
extracting the sample data through the target visual transformer to obtain a query representation;
aligning the query representation with the text representation to obtain a text similarity;
and updating parameters of the text transformer according to the text similarity to obtain the target query transformation model.
8. A multi-modal interaction device based on an improved alignment method, characterized in that the device comprises:
a first module, configured to acquire multi-modal data;
a second module, configured to construct a query transformation model, the query transformation model comprising a visual encoder, a visual transformer and a text transformer;
a third module, configured to improve the visual transformer according to a multi-stream feature extraction method to obtain a multi-scale visual transformer;
a fourth module, configured to train the multi-scale visual transformer according to a joint text method to obtain a target visual transformer;
a fifth module, configured to freeze the target visual transformer, and to perform visual-text alignment training on the query transformation model according to the target visual transformer to obtain a target query transformation model;
a sixth module, configured to connect the target query transformation model with a language model to obtain a knowledge fusion multi-modal interaction model;
and a seventh module, configured to input the multi-modal data into the knowledge fusion multi-modal interaction model to obtain an interaction result.
9. An electronic device comprising a memory and a processor;
the memory is used for storing programs;
the processor executing the program implements the method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
CN202310977144.1A 2023-08-03 2023-08-03 Knowledge fusion multi-mode interaction method and device based on improved alignment method Pending CN117113270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310977144.1A CN117113270A (en) 2023-08-03 2023-08-03 Knowledge fusion multi-mode interaction method and device based on improved alignment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310977144.1A CN117113270A (en) 2023-08-03 2023-08-03 Knowledge fusion multi-mode interaction method and device based on improved alignment method

Publications (1)

Publication Number Publication Date
CN117113270A true CN117113270A (en) 2023-11-24

Family

ID=88793353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310977144.1A Pending CN117113270A (en) 2023-08-03 2023-08-03 Knowledge fusion multi-mode interaction method and device based on improved alignment method

Country Status (1)

Country Link
CN (1) CN117113270A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648429A (en) * 2024-01-30 2024-03-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN117648429B (en) * 2024-01-30 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model

Similar Documents

Publication Publication Date Title
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
RU2665273C2 (en) Trained visual markers and the method of their production
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN117218498B (en) Multi-modal large language model training method and system based on multi-modal encoder
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN114510939A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN117113270A (en) Knowledge fusion multi-mode interaction method and device based on improved alignment method
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN113283336A (en) Text recognition method and system
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN115221846A (en) Data processing method and related equipment
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
US11182415B2 (en) Vectorization of documents
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN113609326A (en) Image description generation method based on external knowledge and target relation
CN117292146A (en) Industrial scene-oriented method, system and application method for constructing multi-mode large language model
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
CN116467513A (en) Attention mechanism-based multi-mode knowledge graph recommendation method, device and medium
CN116257609A (en) Cross-modal retrieval method and system based on multi-scale text alignment
Huang et al. Knowledge distilled pre-training model for vision-language-navigation
CN108921911B (en) Method for automatically converting structured picture into source code
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection
CN112801138A (en) Multi-person attitude estimation method based on human body topological structure alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination