CN115392365B - Multi-mode feature acquisition method and device and electronic equipment - Google Patents

Multi-mode feature acquisition method and device and electronic equipment

Info

Publication number
CN115392365B
CN115392365B (application CN202210994209.9A)
Authority
CN
China
Prior art keywords
feature
modal
features
video
linear mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210994209.9A
Other languages
Chinese (zh)
Other versions
CN115392365A (en)
Inventor
袁宇辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210994209.9A priority Critical patent/CN115392365B/en
Publication of CN115392365A publication Critical patent/CN115392365A/en
Application granted granted Critical
Publication of CN115392365B publication Critical patent/CN115392365B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a multi-modal feature acquisition method and device and an electronic device; the method comprises the following steps: performing modal feature extraction on a video in a video-text pair to obtain video modal features, and performing modal feature extraction on a text in the video-text pair to obtain text modal features; splicing the video modal features and the text modal features to obtain a spliced feature; performing linear mapping on the spliced feature at least twice to obtain at least two linear mapping features; combining the at least two linear mapping features to determine an intermediate modal feature; and performing modal fusion on the intermediate modal feature, the text modal features, and the video modal features to obtain the multi-modal feature of the video-text pair. By the method, the cross-modal characterization performance of the multi-modal feature can be effectively improved, and the single-modal characterization performance of the multi-modal feature is improved at the same time.

Description

Multi-mode feature acquisition method and device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for acquiring multi-modal characteristics, and an electronic device.
Background
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
In the related art, multi-modal features are usually generated either by directly fusing the features of each modality or by processing the features of each modality independently, so the cross-modal characterization performance and the single-modal characterization performance of the obtained multi-modal features cannot be effectively balanced.
Disclosure of Invention
The embodiment of the application provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for acquiring multi-modal characteristics, which can effectively improve the cross-modal characterization performance of the multi-modal characteristics and simultaneously improve the single-modal characterization performance of the multi-modal characteristics.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a method for acquiring multi-mode characteristics, which comprises the following steps:
Performing modal feature extraction on a video in a video-text pair to obtain video modal features, and performing modal feature extraction on a text in the video-text pair to obtain text modal features;
splicing the video mode characteristics and the text mode characteristics to obtain splicing characteristics;
performing linear mapping on the spliced features at least twice to obtain at least two linear mapping features;
combining the at least two linear mapping features to determine an intermediate modality feature;
And carrying out modal fusion on the intermediate modal feature, the text modal feature and the video modal feature to obtain the multi-modal feature of the video-text pair.
The embodiment of the application provides a device for acquiring multi-mode characteristics, which comprises the following steps:
the feature extraction module is used for carrying out modal feature extraction on the video in the video-text pair to obtain video modal features, and carrying out modal feature extraction on the text in the video-text pair to obtain text modal features;
The splicing module is used for splicing the video mode characteristics and the text mode characteristics to obtain splicing characteristics;
the linear mapping module is used for carrying out linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics;
The determining module is used for combining the at least two linear mapping features to determine intermediate mode features;
and the modal fusion module is used for carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
In some embodiments, the above linear mapping module is further configured to invoke at least two linear mapping networks, and perform linear mapping on the spliced feature to obtain at least two linear mapping features, where the linear mapping features and the linear mapping networks have a one-to-one correspondence; the feature dimensions of the linear mapping features are the same, and feature elements contained in different linear mapping features are different.
In some embodiments, the determining module is further configured to obtain a reference feature set including at least two reference features; comparing each linear mapping feature with each reference feature in the reference feature group to obtain similarity of each linear mapping feature and each reference feature in the reference feature group; determining the weight of each linear mapping feature based on the similarity between each linear mapping feature and each reference feature in the reference feature set; and carrying out weighted summation on the at least two linear mapping features based on the weight of each linear mapping feature to obtain the intermediate modal feature.
In some embodiments, the number of reference features in the reference feature set is N, the number of linear mapping features is M, the M linear mapping features exist in a linear mapping feature sequence, and both M and N are positive integers greater than or equal to 2; the determining module is further configured to obtain feature averages of the first i linear mapping features in the linear mapping feature sequence, and traverse the i to obtain M feature averages; when the M is smaller than or equal to the N, determining the M feature mean values as the reference features, and constructing a reference feature group comprising the M reference features; when M is larger than N, selecting N from the M feature mean values as the reference features, and constructing a reference feature group comprising the N reference features; wherein i is a positive integer less than or equal to M.
In some embodiments, the apparatus for acquiring a multi-modal feature further includes: the acquisition module is used for acquiring a memory buffer area, wherein the memory buffer area comprises N storage bits; and sequentially storing each reference feature in the reference feature group into the memory buffer area, wherein storage bits in the memory buffer area are in one-to-one correspondence with the reference features.
In some embodiments, the obtaining module is further configured to sort the M feature averages according to a sequence of obtaining the feature average, to obtain a feature average sequence; and starting from the last feature mean value in the feature mean value sequence, selecting N feature mean values as the reference features.
In some embodiments, the determining module is further configured to perform the following processing for each of the linear mapping features: based on the similarity of the linear mapping characteristics, determining a similarity index corresponding to each similarity; summing the similarity indexes to obtain a sum value of the similarity indexes; determining the ratio of each similarity index to the sum of the similarity indexes as the similarity probability corresponding to each similarity index respectively; and determining the maximum value in the similarity probability as the weight of the linear mapping characteristic.
In some embodiments, the determining module is further configured to obtain a reference feature set including at least two reference features; comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain similarity between each reference feature in the reference feature group and each linear mapping feature; determining the weight of each reference feature in the reference feature group based on the similarity between each reference feature in the reference feature group and each linear mapping feature; and carrying out weighted summation on the at least two reference features based on the weight of each reference feature to obtain the intermediate mode feature.
In some embodiments, the determining module is further configured to perform the following processing for each of the reference features in the reference feature set: based on the similarity of the reference features, determining a similarity index corresponding to each similarity; summing the similarity indexes to obtain a sum value of the similarity indexes; determining the ratio of each similarity index to the sum of the similarity indexes as the similarity probability corresponding to each similarity index respectively; and determining the maximum value of the similarity probability as the weight of the reference feature.
In some embodiments, the feature extraction module is further configured to perform modal feature extraction on each video frame in the video, to obtain frame modal features of each video frame; and carrying out feature fusion on each frame mode feature to obtain the video mode feature.
In some embodiments, the above-mentioned modal fusion module is further configured to perform modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature; performing modal fusion on the intermediate modal feature and the video modal feature to obtain a second fusion feature; and carrying out modal fusion on the first fusion feature and the second fusion feature to obtain the multi-modal feature of the video-text pair.
In some embodiments, the modality fusion is achieved by a cross-modality encoding network comprising a first fusion network and a second fusion network; the above-mentioned mode fusion module is also used for calling the first fusion network, carrying out feature fusion on the intermediate mode feature and the text mode feature to obtain a first multi-mode feature; invoking the second fusion network to perform feature fusion on the intermediate mode feature and the first multi-mode feature to obtain a second multi-mode feature; and performing feature stitching on the first multi-mode features and the second multi-mode features to obtain the first fusion features.
In some embodiments, the video-text pairs comprise a plurality of video frame-text pairs, the multimodal features of the video-text pairs comprising frame multimodal features of each of the video frame-text pairs; the device for acquiring the multi-mode characteristics further comprises: the cover determining module is used for carrying out cover prediction on each video frame-text pair in the video-text pair based on the frame multi-mode characteristics of each video frame-text pair to obtain the prediction probability that each video frame-text pair is a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as the cover frame-text pair.
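Although the cover-prediction module above is an apparatus feature, the underlying computation is simple: score each video frame-text pair from its frame multi-modal feature and take the pair with the maximum prediction probability. The sketch below is only illustrative; the module name CoverPredictor, the single linear scoring layer, and the sigmoid are assumptions not specified by the application:

```python
import torch
import torch.nn as nn

class CoverPredictor(nn.Module):
    """Illustrative cover-prediction head over frame multi-modal features."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, 1)  # assumed single-logit classifier

    def forward(self, frame_multimodal_feats: torch.Tensor):
        # frame_multimodal_feats: [num_frames, feature_dim]
        probs = torch.sigmoid(self.scorer(frame_multimodal_feats)).squeeze(-1)  # [num_frames]
        cover_index = torch.argmax(probs)  # frame-text pair with the maximum prediction probability
        return probs, cover_index
```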
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the multi-mode characteristic acquisition method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for causing a processor to execute, thereby realizing the method for acquiring the multi-mode characteristics.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the method for acquiring the multi-mode features according to the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the video mode characteristics and the text mode characteristics in the video-text pair are spliced to obtain splicing characteristics, the splicing characteristics are subjected to multiple linear mapping to obtain multiple linear mapping characteristics, the multiple linear mapping characteristics are combined, the middle mode characteristics are determined, and the multi-mode characteristics of the video-text pair are determined by combining the middle mode characteristics, the text mode characteristics and the video mode characteristics. In the process of generating the multi-mode features, the intermediate mode features and the single mode features (namely the text mode features and the video mode features) are combined, so that the single mode characterization performance of each mode feature is reserved to the maximum extent in the generated multi-mode features, and the cross-mode characterization performance of the multi-mode features is effectively improved.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture of a method for acquiring multi-modal features according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-modal feature acquisition device according to an embodiment of the present application;
Fig. 3A to fig. 3F are schematic flow diagrams of a method for acquiring a multi-mode feature according to an embodiment of the present application;
fig. 4A to fig. 4C are schematic diagrams of a method for acquiring a multi-mode feature according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing the embodiments of the present application in further detail, the terms involved in the embodiments of the present application are described; the following explanations apply to these terms.
1) Artificial Intelligence (AI): the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
2) Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language that people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
3) Convolutional Neural Network (CNN): a type of feedforward neural network (Feedforward Neural Network, FNN) with convolutional computation and a deep structure, and one of the representative algorithms of deep learning (Deep Learning). A convolutional neural network has representation learning (Representation Learning) capability and can perform shift-invariant classification (Shift-Invariant Classification) of input images through its hierarchical structure.
4) Multi-modal features: different forms of existence or sources of information may be referred to as modalities. Data composed of two or more modalities is referred to as multi-modal data (multi-modal denotes data of different modal forms, or different formats of the same modality, typically the text modality, image modality, audio modality, video modality, and the like). Multi-modal features refer to features of data acquired from different fields or perspectives for the same described object, where each field or perspective describing the data is called a modality.
5) Modality fusion: the method mainly refers to comprehensive processing of multi-mode data by using a computer, and is responsible for fusing information of all modes to execute a target task. The mode fusion is responsible for effectively integrating the characteristics of a plurality of modes, drawing the advantages of different modes and completing the integration of information.
6) Linear Mapping (LM): a mapping from one vector space V to another vector space W that preserves addition and scalar multiplication, while a linear transformation (linear transformation) is a linear mapping of the linear space V to itself.
In the implementation of the embodiments of the present application, the applicant found that the related art has the following problems:
In the related art, referring to (a) in fig. 4A, fig. 4A is a schematic diagram of a method for obtaining multi-modal features according to an embodiment of the present application. For the generation of multi-modal features, the video modality and the text modality are processed independently, with no interaction before the downstream task. This approach preserves the characterization capability of each modality to the greatest extent, but downstream tasks that require detailed interaction between the video and the text often perform poorly due to the lack of detailed inter-modality interaction.
In the related art, referring to (b) in fig. 4A, for the generation of multi-modal features, a unified cross-modal encoder performs unified cross-modal encoding on the video modal features and the text modal features, and the resulting interaction information containing the video modal features and the text modal features is used directly as the input of the downstream task. This weakens the independent characterization capability of each modality, so the performance is poor.
The embodiment of the application provides a method, a device, an electronic device, a computer readable storage medium and a computer program product for acquiring multi-modal characteristics, which can effectively improve the cross-modal characterization performance of the multi-modal characteristics and simultaneously improve the single-modal characterization performance of the multi-modal characteristics. In the following, an exemplary application when the device is implemented as a server will be described.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a multi-mode feature acquisition system 100 according to an embodiment of the present application, and in order to implement an application scenario of multi-mode feature acquisition, a terminal (a terminal 400 is shown in an example) is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 runs a client 410 and is configured to display a graphical interface 410-1 (graphical interface 410-1 is shown as an example) for use by a user. The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In some embodiments, the server 200 obtains video-text pairs from the terminal 400 and processes the video-text pairs to obtain multi-modal features of the video-text pairs and sends the multi-modal features of the video-text pairs to the terminal 400.
In other embodiments, the terminal 400 obtains and processes the video-text pairs to obtain the multimodal features of the video-text pairs and sends the multimodal features of the video-text pairs to the server 200.
In other embodiments, the embodiments of the present application may be implemented by means of cloud technology (Cloud Technology), which refers to a hosting technology that unifies a series of resources such as hardware, software, and networks in a wide area network or a local area network to implement the calculation, storage, processing, and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool to be used on demand in a flexible and convenient manner. Cloud computing technology will become an important support, since the background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 of a method for obtaining a multi-mode feature according to an embodiment of the present application, and the server 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220. The various components in server 200 are coupled together by bus system 240. It is understood that the bus system 240 is used to enable connected communications between these components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (for example a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 250 optionally includes one or more storage devices physically located remote from processor 210.
Memory 250 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 251, which includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used to implement various basic services and handle hardware-based tasks.
A network communication module 252 for reaching other electronic devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include: Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), and the like.
In some embodiments, the multi-mode feature acquiring device provided by the embodiments of the present application may be implemented in software, and fig. 2 shows the multi-mode feature acquiring device 255 stored in the memory 250, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the feature extraction module 2551, the stitching module 2552, the linear mapping module 2553, the determination module 2554, and the modality fusion module 2555 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
In other embodiments, the multi-modal feature acquisition apparatus provided in the embodiments of the present application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the multi-modal feature acquisition method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
The method for acquiring the multi-mode features provided by the embodiment of the application will be described in conjunction with the exemplary application and implementation of the server or the terminal provided by the embodiment of the application.
Referring to fig. 3A, fig. 3A is a schematic flow chart of a method for acquiring a multi-mode feature according to an embodiment of the present application, which will be described with reference to steps 101 to 105 shown in fig. 3A, where an execution body of the steps 101 to 105 may be a server or a terminal, and an execution body will be described below as an example of the server.
In step 101, performing modal feature extraction on a video in a video-text pair to obtain video modal features, and performing modal feature extraction on a text in the video-text pair to obtain text modal features.
In some embodiments, a video-text pair refers to a collection of videos and texts that have correspondence, and the text in the video-text pair is text that matches the video in the video-text pair, e.g., the video-text pair may be a video with subtitles, a movie with subtitles, etc., then the video in the video-text pair may be movie video visual content, and the text in the video-text pair may be subtitles corresponding to the movie video visual content.
In some embodiments, in step 101, the modal feature extraction of the video in the video-text pair may be implemented as follows: a video modal feature extraction network is invoked to perform modal feature extraction on the video in the video-text pair, so as to obtain the video modal features.
In some embodiments, the video modal feature extraction network may be implemented by a feature extraction network such as a joint scaling network (EfficientNet), a video coding network (Video Swin), or a three-dimensional convolutional network (3D Convolutional Network, C3D).
In some embodiments, in step 101, the extracting of the modal feature from the text in the video-text pair may be implemented as follows: and calling a text modal feature extraction network to extract modal features of the text in the video-text pair, so as to obtain text modal features.
In some embodiments, the Text modality feature extraction network described above may be a Text encoding network (Text-RCNN), a bi-directional encoding network (Bidirectional Encoder Representations from Transformers, BERT), or the like.
As an example, referring to fig. 4B, fig. 4B is a schematic diagram of a method for acquiring a multi-mode feature according to an embodiment of the present application. And carrying out modal feature extraction on the video in the video-text pair to obtain video modal features, and carrying out modal feature extraction on the text in the video-text pair to obtain text modal features.
In this way, the text modal characteristics and the video modal characteristics are correspondingly obtained by respectively carrying out modal characteristic extraction on the video and the text in the video-text pair, so that the subsequent accurate determination of the multi-modal characteristics corresponding to the video-text pair through the text modal characteristics and the video modal characteristics is facilitated, and the text modal characteristics and the video modal characteristics have the respective corresponding representation capabilities of the text and the video, so that the subsequently determined multi-modal characteristics can accurately reflect the video and the text in the video-text pair.
In some embodiments, the extracting of the modal feature of the video in the video-text pair in step 101 may be implemented by: respectively extracting modal characteristics of each video frame in the video to obtain frame modal characteristics of each video frame; and carrying out feature fusion on the modal features of each frame to obtain video modal features.
In some embodiments, the frame modality features are encoded representations of video frames, the video modality features including frame modality features.
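A minimal sketch of the frame-level fusion described above, assuming mean pooling as the feature fusion operation (the application only states that the frame modal features are feature-fused, so mean pooling is an illustrative choice):

```python
import torch

def fuse_frame_features(frame_feats: torch.Tensor) -> torch.Tensor:
    """Fuse per-frame modal features into a single video modal feature.

    frame_feats: [num_frames, D] tensor of frame modal features.
    Mean pooling is used here purely as an illustrative fusion operation.
    """
    return frame_feats.mean(dim=0)  # video modal feature, shape [D]
```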
In step 102, the video mode feature and the text mode feature are spliced to obtain a spliced feature.
In some embodiments, the dimension of the stitching feature is equal to the sum of the dimension of the video modality feature and the dimension of the text modality feature.
In some embodiments, stitching refers to the process of stitching any two features to obtain stitched features.
As an example, the video modality feature $v_{base} \in \mathbb{R}^{D}$ and the text modality feature $t_{base} \in \mathbb{R}^{D}$ are spliced to obtain a spliced feature $v = [v_{base}, t_{base}] \in \mathbb{R}^{2D}$, where $\mathbb{R}^{D}$ denotes the dimension of the video modality feature and of the text modality feature, $\mathbb{R}^{2D}$ denotes the dimension of the spliced feature, $v_{base}$ denotes the video modality feature, $t_{base}$ denotes the text modality feature, and $v = [v_{base}, t_{base}]$ denotes the spliced feature.
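A minimal sketch of the splicing step, assuming PyTorch tensors and an illustrative feature dimension D = 512 (not specified by the application):

```python
import torch

D = 512  # assumed feature dimension
v_base = torch.randn(D)  # video modality feature, v_base in R^D
t_base = torch.randn(D)  # text modality feature,  t_base in R^D

# Splicing (concatenation) along the feature dimension: v = [v_base, t_base] in R^{2D}
v = torch.cat([v_base, t_base], dim=-1)
assert v.shape == (2 * D,)
```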
In this way, the video mode characteristics and the text mode characteristics are spliced to obtain the spliced characteristics, so that the spliced characteristics effectively fuse information of the video mode characteristics and information of the text mode characteristics, the intermediate mode characteristics are convenient to determine subsequently, the intermediate mode characteristics effectively fuse the text mode characteristics and the video mode characteristics, and the multi-mode characteristics of the video-text pairs determined subsequently are favorable for effectively representing information of videos and texts.
In step 103, linear mapping is performed at least twice on the spliced feature, so as to obtain at least two linear mapping features.
In some embodiments, the step 103 may be implemented by: and calling at least two linear mapping networks, and respectively carrying out linear mapping on the splicing characteristics to obtain at least two linear mapping characteristics, wherein the linear mapping characteristics and the linear mapping networks are in one-to-one correspondence.
In some embodiments, the feature dimensions of each linear mapping feature are the same, and different linear mapping features contain different feature elements.
In some embodiments, the linear mapping is a mapping from one vector space V to another vector space W and maintains addition and number multiplication operations, while the linear transformation (linear transformation) is a linear mapping of the linear space V to itself.
In some embodiments, the linear mapping network may be a multi-layer perceptron, which is a neural network of fully connected layers containing at least one hidden layer, and the output of each hidden layer is transformed by an activation function, wherein the hidden layer and the input layer are fully connected, the multi-layer perceptron being a multi-layer feedforward neural network.
In some embodiments, each linear mapping network corresponds to a linear mapping feature, and the network parameters of each linear mapping network are different, so that feature elements contained in the linear mapping features obtained by different linear mapping networks are different.
In some embodiments, the linear mapping is used for reducing the dimension of the spliced feature to obtain a linear mapping feature which is the same as the dimension of the text mode feature and the video mode feature, the linear mapping feature can reflect the information of the text mode feature and the video mode feature at the same time, and feature elements contained in different linear mapping features are different.
In this way, the linear mapping characteristic is obtained by linearly mapping the splicing characteristic, so that dimension adjustment of the splicing characteristic is realized, and the dimension of the obtained linear mapping characteristic is consistent with the dimension of the text modal characteristic and the dimension of the video modal characteristic. The linear mapping characteristics corresponding to each linear mapping are obtained through carrying out linear mapping on the spliced characteristics for a plurality of times, so that the characteristic diversity of the obtained linear mapping characteristics is ensured, and the accuracy of the intermediate mode characteristics determined later is effectively ensured.
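A minimal sketch of the at-least-two linear mapping networks described above; the number of heads M = 4 and the use of plain nn.Linear layers (rather than multi-layer perceptrons) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadLinearMapping(nn.Module):
    """M independent linear mapping networks applied to the spliced feature.

    Each head maps R^{2D} -> R^{D}; because the heads have different parameters,
    the resulting linear mapping features contain different feature elements.
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_heads)])

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: [2D]; returns the linear mapping feature sequence, shape [M, D]
        return torch.stack([head(spliced) for head in self.heads], dim=0)
```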
In step 104, an intermediate modality feature is determined in combination with at least two linear mapping features.
In some embodiments, the intermediate modality features are determined by combining the generated at least two linear mapping features, and feature diversity is effectively ensured due to different feature elements contained in different linear mapping features, so that the determined intermediate modality features can accurately represent cross-modality interaction information of video modality features and text modality features.
As an example, referring to fig. 4C, fig. 4C is a schematic diagram of a method for acquiring a multi-mode feature according to an embodiment of the present application. Splicing the video mode characteristics and the text mode characteristics to obtain splicing characteristics, performing linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics, and determining intermediate mode characteristics by combining the at least two linear mapping characteristics.
In some embodiments, intermediate modality features, also known as bridge modality features, include features of each of the video and text in a video-text pair, capable of characterizing the content of the video and text.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 104 shown in fig. 3B may be implemented by performing the following steps 1041 to 1044.
In step 1041, a set of reference features including at least two reference features is acquired.
In some embodiments, the number of reference features in the set of reference features is N, the number of linear mapping features is M, and the M linear mapping features exist in a linear mapping feature sequence, where M and N are positive integers greater than or equal to 2.
In some embodiments, the linear mapping feature sequence includes M sequentially arranged linear mapping features.
In some embodiments, the reference feature is a feature mean of at least one linear mapping feature, e.g., the reference feature may be the linear mapping feature itself, a feature mean of two linear mapping features, a feature mean of three linear mapping features, etc.
In some embodiments, the reference feature set is a set of a plurality of reference features, the reference feature set comprising at least two reference features arranged in a sequence.
In some embodiments, referring to fig. 3C, fig. 3C is a flow chart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 1041 shown in fig. 3C may be implemented by performing the following steps 10411 to 10413.
In step 10411, a feature average of the first i linear mapping features in the linear mapping feature sequence is obtained, and the i is traversed to obtain M feature averages.
In some embodiments, i is a positive integer less than or equal to M.
As an example, when i=1, the first 1 linear mapping feature in the linear mapping feature sequence is acquired, and the first 1 linear mapping feature is taken as a feature mean of the first 1 linear mapping feature.
As an example, when i=2, a feature average value of the first 2 linear mapping features in the linear mapping feature sequence is obtained, where the feature average value of the first 2 linear mapping features may be feature average values of the 1 st linear mapping feature and the 2 nd linear mapping feature, and the feature average values of the 1 st linear mapping feature and the 2 nd linear mapping feature may be average values of feature elements of the 1 st linear mapping feature and feature elements of the 2 nd linear mapping feature.
As an example, when i=3, a feature average of the first 3 linear mapping features in the linear mapping feature sequence is obtained, where the feature average of the first 3 linear mapping features may be a feature average of the 1 st linear mapping feature, the 2 nd linear mapping feature, and the 3 rd linear mapping feature, and the feature average of the 1 st linear mapping feature, the 2 nd linear mapping feature, and the 3 rd linear mapping feature may be an average of feature elements of the 1 st linear mapping feature, feature elements of the 2 nd linear mapping feature, and feature elements of the 3 rd linear mapping feature.
As an example, the expression of the feature mean of the first 3 linear mapping features in the linear mapping feature sequence may be:

$\bar{x} = \dfrac{x + y + z}{3}$ (1)

where $\bar{x}$ denotes the feature mean of the first 3 linear mapping features in the linear mapping feature sequence, $x$ denotes the feature element value of the 1st linear mapping feature, $y$ denotes the feature element value of the 2nd linear mapping feature, and $z$ denotes the feature element value of the 3rd linear mapping feature.
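A minimal sketch of computing the M feature means (the mean of the first i linear mapping features for i = 1..M), assuming the linear mapping feature sequence is an [M, D] tensor:

```python
import torch

def running_feature_means(mapping_feats: torch.Tensor) -> torch.Tensor:
    """Feature mean of the first i linear mapping features, for i = 1..M.

    mapping_feats: [M, D] linear mapping feature sequence.
    Returns an [M, D] tensor whose (i-1)-th row is the mean of rows 0..i-1.
    """
    cumulative = torch.cumsum(mapping_feats, dim=0)  # [M, D]
    counts = torch.arange(1, mapping_feats.shape[0] + 1,
                          dtype=mapping_feats.dtype).unsqueeze(-1)  # [M, 1]
    return cumulative / counts
```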
In step 10412, when M is less than or equal to N, the M feature means is determined as a reference feature, and a reference feature group including M reference features is constructed.
In some embodiments, when the number M of feature averages is less than or equal to the number N of reference features, all M feature averages are determined to be reference features and a reference feature set including M reference features is constructed.
As an example, when m=3, n=5, i.e., M is smaller than N, each of the 3 feature averages is determined as a reference feature, and a reference feature group including 3 reference features is constructed.
In step 10413, when M is greater than N, N are selected from the M feature averages as reference features, and a reference feature set including N reference features is constructed.
In some embodiments, when the number M of feature means is greater than the number N of reference features, N feature means are selected from the M feature means as reference features, and a reference feature group including N reference features is constructed.
As an example, when m= 5,N =3, i.e., M is greater than N, 3 feature averages are selected as reference features from the 5 feature averages, and a reference feature group including 3 reference features is constructed.
In some embodiments, the selecting N from the M feature averages as the reference features may be implemented by: sequencing the M characteristic average values according to the sequence of the acquired characteristic average values to obtain a characteristic average value sequence; and starting from the last feature mean value in the feature mean value sequence, selecting N feature mean values as reference features.
In some embodiments, the order of acquiring the feature mean may be that the feature mean of the first 1 linear mapping feature is acquired earlier than the feature mean of the first 2 linear mapping features, earlier than the feature mean of the first 3 linear mapping features, and so on.
In some embodiments, the earlier a feature mean is acquired, the earlier it appears in the feature mean sequence, and the later a feature mean is acquired, the later it appears in the sequence; that is, the order of the feature means in the feature mean sequence is consistent with the order in which they were acquired.
In some embodiments, the last feature mean in the feature mean sequence is the feature mean of the first M linear mapping features, that is, the mean of all the linear mapping features, so the last feature mean in the sequence has the highest accuracy. It can be understood that, when N feature means are selected as reference features starting from the last feature mean in the sequence, the N selected feature means are the ones computed over the largest numbers of linear mapping features in the feature mean sequence, and they can therefore better represent the information of the linear mapping features.
In some embodiments, the method for obtaining the multi-modal feature provided by the embodiments of the present application may further store the reference feature by: acquiring a memory buffer area, wherein the memory buffer area comprises N storage bits; and sequentially storing each reference feature in the reference feature group into a memory buffer area, wherein storage bits in the memory buffer area correspond to the reference features one by one.
In some embodiments, the memory buffer may be an area in the memory for storing the reference feature, where the memory buffer includes N storage bits, each of which may store one of the reference features.
It can be understood that the number of the reference features can be specifically set according to the capacity of the memory buffer, so that each reference feature can be stored in the memory buffer, and the linear mapping feature and each reference feature in the memory buffer can be conveniently compared later.
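A minimal sketch combining the selection rule above with the memory buffer: if M ≤ N all feature means become reference features, otherwise the last N feature means in the sequence are selected and stored, one per storage bit (representing the memory buffer as a fixed-size tensor is an illustrative assumption):

```python
import torch

def build_reference_feature_group(feature_means: torch.Tensor, buffer_size: int) -> torch.Tensor:
    """Select reference features from the feature mean sequence and store them.

    feature_means: [M, D] feature mean sequence, ordered by acquisition time.
    buffer_size:   N, the number of storage bits in the memory buffer.
    """
    M = feature_means.shape[0]
    refs = feature_means if M <= buffer_size else feature_means[-buffer_size:]
    memory_buffer = refs.clone()  # one reference feature per storage bit
    return memory_buffer
```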
In step 1042, each linear mapping feature is compared with each reference feature in the reference feature set to obtain the similarity between each linear mapping feature and each reference feature in the reference feature set.
In some embodiments, the similarity between features may be a cosine similarity, which measures similarity by the consistency of values across dimensions, or a Euclidean metric, which focuses on the differences of values across dimensions.
In some embodiments, the comparison may be a process of similarity calculation, where the similarity between each linear mapping feature and each reference feature in the reference feature set is obtained by comparing each linear mapping feature with each reference feature in the reference feature set.
In step 1043, a weight for each linear mapping feature is determined based on the similarity of each linear mapping feature to each reference feature in the set of reference features.
In some embodiments, the weights of the linear mapping features characterize the proportions of the plurality of linear mapping features that correspond when weighted, the greater the weights of the linear mapping features characterize the greater the proportions of the linear mapping features that correspond when weighted.
In some embodiments, referring to fig. 3D, fig. 3D is a flow chart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 1043 shown in fig. 3D may be implemented by performing the following steps 10431 to 10433 for each linear mapping feature, respectively.
In step 10431, based on each similarity of the linear mapping features, determining a similarity index corresponding to each similarity; and summing the similarity indexes to obtain a sum value of the similarity indexes.
In some embodiments, the expression of the similarity index corresponding to the similarity may be:

$T_i = e^{t}$ (2)

where $T_i$ denotes the similarity index corresponding to the similarity, $t$ denotes the similarity, and $e$ is the natural constant, an infinite non-repeating decimal and a transcendental number.
In step 10432, a ratio of each similarity index to the sum of the similarity indexes is determined as a similarity probability corresponding to each similarity index.
In some embodiments, the expression of the similarity probability may be:

$w_i = \dfrac{T_i}{\sum_i T_i}$ (3)

where $w_i$ denotes the similarity probability corresponding to the $i$-th similarity index, $T_i$ denotes the similarity index corresponding to the similarity, and $\sum_i T_i$ denotes the sum of the similarity indexes.
In step 10433, the maximum of the similar probabilities is determined as the weight of the linear mapping feature.
In some embodiments, the step 10433 may be implemented as follows: and sequencing the similarity probabilities to obtain the maximum value in the similarity probabilities, and determining the maximum value of the similarity probabilities as the weight of the linear mapping characteristic.
In step 1044, at least two linear mapping features are weighted and summed based on the weights of the linear mapping features to obtain an intermediate modality feature.
In some embodiments, the expression of the intermediate modality feature may be:

$Y = \sum_i w_i Y_i$ (4)

where $Y$ denotes the intermediate modality feature, $Y_i$ denotes each linear mapping feature, and $w_i$ denotes the weight corresponding to the linear mapping feature.
In this way, the weight corresponding to each linear mapping feature is determined, and the linear mapping features are weighted and summed based on these weights to obtain the intermediate modality feature. Because each linear mapping feature fuses the video modal feature and the text modal feature from a different angle, weighting and summing the linear mapping features according to their weights allows the obtained intermediate modality feature to accurately reflect the information of the video modal feature and the text modal feature.
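A minimal sketch of steps 1041 to 1044 under the assumptions that cosine similarity is used as the comparison measure (one of the options mentioned above) and that the features are PyTorch tensors:

```python
import torch
import torch.nn.functional as F

def intermediate_modality_feature(mapping_feats: torch.Tensor,
                                  reference_feats: torch.Tensor) -> torch.Tensor:
    """Combine the linear mapping features into the intermediate modality feature.

    mapping_feats:   [M, D] linear mapping features Y_i.
    reference_feats: [N, D] reference feature group.
    For every linear mapping feature, similarities to all reference features are
    exponentiated and normalized (equations (2)-(3)); the maximum similarity
    probability is taken as that feature's weight, and the weighted sum gives
    the intermediate modality feature (equation (4)).
    """
    # sims[i, j]: cosine similarity between mapping feature i and reference feature j
    sims = F.cosine_similarity(mapping_feats.unsqueeze(1),
                               reference_feats.unsqueeze(0), dim=-1)  # [M, N]
    probs = torch.softmax(sims, dim=-1)        # e^t / sum(e^t) per mapping feature
    weights = probs.max(dim=-1).values         # [M] maximum similarity probabilities
    return (weights.unsqueeze(-1) * mapping_feats).sum(dim=0)  # [D]
```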
In some embodiments, referring to fig. 3E, fig. 3E is a flow chart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 104 shown in fig. 3E may be implemented by performing the following steps 1045 to 1048.
In step 1045, a set of reference features including at least two reference features is obtained.
In some embodiments, the number of reference features in the set of reference features is N, the number of linear mapping features is M, and the M linear mapping features exist in a linear mapping feature sequence, where M and N are positive integers greater than or equal to 2.
In some embodiments, the linear mapping feature sequence includes M sequentially arranged linear mapping features.
In some embodiments, the reference feature is a feature mean of at least one linear mapping feature, e.g., the reference feature may be the linear mapping feature itself, a feature mean of two linear mapping features, a feature mean of three linear mapping features, etc.
In some embodiments, the reference feature set is a set of a plurality of reference features, the reference feature set comprising at least two reference features arranged in a sequence.
In step 1046, each reference feature in the reference feature set is compared with each linear mapping feature, so as to obtain similarity between each reference feature in the reference feature set and each linear mapping feature.
In some embodiments, the similarity between features may be a cosine similarity, which measures similarity by the consistency of values across dimensions, or a Euclidean metric, which focuses on the differences of values across dimensions.
In some embodiments, the comparison may be a process of similarity calculation, where the similarity between each reference feature in the reference feature set and each linear mapping feature is obtained by comparing each linear mapping feature with each reference feature in the reference feature set.
In step 1047, a weight of each reference feature in the set of reference features is determined based on the similarity of each reference feature in the set of reference features to each linear mapping feature.
In some embodiments, the weights of the reference features characterize the proportions of the plurality of reference features that correspond when weighted summed, the greater the weights of the reference features, the greater the proportions of the reference features that correspond when weighted summed.
In some embodiments, referring to fig. 3F, fig. 3F is a flow chart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 1047 shown in fig. 3F may be implemented by performing the following steps 10471 to 10474.
In step 10471, based on the respective similarities of the reference features, a respective similarity index is determined for each of the similarities.
In some embodiments, the expression of the similarity index corresponding to the similarity may be:

$Q_i = e^{q}$ (5)

where $Q_i$ denotes the similarity index corresponding to the similarity, $q$ denotes the similarity, and $e$ is the natural constant, an infinite non-repeating decimal and a transcendental number.
In step 10472, the similarity indices are summed to obtain a sum of similarity indices.
In some embodiments, the sum of the similarity indexes may be expressed as:

$\sum_i Q_i$ (6)

where $Q_i$ denotes each similarity index obtained from equation (5).
In step 10473, the ratio of each similarity index to the sum of the similarity indexes is determined as a similarity probability corresponding to each similarity index.
In some embodiments, the expression of the similarity probability may be:

$w_i = \dfrac{Q_i}{\sum_i Q_i}$ (7)

where $w_i$ denotes the similarity probability corresponding to the $i$-th similarity index, $Q_i$ denotes the similarity index corresponding to the similarity, and $\sum_i Q_i$ denotes the sum of the similarity indexes.
In step 10474, the maximum value of the similar probability is determined as the weight of the reference feature.
In some embodiments, step 10474 described above may be implemented as follows: the similarity probabilities are sorted to obtain the maximum value among the similarity probabilities, and the maximum value of the similarity probabilities is determined as the weight of the reference feature.
In step 1048, at least two reference features are weighted and summed based on the weights of the reference features to obtain an intermediate mode feature.
In some embodiments, the expression of the intermediate modality feature may be:
Y = Σ w·Q_i (8)
where Y represents the intermediate modality feature, Q_i represents each reference feature, and w represents the weight corresponding to the reference feature.
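As an example, the following minimal sketch (Python with PyTorch; all sizes and tensor names are hypothetical and not taken from the embodiment) illustrates steps 10471 to 10474 and step 1048, assuming the similarities have already been computed:

import torch

sims = torch.randn(8, 5)        # similarities of N = 8 reference features to M = 5 linear mapping features (hypothetical sizes)
Q = torch.randn(8, 256)         # the N reference features, feature length D = 256 (hypothetical)

# steps 10471-10473: exponentiate each similarity and divide by the sum of the exponentials,
# i.e. a softmax over each reference feature's similarities
probs = torch.softmax(sims, dim=1)

# step 10474: the maximum similarity probability becomes the weight of that reference feature
weights = probs.max(dim=1).values

# step 1048: weighted summation of the reference features yields the intermediate modality feature
intermediate = (weights.unsqueeze(1) * Q).sum(dim=0)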
In this way, the intermediate modal feature is obtained by determining the weight corresponding to each reference feature and performing a weighted summation of the reference features based on those weights. Because the reference features fuse the video modal features and the text modal features from different angles, and are weighted and summed according to their weights, the obtained intermediate modal feature can accurately reflect the information of both the video modal features and the text modal features.
In step 105, the intermediate modality feature, the text modality feature, and the video modality feature are subjected to modality fusion, so as to obtain a multi-modality feature of the video-text pair.
In some embodiments, modality fusion mainly refers to the comprehensive processing of multi-modal data by a computer; it is responsible for fusing the information of each modality in order to perform a target task, effectively integrating the features of multiple modalities and drawing on the advantages of the different modalities to complete the integration of information.
In some embodiments, the multi-modal feature of the video-text pair is obtained by performing modality fusion on the intermediate modal feature, the text modal feature and the video modal feature. The participation of the intermediate modal feature in the modality fusion process effectively improves the cross-modal characterization capability of the generated multi-modal feature, while the respective single-modal characterization capabilities of the video modal feature and the text modal feature in the video-text pair are effectively retained.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart of a method for acquiring a multi-modal feature according to an embodiment of the present application. Step 105 shown in fig. 3B may be implemented by performing the following steps 1051 to 1053.
In step 1051, a first fusion feature is obtained by performing a modal fusion of the intermediate modal feature and the text modal feature.
In some embodiments, the first fusion feature is obtained by performing modal fusion on the intermediate modal feature and the text modal feature, and the first fusion feature effectively fuses the text modal feature and the video modal feature.
In some embodiments, the modality fusion is implemented by a Cross-modality encoding network (Cross-Modal Encoder) that includes a first fusion network and a second fusion network.
In some embodiments, step 1051 described above may be implemented as follows: invoking a first fusion network, and carrying out feature fusion on the middle modal feature and the text modal feature to obtain a first multi-modal feature; invoking a second fusion network to perform feature fusion on the intermediate modal feature and the first multi-modal feature to obtain a second multi-modal feature; and performing feature stitching on the first multi-mode features and the second multi-mode features to obtain first fusion features.
In some embodiments, the first fusion network and the second fusion network have the same network structure, the first fusion network is used for carrying out feature fusion on the intermediate mode feature and the text mode feature, and the second fusion network is used for carrying out feature fusion on the intermediate mode feature and the first multi-mode feature.
In step 1052, a second fusion feature is obtained by performing a modal fusion of the intermediate modal feature and the video modal feature.
In some embodiments, step 1052 described above may be implemented as follows: invoking a first fusion network to perform feature fusion on the middle mode feature and the video mode feature to obtain a third multi-mode feature; invoking a second fusion network to perform feature fusion on the intermediate modal feature and the third multi-modal feature to obtain a fourth multi-modal feature; and performing feature stitching on the third multi-mode feature and the fourth multi-mode feature to obtain a second fusion feature.
In some embodiments, the first fusion network and the second fusion network have the same network structure; here, the first fusion network is used for feature fusion of the intermediate modal feature and the video modal feature, and the second fusion network is used for feature fusion of the intermediate modal feature and the third multi-modal feature.
In step 1053, the first fusion feature and the second fusion feature are subjected to modal fusion, so as to obtain multi-modal features of the video-text pair.
In some embodiments, step 1053 described above may be implemented as follows: invoking a first fusion network, and carrying out feature fusion on the first fusion feature and the second fusion feature to obtain a fifth multi-modal feature; invoking a second fusion network to perform feature fusion on the intermediate modal feature and the fifth multi-modal feature to obtain a sixth multi-modal feature; and performing feature stitching on the fifth multi-modal feature and the sixth multi-modal feature to obtain the multi-modal feature of the video-text pair.
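As an example, the following sketch (Python with PyTorch) illustrates one possible reading of steps 1051 to 1053; the fusion block used here, a single linear layer over concatenated inputs, is an illustrative assumption rather than the cross-modal encoding network of the embodiment, and all sizes are hypothetical:

import torch
import torch.nn as nn

class FusionNet(nn.Module):
    # toy fusion block: concatenates two inputs along the feature axis and projects back to dimension D
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x, y):
        return self.proj(torch.cat([x, y], dim=-1))

D = 256                                         # hypothetical feature length
first_fusion = FusionNet(D)                     # stands in for the first fusion network
second_fusion = FusionNet(D)                    # stands in for the second fusion network (same structure)

def fuse(bridge, modality):
    m1 = first_fusion(bridge, modality)         # first multi-modal feature
    m2 = second_fusion(bridge, m1)              # second multi-modal feature
    return torch.cat([m1, m2], dim=-1)          # feature stitching gives the fusion feature

bridge = torch.randn(1, D)                      # intermediate (bridge) modality feature
text_feat = torch.randn(1, D)                   # pooled text modality feature
video_feat = torch.randn(1, D)                  # pooled video modality feature

first_fused = fuse(bridge, text_feat)           # step 1051
second_fused = fuse(bridge, video_feat)         # step 1052
# step 1053 applies the same two-stage pattern to first_fused and second_fused;
# in practice the stitched 2D-dimensional outputs would be projected back to D first.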
In some embodiments, the video-text pairs include a plurality of video frame-text pairs, and the multimodal features of the video-text pairs include frame multimodal features of each video frame-text pair.
In some embodiments, after step 105 described above, the cover frame of the video-text pair may be determined by: based on the frame multi-mode characteristics of each video frame-text pair, carrying out cover prediction on each video frame-text pair in the video-text pair to obtain the prediction probability that each video frame-text pair is a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as a cover frame-text pair.
In some embodiments, the foregoing prediction probability for each video frame-text pair to be a cover frame-text pair based on the frame multi-modal feature of each video frame-text pair may be implemented by: invoking a full connection layer, and carrying out scoring prediction on the frame multi-mode characteristics of each video frame-text pair to obtain a prediction score of each video frame-text pair as a cover frame-text pair; and normalizing the prediction scores to obtain the prediction probability of each video frame-text pair as a cover frame-text pair.
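As an example, the following minimal sketch (Python with PyTorch; the feature length and the linear scoring head are illustrative assumptions) shows the cover-frame selection described above, assuming the frame multi-modal features are already available:

import torch
import torch.nn as nn

D = 512                                 # hypothetical frame multi-modal feature length
frame_features = torch.randn(16, D)     # frame multi-modal features of 16 video frame-text pairs

score_head = nn.Linear(D, 1)            # fully connected layer producing one cover score per pair

scores = score_head(frame_features).squeeze(-1)   # prediction score of each video frame-text pair
probs = torch.softmax(scores, dim=0)              # normalization turns scores into prediction probabilities
cover_index = int(torch.argmax(probs))            # the pair with the maximum probability is the cover frame-text pair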
In some embodiments, after step 105 described above, the answer to a video question may be determined as follows: a question about the video-text pair is acquired; based on the multi-modal feature of the video-text pair, a question-answering model is invoked to score the question and obtain a probability value for each candidate answer to the question of the video-text pair, and the answer corresponding to the maximum probability value is determined as the answer to the acquired question of the video-text pair.
In some embodiments, a question-answer model is used to predict the probability that an acquired question corresponds to each answer.
In some embodiments, following step 105 described above, the title of the video-text pair may be determined by: based on the frame multi-mode characteristics of each video frame-text pair, performing title prediction on each video frame-text pair in the video-text pair to obtain the prediction probability of each video frame-text pair as the title frame-text pair; and determining the text in the video frame-text pair corresponding to the maximum value in the prediction probability as the title of the video-text pair.
In this way, the video mode characteristics and the text mode characteristics in the video-text pair are spliced to obtain splicing characteristics, the splicing characteristics are subjected to multiple linear mapping to obtain multiple linear mapping characteristics, the multiple linear mapping characteristics are combined, the middle mode characteristics are determined, and the multi-mode characteristics of the video-text pair are determined by combining the middle mode characteristics, the text mode characteristics and the video mode characteristics. In the process of generating the multi-mode features, the intermediate mode features and the single mode features (namely the text mode features and the video mode features) are combined, so that the single mode characterization performance of each mode feature is reserved to the maximum extent in the generated multi-mode features, and the cross-mode characterization performance of the multi-mode features is effectively improved.
In the following, an exemplary application of an embodiment of the present application in an application scenario of an actual video-text pair will be described.
In the field of Video-Language Pre-training, the related art builds Cross-Modal Learning structures based on video-text interaction as a training means, generating effective multi-modal characterizations (embeddings) for video-text pairs for use by subsequent downstream tasks, such as video-text retrieval (Video/Text Retrieval), video title generation (Video Captioning), video question answering (Video Question Answering), and the like.
Aiming at the problem that independent modes or fusion modes of video-text pairs are difficult to effectively balance in cross-mode learning, the application provides a learnable bridge mode (namely the intermediate mode described above) which is used as an intermediate mode of video-text pair interaction, and in the process of video and text interaction, the video and text modes do not directly interact with each other, but interact with the bridge mode respectively. The method not only can effectively learn cross-modal information, but also can effectively reserve the characterization capability of each individual modality.
In order to realize a more effective cross-modal learning mode, the embodiment of the application provides a concept of a bridge mode, and referring to fig. 4A, the bridge mode characterization is generated based on the content of a video mode and a text mode and the past sample characterization in a cache, and the video mode and the text mode do not directly interact with each other in the training process, but interact with the bridge mode respectively to learn cross-modal information. The method can not only effectively learn cross-mode interaction information, but also can reserve the representation capability of each single mode of the video mode and the text mode.
In some embodiments, referring to fig. 4B, fig. 4B is a schematic diagram of a method for obtaining a multi-modal feature according to an embodiment of the present application.
First, a video-text pair is acquired, and the video-text pair is decomposed to obtain a video of the video-text pair and a text of the video-text pair.
And then, carrying out modal feature extraction on the video of the video-text pair to obtain video modal features, and carrying out modal feature extraction on the text of the video-text pair to obtain text modal features.
As an example, the modal feature extraction of the video in a video-text pair may be achieved as follows: the video of the video-text pair is decomposed to obtain a video frame sequence, and the video frame sequence is encoded to obtain the video modality features. The encoding of the video frame sequence can be realized through encoding networks such as a joint scaling network (EfficientNet), a video encoding network (Video Swin), or a three-dimensional convolution network (3D Convolutional Network, C3D).
As an example, the modal feature extraction of text for video-text pairs may be achieved by: decomposing the text of the video-text pair to obtain a word sequence; and encoding the word sequence to obtain the text modal characteristics. The encoding of the word sequence may be performed through a Text encoding network (Text-RCNN), a bi-directional encoding network (Bidirectional Encoder Representations from Transformers, BERT), or the like.
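As an example, a minimal sketch of the two unimodal extraction branches is given below (Python with PyTorch); the plain linear projections are placeholders standing in for the encoding networks named above (e.g. Video Swin or BERT), and all sizes are assumptions:

import torch
import torch.nn as nn

D = 256                                     # hypothetical shared feature length

# placeholder video branch: a linear projection stands in for an encoder such as EfficientNet or Video Swin
video_encoder = nn.Linear(1024, D)
frames = torch.randn(8, 1024)               # 8 decoded video frames, assumed pre-extracted 1024-d descriptors
video_modal_features = video_encoder(frames)    # F x D video modality features

# placeholder text branch: a linear projection stands in for an encoder such as BERT or Text-RCNN
text_encoder = nn.Linear(768, D)
tokens = torch.randn(12, 768)               # 12 word embeddings of the paired text, assumed 768-d
text_modal_features = text_encoder(tokens)      # L x D text modality features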
Referring to fig. 4B, the intermediate modality features are generated from the video modality features and the text modality features. During modality fusion, the video modality features and the text modality features do not interact with each other directly; each of them interacts with the generated intermediate modality features instead. The text modality features are denoted t ∈ R^{L×D}, the video modality features v ∈ R^{F×D}, and the intermediate modality features b ∈ R^{B×D}, where D represents the feature length (the same for the text, video and intermediate modality features) and L, F, B represent the numbers of the corresponding modality features, respectively.
Referring to fig. 4B, the generation of the intermediate modality features may be implemented by means of a memory buffer: a memory buffer (Memory Bank) of length N is created, M = [M_1, M_2, …, M_N] ∈ R^{N×D}, whose initial content is empty, where N represents the size of the buffer. Here, each M_j can be understood as a representation unit of the bridge modality; for a specific video-text input sample, the bridge modality is formed by combining these representation units with certain weights.
The video modality feature v ∈ R^{F×D} and the text modality feature t ∈ R^{L×D} are spliced to obtain a splicing feature V = [v_base, t_base] ∈ R^{2D}. A series of learnable linear mapping layers (typically multi-layer perceptrons, MLP) then maps the splicing feature V = [v_base, t_base] from dimension 2D to dimension D, producing candidate features b̂_i, where the subscript i = 1, …, B indicates that B intermediate modality features are to be generated. The cosine similarity between each b̂_i and M = [M_1, M_2, …, M_N] in the buffer is calculated, and after all cosine similarities are obtained, a normalization calculation (Softmax) yields the weight corresponding to each M_j. The final intermediate modality feature is then equal to the weighted sum of all M_j with their weights.
As an example, the expression of the cosine similarity may be:
s_{i,j} = s_c(b̂_i, M_j) (9)
where s_{i,j} characterizes the similarity between the i-th intermediate modality feature and the j-th average feature in the buffer, s_c characterizes the cosine similarity function, b̂_i characterizes the i-th intermediate modality feature, and M_j characterizes the j-th average feature in the buffer.
As an example, the expression of the normalization calculation may be:
p_{i,j} = Softmax_j(s_{i,j}) (10)
where Softmax_j characterizes the normalization function applied over j, s_{i,j} characterizes the similarity between the i-th intermediate modality feature and the j-th average feature in the buffer, and p_{i,j} characterizes the weight corresponding to M_j.
As an example, the expression of the intermediate modality feature may be:
b = pM (11)
where b represents the intermediate modality features, p represents the weights corresponding to the average features in the buffer, and M represents the average features in the buffer.
When the memory buffer has not yet been fully written, each candidate feature b̂_i is directly adopted as the intermediate modality feature; at the same time, the average value of all b̂_i is written into the memory buffer to establish the initial value of the memory buffer M = [M_1, M_2, …, M_N], thereby establishing the intermediate modality features.
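As an example, the following sketch (Python with PyTorch) puts formulas (9) to (11) together with the memory bank described above; the pooling of the spliced features, the buffer initialization and all sizes are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

D, L, F_frames, B, N = 256, 12, 8, 4, 32    # hypothetical sizes: feature length, text length, frame count, bridge size, buffer size

t = torch.randn(L, D)                       # text modality features  t ∈ R^{L×D}
v = torch.randn(F_frames, D)                # video modality features v ∈ R^{F×D}
memory = torch.randn(N, D)                  # memory bank M ∈ R^{N×D} (assumed already initialized)

# splice pooled video and text features into one 2D-dimensional vector (mean pooling is an assumption here)
splice = torch.cat([v.mean(dim=0), t.mean(dim=0)], dim=-1)

# B learnable linear mapping layers project the splice from dimension 2D to dimension D
mappers = nn.ModuleList([nn.Linear(2 * D, D) for _ in range(B)])
candidates = torch.stack([m(splice) for m in mappers])              # B candidate bridge features

# formula (9): cosine similarity between each candidate and each memory slot (dot product of L2-normalized vectors)
sims = F.normalize(candidates, dim=-1) @ F.normalize(memory, dim=-1).t()   # B x N

# formula (10): softmax over the memory slots gives the weight of each M_j
weights = torch.softmax(sims, dim=1)

# formula (11): the bridge (intermediate) modality features are the weighted sums of the memory slots
bridge = weights @ memory                                           # b = pM, shape B x D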
In some embodiments, referring to fig. 4C, fig. 4C is a schematic diagram of a method for obtaining a multi-modal feature according to an embodiment of the present application. And carrying out linear mapping on the text modal characteristics and the video modal characteristics to obtain a plurality of fusion modal characteristics, writing the average value of the fusion modal characteristics into a data buffer area, and carrying out normalization calculation on the average value of the fusion modal characteristics and the average value in the data buffer area to obtain intermediate modal characteristics.
The bridge mode (namely the intermediate mode characteristic described above) established by the embodiment of the application can be used as an intermediate medium to perform interactive calculation with the video mode characteristic and the text mode characteristic respectively, learn effective cross-mode information, and can also avoid direct interaction between the video mode characteristic and the text mode characteristic, thereby reserving the independent characterization capability of each mode to the maximum extent. The embodiment of the application can promote the multi-mode pre-training model to learn more effective multi-mode characterization (embedding), thereby bringing better effects for downstream tasks (such as video-text retrieval, video title generation, video question-answering and the like).
In the embodiment of the present application, any model that meets the requirements may be used as the single-modality encoder, including but not limited to the Text-RCNN, BERT, C3D, EfficientNet, Video Swin and other networks mentioned above. The bridge modality can be established in various ways, of which the memory-module-based approach described above is one with a better generation effect. In practical applications, the method is not limited to the illustrated text modality features and video modality features, and can be extended to more modalities (such as audio modality features) as required.
It will be appreciated that, in embodiments of the present application that involve data related to video-text pairs, user permission or consent needs to be obtained when the embodiments of the present application are applied to a specific product or technology, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Continuing with the description below of an exemplary architecture of the multi-modal feature acquisition device 255 implemented as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2, the software modules stored in the multi-modal feature acquisition device 255 of the memory 250 may include: the feature extraction module 2551 is configured to perform modal feature extraction on a video in a video-text pair to obtain a video modal feature, and perform modal feature extraction on a text in the video-text pair to obtain a text modal feature; the splicing module 2552 is configured to splice the video mode feature and the text mode feature to obtain a spliced feature; the linear mapping module 2553 is configured to perform linear mapping on the spliced feature at least twice to obtain at least two linear mapping features; a determining module 2554, configured to combine at least two linear mapping features to determine an intermediate modality feature; the modality fusion module 2555 is configured to perform modality fusion on the middle modality feature, the text modality feature and the video modality feature to obtain a multi-modality feature of the video-text pair.
In some embodiments, the linear mapping module 2553 is further configured to invoke at least two linear mapping networks, and perform linear mapping on the spliced features to obtain at least two linear mapping features, where the linear mapping features and the linear mapping networks have a one-to-one correspondence; the feature dimensions of the linear mapping features are the same, and feature elements contained in different linear mapping features are different.
In some embodiments, the determining module 2554 is further configured to obtain a reference feature set including at least two reference features; comparing each linear mapping feature with each reference feature in the reference feature group to obtain the similarity of each linear mapping feature and each reference feature in the reference feature group; determining weights of the linear mapping features based on the similarity of the linear mapping features and the reference features in the reference feature group; and carrying out weighted summation on at least two linear mapping features based on the weight of each linear mapping feature to obtain the intermediate modal feature.
In some embodiments, the number of reference features in the reference feature set is N, the number of linear mapping features is M, the M linear mapping features exist in a linear mapping feature sequence, and M and N are positive integers greater than or equal to 2; the determining module 2554 is further configured to obtain feature averages of the first i linear mapping features in the linear mapping feature sequence, and traverse i to obtain M feature averages; when M is less than or equal to N, determining M feature means as reference features, and constructing a reference feature group comprising M reference features; when M is larger than N, N are selected from M feature mean values to serve as reference features, and a reference feature group comprising N reference features is constructed; wherein i is a positive integer less than or equal to M.
In some embodiments, the apparatus for acquiring a multi-modal feature further includes: the acquisition module is used for acquiring a memory buffer area, wherein the memory buffer area comprises N storage bits; and sequentially storing each reference feature in the reference feature group into a memory buffer area, wherein storage bits in the memory buffer area correspond to the reference features one by one.
In some embodiments, the obtaining module is further configured to sort the M feature averages according to the order of obtaining the feature averages, to obtain a feature average sequence; and starting from the last feature mean value in the feature mean value sequence, selecting N feature mean values as reference features.
In some embodiments, the determining module 2554 is further configured to perform the following processing for each linear mapping feature: based on each similarity of the linear mapping characteristics, determining a similarity index corresponding to each similarity; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the sum value of the similarity indexes as the similarity probability corresponding to each similarity index respectively; the maximum value in the similar probability is determined as the weight of the linear mapping feature.
In some embodiments, the determining module 2554 is further configured to obtain a reference feature set including at least two reference features; comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain the similarity between each reference feature in the reference feature group and each linear mapping feature; determining weights of all the reference features in the reference feature group based on the similarity of all the reference features in the reference feature group and all the linear mapping features; and carrying out weighted summation on at least two reference features based on the weight of each reference feature to obtain the intermediate mode feature.
In some embodiments, the determining module 2554 is further configured to perform the following processing for each reference feature in the reference feature set: based on the similarity of the reference features, determining a similarity index corresponding to each similarity; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the sum value of the similarity indexes as the similarity probability corresponding to each similarity index respectively; and determining the maximum value of the similarity probability as the weight of the reference feature.
In some embodiments, the feature extraction module 2551 is further configured to perform modal feature extraction on each video frame in the video, to obtain frame modal features of each video frame; and carrying out feature fusion on the modal features of each frame to obtain video modal features.
In some embodiments, the above-mentioned modal fusion module 2555 is further configured to perform modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature; performing modal fusion on the middle modal feature and the video modal feature to obtain a second fusion feature; and carrying out modal fusion on the first fusion feature and the second fusion feature to obtain the multi-modal feature of the video-text pair.
In some embodiments, the modality fusion is achieved by a cross-modality encoding network comprising a first fusion network and a second fusion network; the above-mentioned modal fusion module 2555 is further configured to invoke a first fusion network to perform feature fusion on the intermediate modal feature and the text modal feature, so as to obtain a first multi-modal feature; invoking a second fusion network to perform feature fusion on the intermediate modal feature and the first multi-modal feature to obtain a second multi-modal feature; and performing feature stitching on the first multi-mode features and the second multi-mode features to obtain first fusion features.
In some embodiments, the video-text pairs include a plurality of video frame-text pairs, and the multimodal features of the video-text pairs include frame multimodal features of each video frame-text pair; the device for acquiring the multi-mode characteristics further comprises: the cover determining module is used for carrying out cover prediction on each video frame-text pair in the video-text pair based on the frame multi-mode characteristics of each video frame-text pair to obtain the prediction probability that each video frame-text pair is a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as a cover frame-text pair.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the computer device executes the method for acquiring the multi-modal characteristics according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for acquiring a multi-modal feature provided by embodiments of the present application, for example, the method for acquiring a multi-modal feature as shown in fig. 3A.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
(1) The video mode characteristics and the text mode characteristics in the video-text pair are spliced to obtain splicing characteristics, the splicing characteristics are subjected to multiple linear mapping to obtain multiple linear mapping characteristics, the multiple linear mapping characteristics are combined, the middle mode characteristics are determined, and the multi-mode characteristics of the video-text pair are determined by combining the middle mode characteristics, the text mode characteristics and the video mode characteristics. In the process of generating the multi-mode features, the intermediate mode features and the single mode features (namely the text mode features and the video mode features) are combined, so that the single mode characterization performance of each mode feature is reserved to the maximum extent in the generated multi-mode features, and the cross-mode characterization performance of the multi-mode features is effectively improved.
(2) The method comprises the steps of respectively carrying out modal feature extraction on a video and a text in a video-text pair to correspondingly obtain text modal features and video modal features, so that the subsequent accurate determination of multi-modal features corresponding to the video-text pair through the text modal features and the video modal features is facilitated, and the text modal features and the video modal features have the respective corresponding text and video characterization capabilities, so that the subsequently determined multi-modal features can accurately reflect the video and the text in the video-text pair.
(3) The video mode characteristics and the text mode characteristics are spliced to obtain splicing characteristics, so that the splicing characteristics effectively fuse information of the video mode characteristics and information of the text mode characteristics, intermediate mode characteristics are convenient to determine subsequently, the intermediate mode characteristics effectively fuse the text mode characteristics and the video mode characteristics, and the multi-mode characteristics of the video-text pairs determined subsequently are favorable for effectively representing information of videos and texts.
(4) The linear mapping characteristic is obtained by carrying out linear mapping on the splicing characteristic, so that dimension adjustment on the splicing characteristic is realized, and the dimension of the obtained linear mapping characteristic is consistent with the dimension of the text modal characteristic and the dimension of the video modal characteristic. The linear mapping characteristics corresponding to each linear mapping are obtained through carrying out linear mapping on the spliced characteristics for a plurality of times, so that the characteristic diversity of the obtained linear mapping characteristics is ensured, and the accuracy of the intermediate mode characteristics determined later is effectively ensured.
(5) The intermediate mode features are determined by combining the generated at least two linear mapping features, and feature diversity is effectively ensured because feature elements contained in different linear mapping features are different, so that the determined intermediate mode features can accurately represent cross-mode interaction information of video mode features and text mode features.
(6) The number of the reference features can be specifically set according to the capacity of the memory buffer zone so as to ensure that each reference feature can be stored in the memory buffer zone, and the linear mapping feature and each reference feature in the memory buffer zone can be conveniently compared later.
(7) The weighted summation is carried out on each linear mapping feature according to the weight, so that the obtained intermediate mode feature can accurately reflect the information of the video mode feature and the text mode feature.
(8) The multi-modal characteristics of the video-text pair are obtained by carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics, and the cross-modal characterization capability of the generated multi-modal characteristics can be effectively improved by the participation of the intermediate modal characteristics in the modal fusion process, while the respective single-modal characterization capabilities of the video modal characteristics and the text modal characteristics in the video-text pair are effectively retained.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (16)

1. A method for acquiring a multi-modal feature, the method comprising:
Performing modal feature extraction on a video in a video-text pair to obtain video modal features, and performing modal feature extraction on a text in the video-text pair to obtain text modal features;
splicing the video mode characteristics and the text mode characteristics to obtain splicing characteristics;
performing linear mapping on the spliced features at least twice to obtain at least two linear mapping features;
Acquiring a reference feature set comprising at least two reference features;
Comparing each linear mapping feature with each reference feature in the reference feature group to obtain similarity of each linear mapping feature and each reference feature in the reference feature group;
determining the weight of each linear mapping feature based on the similarity between each linear mapping feature and each reference feature in the reference feature set;
based on the weight of each linear mapping feature, carrying out weighted summation on the at least two linear mapping features to obtain an intermediate mode feature;
And carrying out modal fusion on the intermediate modal feature, the text modal feature and the video modal feature to obtain the multi-modal feature of the video-text pair.
2. The method of claim 1, wherein the linearly mapping the splice feature at least twice to obtain at least two linear mapping features comprises:
Invoking at least two linear mapping networks, and respectively performing linear mapping on the splicing characteristics to obtain at least two linear mapping characteristics, wherein the linear mapping characteristics and the linear mapping networks are in one-to-one correspondence;
The feature dimensions of the linear mapping features are the same, and feature elements contained in different linear mapping features are different.
3. The method of claim 1, wherein the number of reference features in the set of reference features is N, the number of linear mapping features is M, the M linear mapping features exist in a linear mapping feature sequence, and both M and N are positive integers greater than or equal to 2;
acquiring a reference feature set including at least two reference features, comprising:
acquiring characteristic average values of the first i linear mapping characteristics in the linear mapping characteristic sequence, and traversing the i to obtain M characteristic average values;
When the M is smaller than or equal to the N, determining the M feature mean values as the reference features, and constructing a reference feature group comprising the M reference features;
When M is larger than N, selecting N from the M feature mean values as the reference features, and constructing a reference feature group comprising the N reference features;
Wherein i is a positive integer less than or equal to M.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a memory buffer area, wherein the memory buffer area comprises N storage bits;
and sequentially storing each reference feature in the reference feature group into the memory buffer area, wherein storage bits in the memory buffer area are in one-to-one correspondence with the reference features.
5. A method according to claim 3, wherein selecting N from the M feature averages as the reference features comprises:
Sequencing the M characteristic average values according to the sequence of acquiring the characteristic average values to obtain a characteristic average value sequence;
And starting from the last feature mean value in the feature mean value sequence, selecting N feature mean values as the reference features.
6. The method of claim 1, wherein determining the weight of each of the linear mapping features based on the similarity of each of the linear mapping features to each of the set of reference features comprises:
The following processing is performed for each of the linear mapping features:
based on the similarity of the linear mapping characteristics, determining a similarity index corresponding to each similarity; summing the similarity indexes to obtain a sum value of the similarity indexes;
determining the ratio of each similarity index to the sum of the similarity indexes as the similarity probability corresponding to each similarity index respectively;
And determining the maximum value in the similarity probability as the weight of the linear mapping characteristic.
7. The method according to claim 1, wherein the method further comprises:
Acquiring a reference feature set comprising at least two reference features;
comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain similarity between each reference feature in the reference feature group and each linear mapping feature;
determining the weight of each reference feature in the reference feature group based on the similarity between each reference feature in the reference feature group and each linear mapping feature;
And carrying out weighted summation on the at least two reference features based on the weight of each reference feature to obtain the intermediate mode feature.
8. The method of claim 7, wherein determining the weight of each of the reference features in the set of reference features based on the similarity of each of the reference features in the set of reference features to each of the linear mapping features comprises:
the following processing is respectively executed for each reference feature in the reference feature group:
based on the similarity of the reference features, determining a similarity index corresponding to each similarity;
Summing the similarity indexes to obtain a sum value of the similarity indexes;
determining the ratio of each similarity index to the sum of the similarity indexes as the similarity probability corresponding to each similarity index respectively;
and determining the maximum value of the similarity probability as the weight of the reference feature.
9. The method according to claim 1, wherein the performing the modal feature extraction on the video in the video-text pair to obtain the video modal feature comprises:
Respectively carrying out modal feature extraction on each video frame in the video to obtain frame modal features of each video frame;
and carrying out feature fusion on each frame mode feature to obtain the video mode feature.
10. The method of claim 1, wherein the performing the modal fusion of the intermediate modal feature, the text modal feature, and the video modal feature to obtain the multi-modal feature of the video-text pair comprises:
Performing modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature;
Performing modal fusion on the intermediate modal feature and the video modal feature to obtain a second fusion feature;
and carrying out modal fusion on the first fusion feature and the second fusion feature to obtain the multi-modal feature of the video-text pair.
11. The method of claim 10, wherein the modality fusion is achieved by a cross-modality encoding network comprising a first fusion network and a second fusion network;
performing modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature, including:
invoking the first fusion network, and carrying out feature fusion on the intermediate modal feature and the text modal feature to obtain a first multi-modal feature;
Invoking the second fusion network to perform feature fusion on the intermediate mode feature and the first multi-mode feature to obtain a second multi-mode feature;
and performing feature stitching on the first multi-mode features and the second multi-mode features to obtain the first fusion features.
12. The method of claim 1, wherein the video-text pairs comprise a plurality of video frame-text pairs, the multi-modal features of the video-text pairs comprising frame multi-modal features of each of the video frame-text pairs; the method further comprises the steps of after the intermediate mode feature, the text mode feature and the video mode feature are subjected to mode fusion to obtain the multi-mode feature of the video-text pair:
Performing cover prediction on each video frame-text pair in the video-text pair based on the frame multi-mode characteristics of each video frame-text pair to obtain the prediction probability that each video frame-text pair is a cover frame-text pair;
and determining the video frame-text pair corresponding to the maximum value in the prediction probability as the cover frame-text pair.
13. A multi-modal feature acquisition apparatus, the apparatus comprising:
the feature extraction module is used for carrying out modal feature extraction on the video in the video-text pair to obtain video modal features, and carrying out modal feature extraction on the text in the video-text pair to obtain text modal features;
The splicing module is used for splicing the video mode characteristics and the text mode characteristics to obtain splicing characteristics;
the linear mapping module is used for carrying out linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics;
a determining module, configured to obtain a reference feature set including at least two reference features; comparing each linear mapping feature with each reference feature in the reference feature group to obtain similarity of each linear mapping feature and each reference feature in the reference feature group; determining the weight of each linear mapping feature based on the similarity between each linear mapping feature and each reference feature in the reference feature set; based on the weight of each linear mapping feature, carrying out weighted summation on the at least two linear mapping features to obtain the intermediate modal feature;
and the modal fusion module is used for carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
14. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
A processor for implementing the method of acquiring multi-modal characteristics according to any one of claims 1 to 12 when executing executable instructions or computer programs stored in the memory.
15. A computer-readable storage medium storing computer-executable instructions or a computer program, wherein the computer-executable instructions or the computer program when executed by a processor implement the method of acquiring multi-modal features according to any one of claims 1 to 12.
16. A computer program product comprising computer executable instructions or a computer program which when executed by a processor implements the method of acquiring multimodal features as claimed in any of claims 1 to 12.
CN202210994209.9A 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment Active CN115392365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210994209.9A CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210994209.9A CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115392365A CN115392365A (en) 2022-11-25
CN115392365B true CN115392365B (en) 2024-04-26

Family

ID=84121239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210994209.9A Active CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115392365B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821430A (en) * 2023-04-18 2023-09-29 上海百秋智尚网络服务有限公司 Method and system for realizing customer service matching by utilizing multi-mode algorithm

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN112069884A (en) * 2020-07-28 2020-12-11 中国传媒大学 Violent video classification method, system and storage medium
CN113392270A (en) * 2020-10-30 2021-09-14 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN113569042A (en) * 2021-01-26 2021-10-29 腾讯科技(深圳)有限公司 Text information classification method and device, computer equipment and storage medium
CN113626719A (en) * 2021-10-12 2021-11-09 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment, storage medium and computer program product
WO2021238586A1 (en) * 2020-05-27 2021-12-02 华为技术有限公司 Training method and apparatus, device, and computer readable storage medium
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN114332679A (en) * 2021-12-07 2022-04-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment, storage medium and computer program product
CN114373443A (en) * 2022-01-14 2022-04-19 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114419515A (en) * 2022-01-26 2022-04-29 腾讯科技(深圳)有限公司 Video processing method, machine learning model training method, related device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162799B (en) * 2018-11-28 2023-08-04 腾讯科技(深圳)有限公司 Model training method, machine translation method, and related devices and equipment


Also Published As

Publication number Publication date
CN115392365A (en) 2022-11-25


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant