CN115392365A - Multi-modal feature acquisition method and device and electronic equipment - Google Patents

Multi-modal feature acquisition method and device and electronic equipment

Info

Publication number
CN115392365A
Authority
CN
China
Prior art keywords
modal
feature
video
features
linear mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210994209.9A
Other languages
Chinese (zh)
Other versions
CN115392365B (en)
Inventor
袁宇辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210994209.9A priority Critical patent/CN115392365B/en
Publication of CN115392365A publication Critical patent/CN115392365A/en
Application granted granted Critical
Publication of CN115392365B publication Critical patent/CN115392365B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method and a device for acquiring multi-modal characteristics and electronic equipment; the method comprises the following steps: performing modal feature extraction on a video in a video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature; splicing the video modal characteristics and the text modal characteristics to obtain spliced characteristics; performing linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics; determining an intermediate modal characteristic in combination with the at least two linear mapping characteristics; and carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair. By the method and the device, the cross-modal characterization performance of the multi-modal features can be effectively improved, and meanwhile, the single-modal characterization performance of the multi-modal features is improved.

Description

Method and device for acquiring multi-modal features and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for acquiring multi-modal characteristics and electronic equipment.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
In the related art, for the generation of multi-modal features, features of each modality are usually directly subjected to feature fusion, or the features of each modality are independently processed, so that cross-modal characterization performance and single-modal characterization performance of the obtained multi-modal features cannot be effectively balanced.
Disclosure of Invention
Embodiments of the present application provide a method and an apparatus for obtaining multi-modal features, an electronic device, a computer-readable storage medium, and a computer program product, which can effectively improve cross-modal characterization performance of multi-modal features and improve single-modal characterization performance of multi-modal features.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for acquiring multi-modal features, which comprises the following steps:
performing modal feature extraction on a video in a video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature;
splicing the video modal characteristics and the text modal characteristics to obtain spliced characteristics;
performing linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics;
determining an intermediate modal characteristic in combination with the at least two linear mapping characteristics;
and carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
An embodiment of the present application provides a device for obtaining multi-modal features, which includes:
the feature extraction module is used for performing modal feature extraction on a video in a video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature;
the splicing module is used for splicing the video modal characteristics and the text modal characteristics to obtain splicing characteristics;
the linear mapping module is used for carrying out linear mapping on the splicing characteristics for at least two times to obtain at least two linear mapping characteristics;
a determination module for determining intermediate modal characteristics in combination with the at least two linear mapping characteristics;
and the modal fusion module is used for carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
In some embodiments, the linear mapping module is further configured to invoke at least two linear mapping networks, and perform linear mapping on the splicing features respectively to obtain at least two linear mapping features, where the linear mapping features and the linear mapping networks are in a one-to-one correspondence relationship; the feature dimensions of each linear mapping feature are the same, and feature elements contained in different linear mapping features are different.
In some embodiments, the determining module is further configured to obtain a reference feature group including at least two reference features; comparing each linear mapping feature with each reference feature in the reference feature group respectively to obtain the similarity of each linear mapping feature and each reference feature in the reference feature group; determining the weight of each linear mapping feature based on the similarity of each linear mapping feature and each reference feature in the reference feature group; and performing weighted summation on the at least two linear mapping characteristics based on the weight of each linear mapping characteristic to obtain the intermediate modal characteristic.
In some embodiments, the number of reference features in the reference feature set is N, the number of linear mapping features is M, M of the linear mapping features exist in a linear mapping feature sequence, and M and N are both positive integers greater than or equal to 2; the determining module is further configured to obtain a feature mean of the first i linear mapping features in the linear mapping feature sequence, and traverse i to obtain M feature means; when the M is smaller than or equal to the N, determining the M feature mean values as the reference features, and constructing a reference feature group comprising the M reference features; when M is larger than N, selecting N from the M feature mean values as the reference features, and constructing a reference feature group comprising the N reference features; wherein i is a positive integer less than or equal to M.
In some embodiments, the device for obtaining multi-modal features further includes: an acquisition module, configured to acquire a memory cache area comprising N storage bits, and to sequentially store each reference feature in the reference feature group into the memory cache area, wherein the storage bits in the memory cache area correspond to the reference features one to one.
In some embodiments, the obtaining module is further configured to sort the M feature mean values according to the order of obtaining the feature mean values, so as to obtain a feature mean value sequence; and selecting N characteristic mean values as the reference characteristics from the last characteristic mean value in the characteristic mean value sequence.
In some embodiments, the determining module is further configured to perform the following processing for each linear mapping feature: determining similarity indexes corresponding to the similarity degrees respectively based on the similarity degrees of the linear mapping characteristics; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the similarity index sum value as a similarity probability corresponding to each similarity index; and determining the maximum value in the similarity probability as the weight of the linear mapping characteristic.
In some embodiments, the determining module is further configured to obtain a reference feature group including at least two reference features; comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain the similarity between each reference feature in the reference feature group and each linear mapping feature; determining the weight of each reference feature in the reference feature group based on the similarity of each reference feature in the reference feature group and each linear mapping feature; and carrying out weighted summation on the at least two reference features based on the weight of each reference feature to obtain the intermediate modal feature.
In some embodiments, the determining module is further configured to perform the following processing for each reference feature in the reference feature group: determining similarity indexes corresponding to the similarity degrees respectively based on the similarity degrees of the reference features; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the similarity index sum value as a similarity probability corresponding to each similarity index; and determining the maximum value of the similarity probability as the weight of the reference feature.
In some embodiments, the feature extraction module is further configured to perform modal feature extraction on each video frame in the video, respectively, to obtain a frame modal feature of each video frame; and performing feature fusion on the frame modal features to obtain the video modal features.
In some embodiments, the modal fusion module is further configured to perform modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature; performing modal fusion on the intermediate modal characteristics and the video modal characteristics to obtain second fusion characteristics; performing modal fusion on the first fusion feature and the second fusion feature to obtain a multi-modal feature of the video-text pair.
In some embodiments, the modality fusion is achieved by a cross-modality encoding network comprising a first fusion network and a second fusion network; the modality fusion module is further configured to invoke the first fusion network, and perform feature fusion on the intermediate modality features and the text modality features to obtain first multi-modality features; calling the second fusion network, and performing feature fusion on the intermediate modal features and the first multi-modal features to obtain second multi-modal features; and performing feature splicing on the first multi-modal feature and the second multi-modal feature to obtain the first fusion feature.
In some embodiments, the video-text pairs comprise a plurality of video frame-text pairs, the multi-modal features of the video-text pairs comprising frame multi-modal features of each of the video frame-text pairs; the device for acquiring multi-modal features further comprises: the cover determining module is used for performing cover prediction on each video frame-text pair in the video-text pairs based on the frame multi-modal characteristics of each video frame-text pair to obtain the prediction probability of each video frame-text pair as a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as the cover frame-text pair.
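As an illustrative, non-limiting sketch of the cover prediction described above, the following PyTorch-style code assumes that each frame multi-modal feature is a D-dimensional vector and that the prediction probability is produced by a simple linear scoring head; the CoverPredictor class, its head and the dimension are assumptions for illustration, not part of this embodiment:

```python
import torch
import torch.nn as nn

class CoverPredictor(nn.Module):
    """Illustrative scorer: one cover probability per video frame-text pair."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # assumed linear scoring head

    def forward(self, frame_multimodal_feats: torch.Tensor) -> int:
        # frame_multimodal_feats: (num_frames, dim), the frame multi-modal features
        logits = self.head(frame_multimodal_feats).squeeze(-1)  # (num_frames,)
        probs = torch.sigmoid(logits)       # prediction probability per frame-text pair
        return int(torch.argmax(probs))     # index of the cover frame-text pair

# usage: the frame-text pair with the largest prediction probability is taken as the cover
cover_index = CoverPredictor(512)(torch.randn(8, 512))
```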
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the multi-modal feature acquisition method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for acquiring the multi-modal features provided by the embodiment of the application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the method for acquiring the multi-modal features described in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of splicing video modal characteristics and text modal characteristics in a video-text pair to obtain splicing characteristics, carrying out linear mapping on the splicing characteristics for multiple times to obtain multiple linear mapping characteristics, determining intermediate modal characteristics by combining the multiple linear mapping characteristics, and determining multi-modal characteristics of the video-text pair by combining the intermediate modal characteristics, the text modal characteristics and the video modal characteristics. In the process of generating the multi-modal characteristics, the intermediate modal characteristics and the single modal characteristics (namely the text modal characteristics and the video modal characteristics) are combined, so that the cross-modal characterization performance of the multi-modal characteristics is effectively improved while the single-modal characterization performance of each modal characteristic is kept to the maximum extent in the generated multi-modal characteristics.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture of an acquisition method of multi-modal features provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for obtaining multi-modal features provided in an embodiment of the present application;
fig. 3A to 3F are schematic flow charts of a method for acquiring multi-modal features provided in an embodiment of the present application;
fig. 4A to 4C are schematic diagrams illustrating a method for obtaining multi-modal features according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first\second\third" are only used to distinguish similar objects and do not denote a particular order; it is understood that "first\second\third" may be interchanged in a specific order or sequence, where appropriate, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Artificial Intelligence (AI): a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like.
2) Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field therefore involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
3) Convolutional Neural Network (CNN): a type of Feed-Forward Neural Network (FNN) that involves convolution calculations and has a deep structure, and is one of the representative algorithms of Deep Learning. A convolutional neural network has Representation Learning capability and can perform Shift-Invariant Classification on an input image according to its hierarchical structure.
4) Multi-modal characterization: different forms of presence or sources of information may be referred to as a modality. Data composed of two or more modalities is called multi-modality data (multi-modality is used to represent data forms of different modalities or formats of the same modality, and generally represents a text modality, an image modality, an audio modality, a video modality, and the like). Multimodal features refer to data acquired through different domains or perspectives for the same descriptive object, and each domain or perspective describing the data is called a modality.
5) And (3) modality fusion: the method mainly refers to the comprehensive processing of multi-modal data by using a computer, and is responsible for fusing information of each modality to execute a target task. The modal fusion is responsible for effectively integrating the characteristics of a plurality of modalities, and the advantages of different modalities are drawn, so that the information integration is completed.
6) Linear Mapping (LM): a mapping from one vector space V to another vector space W that preserves vector addition and scalar multiplication, while a linear transformation is a linear mapping of a linear space V to itself.
In the implementation process of the embodiment of the present application, the applicant finds that the following problems exist in the related art:
in the related art, referring to fig. 4A (a), fig. 4A is a schematic diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. For the generation of multi-modal features, by independently processing a video modality and a text modality, no interaction is performed before a downstream task, and although the method can reserve the characterization capability of each modality to the maximum extent, the method is often underperforming for the downstream task needing the video and text interaction details due to the lack of detail interaction between the modalities.
In the related art, referring to fig. 4A (b), for the generation of multi-modal features, a unified cross-modal encoder is used to perform unified cross-modal encoding on the video modal features and the text modal features, and the resulting output, which contains the interaction information of the video modal features and the text modal features, is used as the input of the downstream task. This weakens the individual characterization capability of each modality and often results in poor performance.
Embodiments of the present application provide a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for obtaining multi-modal features, which can effectively improve cross-modal characterization performance of multi-modal features and simultaneously improve single-modal characterization performance of multi-modal features. In the following, an exemplary application will be explained when the device is implemented as a server.
Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a system 100 for obtaining multi-modal features provided in an embodiment of the present application, in order to implement an application scenario of obtaining multi-modal features, a terminal (an example of the terminal 400 is shown) is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 is used by a user to run the client 410, which is displayed on a graphical interface 410-1 (a graphical interface 410-1 is exemplarily shown). The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, etc., but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In some embodiments, the server 200 retrieves the video-text pair from the terminal 400 and processes the video-text pair to obtain the multi-modal features of the video-text pair and sends the multi-modal features of the video-text pair to the terminal 400.
In other embodiments, the terminal 400 obtains the video-text pair and processes the video-text pair to obtain multi-modal features of the video-text pair, and sends the multi-modal features of the video-text pair to the server 200.
In other embodiments, the embodiments of the present application may be implemented by Cloud Technology (Cloud Technology), which refers to a hosting Technology for unifying resources of hardware, software, network, etc. in a wide area network or a local area network to implement calculation, storage, processing, and sharing of data.
The cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like applied based on a cloud computing business model; it can form a resource pool, is used on demand, and is flexible and convenient. Cloud computing technology will become an important support, because the background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 of the method for obtaining multi-modal features according to the embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 251, which includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used for implementing various basic services and for processing hardware-based tasks.
A network communication module 252 for communicating with other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (Wi-Fi), and Universal Serial Bus (USB), among others.
In some embodiments, the multi-modal feature obtaining device provided by the embodiments of the present application may be implemented in software, and fig. 2 shows the multi-modal feature obtaining device 255 stored in the memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: the feature extraction module 2551, the concatenation module 2552, the linear mapping module 2553, the determination module 2554, and the modality fusion module 2555 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the multi-modal feature obtaining device provided in the embodiments of the present application may be implemented in hardware. As an example, the multi-modal feature obtaining device provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the multi-modal feature obtaining method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
The method for obtaining the multi-modal features provided by the embodiment of the present application will be described in conjunction with the exemplary application and implementation of the server or the terminal provided by the embodiment of the present application.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a method for acquiring a multi-modal feature provided in an embodiment of the present application, and will be described with reference to steps 101 to 105 shown in fig. 3A, where an execution subject of steps 101 to 105 described below may be a server or a terminal, and the following description will take the execution subject as an example of the server.
In step 101, performing modal feature extraction on a video in the video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature.
In some embodiments, a video-text pair refers to a set of a video and a text with a corresponding relationship, and the text in the video-text pair is text that matches the video in the video-text pair. For example, the video-text pair may be a video with subtitles, such as a movie with subtitles; in that case, the video in the video-text pair may be the visual content of the movie, and the text in the video-text pair may be the subtitles corresponding to that visual content.
In some embodiments, in the step 101, performing modality feature extraction on the video in the video-text pair to obtain video modality features may be implemented as follows: and calling a video modal feature extraction network, and carrying out modal feature extraction on the video in the video-text pair to obtain the video modal feature.
In some embodiments, the above video modal feature extraction network may be implemented by a feature extraction network such as EfficientNet, Video Swin Transformer, a three-dimensional convolutional network (C3D), and the like.
In some embodiments, in step 101, performing modal feature extraction on the text in the video-text pair to obtain text modal features, which may be implemented as follows: and calling a text modal feature extraction network, and performing modal feature extraction on the text in the video-text pair to obtain text modal features.
In some embodiments, the text modal feature extraction network may be a text encoding network such as TextRCNN, a bidirectional encoding network such as BERT, or the like.
By way of example, referring to fig. 4B, fig. 4B is a schematic diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. And performing modal feature extraction on the video in the video-text pair to obtain video modal features, and performing modal feature extraction on the text in the video-text pair to obtain text modal features.
In this way, the text modal characteristics and the video modal characteristics are correspondingly obtained by respectively extracting the modal characteristics of the video and the text in the video-text pair, so that the multi-modal characteristics corresponding to the video-text pair can be conveniently and accurately determined through the text modal characteristics and the video modal characteristics in the subsequent process.
In some embodiments, the extracting of the modal features of the video in the video-text pair in step 101 above may be implemented by: respectively extracting modal characteristics of each video frame in the video to obtain the frame modal characteristics of each video frame; and performing feature fusion on the modal features of the frames to obtain the modal features of the video.
In some embodiments, the frame modality features are encoded representations of video frames, and the video modality features include individual frame modality features.
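As an illustrative, non-limiting sketch of the frame-level modal feature extraction and feature fusion described above, the following PyTorch-style code uses stand-in backbones and mean pooling as the feature fusion; the names video_backbone and text_backbone, the pooling choice and the dimension D are assumptions for illustration rather than networks prescribed by this embodiment:

```python
import torch
import torch.nn as nn

D = 512
video_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(D))   # stand-in per-frame encoder
text_backbone = nn.Sequential(nn.LazyLinear(D))                  # stand-in text encoder

def extract_modal_features(frames: torch.Tensor, text_emb: torch.Tensor):
    # frames: (num_frames, C, H, W); text_emb: a stand-in vector representation of the text
    frame_feats = video_backbone(frames)     # frame modal features, (num_frames, D)
    v_base = frame_feats.mean(dim=0)         # feature fusion by mean pooling -> video modal feature
    t_base = text_backbone(text_emb)         # text modal feature, (D,)
    return v_base, t_base, frame_feats

v_base, t_base, frame_feats = extract_modal_features(torch.randn(8, 3, 224, 224), torch.randn(768))
```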
In step 102, the video modal characteristics and the text modal characteristics are spliced to obtain splicing characteristics.
In some embodiments, the dimension of the stitching feature is equal to the sum of the dimension of the video modality feature and the dimension of the text modality feature.
In some embodiments, the splicing refers to a process of splicing any two features to obtain a spliced feature.
As an example, the video modal feature v_base ∈ R^D and the text modal feature t_base ∈ R^D are spliced to obtain the splicing feature V = [v_base, t_base] ∈ R^(2D), where R^D denotes the dimension of the video modal feature and of the text modal feature, R^(2D) denotes the dimension of the splicing feature, v_base denotes the video modal feature, t_base denotes the text modal feature, and V = [v_base, t_base] denotes the splicing feature.
Therefore, the splicing characteristics are obtained by splicing the video modal characteristics and the text modal characteristics, so that the information of the video modal characteristics and the information of the text modal characteristics are effectively fused by the splicing characteristics, the subsequent determination of the intermediate modal characteristics is facilitated, the text modal characteristics and the video modal characteristics are effectively fused by the intermediate modal characteristics, and the subsequent determination of the multi-modal characteristics of the video-text pair can effectively represent the information of videos and texts.
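As an illustrative sketch of the splicing in step 102, assuming D-dimensional modal features:

```python
import torch

# Concatenate the D-dimensional video and text modal features into a 2D-dimensional splicing feature.
v_base = torch.randn(512)                  # video modal feature, R^D
t_base = torch.randn(512)                  # text modal feature, R^D
V = torch.cat([v_base, t_base], dim=-1)    # splicing feature, R^(2D)
assert V.shape[-1] == v_base.shape[-1] + t_base.shape[-1]
```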
In step 103, at least two linear mappings are performed on the spliced features to obtain at least two linear mapping features.
In some embodiments, the step 103 may be implemented as follows: and calling at least two linear mapping networks, and respectively carrying out linear mapping on the splicing characteristics to obtain at least two linear mapping characteristics, wherein the linear mapping characteristics and the linear mapping networks are in one-to-one correspondence.
In some embodiments, the feature dimensions of each linear mapping feature are the same, and the feature elements contained in different linear mapping features are different.
In some embodiments, the linear mapping is a mapping from one vector space V to another vector space W that preserves vector addition and scalar multiplication, while a linear transformation is a linear mapping of the linear space V to itself.
In some embodiments, the linear mapping network may be a multi-layer perceptron, which is a neural network comprising at least one hidden layer and consisting of fully-connected layers, and the output of each hidden layer is transformed by an activation function, wherein the hidden layer and the input layer are fully-connected, and the multi-layer perceptron is a multi-layer feedforward neural network.
In some embodiments, each linear mapping network corresponds to one linear mapping feature, and because the network parameters of each linear mapping network are different, the linear mapping features obtained by different linear mapping networks contain different feature elements.
In some embodiments, the linear mapping is configured to perform dimension reduction on the stitching features to obtain linear mapping features with the same dimension as the text modal features and the video modal features, where the linear mapping features can simultaneously reflect information of the text modal features and the video modal features, and feature elements included in different linear mapping features are different.
In this way, the dimension adjustment of the splicing features is realized by performing linear mapping on the splicing features to obtain linear mapping features, so that the dimension of the obtained linear mapping features is consistent with the dimension of the text modal features and the video modal features. The linear mapping characteristics corresponding to each linear mapping are obtained by carrying out linear mapping on the splicing characteristics for multiple times, so that the characteristic diversity of the obtained linear mapping characteristics is ensured, and the accuracy of the subsequently determined intermediate modal characteristics is effectively ensured.
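As an illustrative, non-limiting sketch of step 103, the following code assumes each linear mapping network is a single fully connected layer mapping the 2D-dimensional splicing feature back to D dimensions; M, D and the single-layer choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

D, M = 512, 4
# M independent linear mapping networks, one linear mapping feature per network.
linear_mappers = nn.ModuleList([nn.Linear(2 * D, D) for _ in range(M)])

def linear_map(V: torch.Tensor) -> list[torch.Tensor]:
    # V: splicing feature of dimension 2D; each network has its own parameters, so the
    # M outputs share the dimension D but contain different feature elements.
    return [mapper(V) for mapper in linear_mappers]

linear_features = linear_map(torch.randn(2 * D))   # M linear mapping features, each of dimension D
```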
In step 104, an intermediate modal characteristic is determined in combination with the at least two linear mapping characteristics.
In some embodiments, the intermediate modal characteristics are determined by combining the generated at least two linear mapping characteristics, and since different linear mapping characteristics contain different characteristic elements, the characteristic diversity is effectively ensured, so that the determined intermediate modal characteristics can accurately represent the cross-modal interaction information of the video modal characteristics and the text modal characteristics.
By way of example, referring to fig. 4C, fig. 4C is a schematic diagram of a multi-modal feature obtaining method provided by an embodiment of the present application. The method comprises the steps of splicing the video modal characteristics and the text modal characteristics to obtain spliced characteristics, performing linear mapping on the spliced characteristics at least twice to obtain at least two linear mapping characteristics, and determining intermediate modal characteristics by combining the at least two linear mapping characteristics.
In some embodiments, intermediate modality features, also known as bridge modality features, include features of the video and text, respectively, of the video-text pair, capable of characterizing the content of the video and text.
In some embodiments, referring to fig. 3B, fig. 3B is a flow chart diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 104 shown in fig. 3B may be implemented by performing the following steps 1041 to 1044.
In step 1041, a reference feature set comprising at least two reference features is obtained.
In some embodiments, the number of reference features in the set of reference features is N, the number of linear mapping features is M, the M linear mapping features exist in a sequence of linear mapping features, and M and N are positive integers greater than or equal to 2.
In some embodiments, the sequence of linear mapping features includes M sequentially arranged linear mapping features.
In some embodiments, the reference feature is a feature mean of at least one linear mapping feature, for example, the reference feature may be the linear mapping feature itself, a feature mean of two linear mapping features, a feature mean of three linear mapping features, and the like.
In some embodiments, the reference feature set is a collection of a plurality of reference features, the reference feature set comprising at least two reference features arranged in a sequence.
In some embodiments, referring to fig. 3C, fig. 3C is a flow chart diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 1041 illustrated in fig. 3C may be implemented by performing the following steps 10411 to 10413.
In step 10411, the feature mean values of the first i linear mapping features in the linear mapping feature sequence are obtained, and i is traversed to obtain M feature mean values.
In some embodiments, i is a positive integer less than or equal to M.
As an example, when i = 1, the first 1 linear mapping feature in the linear mapping feature sequence is obtained, and this first linear mapping feature is taken as the feature mean of the first 1 linear mapping feature.
As an example, when i = 2, the feature mean of the first 2 linear mapping features in the linear mapping feature sequence is obtained; the feature mean of the first 2 linear mapping features may be the feature mean of the 1st linear mapping feature and the 2nd linear mapping feature, that is, the average of each feature element of the 1st linear mapping feature and the corresponding feature element of the 2nd linear mapping feature.
As an example, when i = 3, the feature mean of the first 3 linear mapping features in the linear mapping feature sequence is obtained; the feature mean of the first 3 linear mapping features may be the feature mean of the 1st, 2nd and 3rd linear mapping features, that is, the average of each feature element of the 1st linear mapping feature and the corresponding feature elements of the 2nd and 3rd linear mapping features.
As an example, the feature mean of the first 3 linear mapping features in the linear mapping feature sequence may be expressed as:

x̄ = (x + y + z) / 3    (1)

where x̄ denotes the feature mean of the first 3 linear mapping features in the linear mapping feature sequence, x denotes the feature element value of the 1st linear mapping feature, y denotes the feature element value of the 2nd linear mapping feature, and z denotes the feature element value of the 3rd linear mapping feature.
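As an illustrative sketch of step 10411, the M feature means can be computed as running (prefix) averages over the linear mapping feature sequence; the tensor shapes below are assumed for illustration:

```python
import torch

def cumulative_feature_means(linear_features: list[torch.Tensor]) -> list[torch.Tensor]:
    # Feature mean of the first i linear mapping features, traversed for i = 1..M.
    stacked = torch.stack(linear_features)                         # (M, D)
    sums = torch.cumsum(stacked, dim=0)                            # prefix sums over the sequence
    counts = torch.arange(1, len(linear_features) + 1).unsqueeze(-1)
    return list(sums / counts)                                     # M feature means

feature_means = cumulative_feature_means([torch.randn(512) for _ in range(4)])
```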
In step 10412, when M is less than or equal to N, the M feature mean is determined as the reference feature, and a reference feature group including M reference features is constructed.
In some embodiments, when the number M of feature means is less than or equal to the number N of reference features, all M feature means are determined as reference features, and a reference feature group including M reference features is constructed.
As an example, when M = 3 and N = 5, that is, M is smaller than N, the 3 feature means are each determined as a reference feature, and a reference feature group including the 3 reference features is constructed.
In step 10413, when M is greater than N, N features are selected from the M feature mean values as reference features, and a reference feature group including the N reference features is constructed.
In some embodiments, when the number M of the feature means is greater than the number N of the reference features, N feature means are selected from the M feature means as the reference features, and a reference feature group including the N reference features is constructed.
As an example, when M = 5 and N = 3, that is, M is greater than N, 3 feature means are selected from the 5 feature means as the reference features, and a reference feature group including the 3 reference features is constructed.
In some embodiments, the selecting N of the M feature means as the reference features may be implemented as follows: sorting the M feature means according to the order in which the feature means are obtained, so as to obtain a feature mean sequence; and selecting, starting from the last feature mean in the feature mean sequence, N feature means as the reference features.
In some embodiments, the order of obtaining the feature means may be as follows: the feature mean of the first 1 linear mapping feature is obtained earlier than the feature mean of the first 2 linear mapping features, which is obtained earlier than the feature mean of the first 3 linear mapping features, and so on.
In some embodiments, the earlier a feature mean is obtained, the earlier it appears in the feature mean sequence, and the later a feature mean is obtained, the later it appears in the feature mean sequence. That is, the order of the feature means in the feature mean sequence is consistent with the order of their acquisition times.
In some embodiments, the last feature mean in the feature mean sequence is the feature mean of the first M linear mapping features, that is, the last feature mean in the feature mean sequence is the mean of all linear mapping features, so that the accuracy of the last feature mean in the feature mean sequence is the highest.
In some embodiments, the method for obtaining multi-modal features provided by the embodiments of the present application may further store the reference features by: acquiring a memory cache area, wherein the memory cache area comprises N storage bits; and sequentially storing all the reference characteristics in the reference characteristic group into a memory cache area, wherein the storage bits in the memory cache area correspond to the reference characteristics one by one.
In some embodiments, the memory buffer may be an area of the memory for storing the reference feature, and the memory buffer includes N storage bits, each of which may store one reference feature.
It can be understood that the number of the reference features may be specifically set according to the capacity of the memory buffer, so as to ensure that each reference feature may be stored in the memory buffer, and facilitate subsequent comparison between the linear mapping feature and each reference feature in the memory buffer.
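As an illustrative, non-limiting sketch of the memory cache area, the following code models the N storage bits as a fixed-capacity first-in-first-out buffer, so that after sequential storage only the last N reference features remain; the deque-based implementation is an assumption for illustration:

```python
from collections import deque
import torch

N = 3
memory_buffer: deque[torch.Tensor] = deque(maxlen=N)   # N storage bits, one reference feature each

def store_reference_features(reference_features: list[torch.Tensor]) -> None:
    # Store sequentially: when more than N features are pushed, only the most recently
    # obtained N remain, matching the selection of the last N feature means above.
    for feature in reference_features:
        memory_buffer.append(feature)

store_reference_features([torch.randn(512) for _ in range(5)])
assert len(memory_buffer) == N
```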
In step 1042, each linear mapping feature is compared with each reference feature in the reference feature group, so as to obtain the similarity between each linear mapping feature and each reference feature in the reference feature group.
In some embodiments, the similarity between features may be a cosine similarity or a Euclidean metric, where the cosine similarity measures the consistency of direction between feature dimensions and is insensitive to differences in magnitude, whereas the Euclidean metric emphasizes the numerical differences between dimensions.
In some embodiments, the comparing may be a similarity calculation process, and the similarity between each linear mapping feature and each reference feature in the reference feature group is obtained by comparing each linear mapping feature with each reference feature in the reference feature group.
In step 1043, the weight of each linear mapping feature is determined based on the similarity of each linear mapping feature to each reference feature in the reference feature set.
In some embodiments, the weight of a linear mapping feature characterizes its proportion in the weighted summation of the plurality of linear mapping features; the larger the weight of the linear mapping feature, the larger its proportion in the weighted summation.
In some embodiments, referring to fig. 3D, fig. 3D is a flow chart diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 1043 shown in fig. 3D may be implemented by performing the following steps 10431 to 10433 for each linear mapping feature.
In step 10431, based on each similarity of the linear mapping features, a similarity index corresponding to each similarity is determined; and summing the similarity indexes to obtain a similarity index sum value.
In some embodiments, the expression of the similarity index corresponding to the similarity may be:
T_i = e^t    (2)

where T_i denotes the similarity index corresponding to the similarity, t denotes the similarity, and e is the natural constant (an irrational, transcendental number).
In step 10432, the ratio of each similarity index to the sum of similarity indexes is determined as the similarity probability corresponding to each similarity index.
In some embodiments, the expression for the similarity probability may be:
w_i = T_i / Σ_j T_j    (3)

where w_i denotes the similarity probability corresponding to the i-th similarity index, T_i denotes the similarity index corresponding to the similarity, and Σ_j T_j denotes the sum of the similarity indexes.
In step 10433, the maximum of the similar probabilities is determined as the weight of the linear mapping feature.
In some embodiments, the above step 10433 may be implemented as follows: the similarity probabilities are sorted to obtain the maximum value among the similarity probabilities, and the maximum value of the similarity probabilities is determined as the weight of the linear mapping feature.
In step 1044, at least two linear mapping features are weighted and summed based on the weight of each linear mapping feature to obtain an intermediate modal feature.
In some embodiments, the expression for the intermediate modality feature may be:
Y = Σ_i w_i Y_i    (4)

where Y denotes the intermediate modal feature, Y_i denotes each linear mapping feature, and w_i denotes the weight corresponding to the linear mapping feature.
In this way, the weights corresponding to the linear mapping features are determined, the linear mapping features are subjected to weighted summation based on the weights, and the intermediate modal features are obtained.
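As an illustrative, non-limiting sketch of steps 1042 to 1044, the following code assumes cosine similarity as the comparison, the exponential of the similarity as the similarity index (so that the softmax gives the similarity probabilities of formulas (2) and (3)), and the maximum similarity probability as the weight of each linear mapping feature:

```python
import torch
import torch.nn.functional as F

def intermediate_modal_feature(linear_features: list[torch.Tensor],
                               reference_features: list[torch.Tensor]) -> torch.Tensor:
    Y = torch.stack(linear_features)        # (M, D) linear mapping features
    R = torch.stack(reference_features)     # (N, D) reference feature group
    sims = F.cosine_similarity(Y.unsqueeze(1), R.unsqueeze(0), dim=-1)   # (M, N) similarities
    probs = torch.softmax(sims, dim=-1)     # exp(similarity) normalised per linear mapping feature
    weights = probs.max(dim=-1).values      # weight of each linear mapping feature, (M,)
    return (weights.unsqueeze(-1) * Y).sum(dim=0)   # weighted summation -> intermediate modal feature

Y_intermediate = intermediate_modal_feature([torch.randn(512) for _ in range(4)],
                                            [torch.randn(512) for _ in range(3)])
```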
In some embodiments, referring to fig. 3E, fig. 3E is a flow chart diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 104 shown in fig. 3E may be implemented by performing the following steps 1045 to 1048.
In step 1045, a reference feature set comprising at least two reference features is obtained.
In some embodiments, the number of reference features in the set of reference features is N, the number of linear mapping features is M, the M linear mapping features exist in a sequence of linear mapping features, and M and N are positive integers greater than or equal to 2.
In some embodiments, the sequence of linear mapping features includes M sequentially arranged linear mapping features.
In some embodiments, the reference feature is a feature mean of at least one linear mapping feature, for example, the reference feature may be the linear mapping feature itself, a feature mean of two linear mapping features, a feature mean of three linear mapping features, and the like.
In some embodiments, the reference feature set is a collection of a plurality of reference features, the reference feature set comprising at least two reference features arranged in a sequence.
In step 1046, each reference feature in the reference feature group is compared with each linear mapping feature, so as to obtain the similarity between each reference feature in the reference feature group and each linear mapping feature.
In some embodiments, the similarity between features may be a cosine similarity or a Euclidean metric; the cosine similarity measures the consistency of direction between feature dimensions and is insensitive to differences in magnitude, whereas the Euclidean metric emphasizes the numerical differences between dimensions.
In some embodiments, the comparing may be a similarity calculation process, and the similarity between each reference feature in the reference feature group and each linear mapping feature is obtained by comparing each linear mapping feature with each reference feature in the reference feature group.
In step 1047, a weight of each reference feature in the reference feature set is determined based on the similarity between each reference feature in the reference feature set and each linear mapping feature.
In some embodiments, the weight of a reference feature characterizes its proportion in the weighted summation of the plurality of reference features; the greater the weight of the reference feature, the greater its proportion in the weighted summation.
In some embodiments, referring to fig. 3F, fig. 3F is a flow diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 1047 illustrated in fig. 3F may be implemented by performing steps 10471 to 10474 as follows.
In step 10471, based on the similarities of the reference features, similarity indexes corresponding to the similarities are determined.
In some embodiments, the expression of the similarity index corresponding to the similarity may be:
Q_i = e^q    (5)

where Q_i denotes the similarity index corresponding to the similarity, q denotes the similarity, and e is the natural constant (an irrational, transcendental number).
In step 10472, the similarity indices are summed to obtain a similarity index sum value.
In some embodiments, the expression of the similarity index corresponding to the similarity may be:
T_i = e^t    (6)

where T_i denotes the similarity index corresponding to the similarity, t denotes the similarity, and e is the natural constant (an irrational, transcendental number).
In step 10473, the ratio of each similarity index to the sum of similarity indexes is determined as the similarity probability corresponding to each similarity index.
In some embodiments, the expression for the similarity probability may be:
w_i = T_i / Σ_j T_j    (7)

where w_i denotes the similarity probability corresponding to the i-th similarity index, T_i denotes the similarity index corresponding to the similarity, and Σ_j T_j denotes the sum of the similarity indexes.
In step 10474, the maximum value of the similarity probabilities is determined as the weight of the reference feature.
In some embodiments, the above step 10474 may be implemented as follows: sorting the similarity probabilities to obtain the maximum value among them, and determining the maximum value of the similarity probabilities as the weight of the reference feature.
In step 1048, at least two reference features are weighted and summed based on the weight of each reference feature to obtain an intermediate modal feature.
In some embodiments, the expression for the intermediate modality feature may be:
Y = ∑_i w_i Q_i    (8)
wherein Y characterizes the intermediate modal feature, Q_i characterizes each reference feature, and w_i characterizes the weight corresponding to the i-th reference feature.
In this way, the weights corresponding to the reference features are determined, the reference features are subjected to weighted summation based on the weights, and the intermediate modal features are obtained.
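As a concrete sketch of steps 10471 to 10474 and step 1048 (assuming cosine similarity as the comparison of step 1046 and PyTorch tensors; all names are illustrative, not the patent's own implementation):

```python
import torch
import torch.nn.functional as F

def intermediate_modal_feature(ref_feats: torch.Tensor,
                               lin_feats: torch.Tensor) -> torch.Tensor:
    # ref_feats: (N, D) reference features; lin_feats: (M, D) linear mapping features.
    # Similarity of every reference feature to every linear mapping feature (step 1046).
    sims = F.cosine_similarity(ref_feats.unsqueeze(1), lin_feats.unsqueeze(0), dim=-1)  # (N, M)
    sim_index = torch.exp(sims)                                  # eq. (5): similarity index e^q
    sim_prob = sim_index / sim_index.sum(dim=1, keepdim=True)    # eqs. (6)-(7): similarity probability
    weights = sim_prob.max(dim=1).values                         # step 10474: max probability as weight
    return (weights.unsqueeze(1) * ref_feats).sum(dim=0)         # eq. (8): weighted sum -> intermediate feature

# Example with 4 reference features and 3 linear mapping features of dimension 8.
inter_feat = intermediate_modal_feature(torch.randn(4, 8), torch.randn(3, 8))
```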
In step 105, the intermediate modal characteristics, the text modal characteristics, and the video modal characteristics are subjected to modal fusion to obtain multi-modal characteristics of the video-text pair.
In some embodiments, modality fusion refers to the comprehensive, computer-based processing of multi-modal data: it fuses the information of each modality so that the target task can be executed, effectively integrating the features of multiple modalities and drawing on the strengths of each modality to complete the information integration.
In some embodiments, the intermediate modal feature, the text modal feature and the video modal feature are subjected to modal fusion to obtain the multi-modal feature of the video-text pair. Because the intermediate modal feature participates in the modal fusion process, the cross-modal characterization capability of the generated multi-modal feature can be effectively improved, while the respective single-modal characterization capabilities of the video modal feature and the text modal feature in the video-text pair are effectively retained.
In some embodiments, referring to fig. 3B, fig. 3B is a flow chart diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. Step 105 shown in fig. 3B may be implemented by performing the following steps 1051 to 1053.
In step 1051, modality fusion is performed on the intermediate modality feature and the text modality feature to obtain a first fusion feature.
In some embodiments, the first fusion feature is obtained by performing modal fusion on the intermediate modal feature and the text modal feature, and the first fusion feature effectively fuses the text modal feature and the video modal feature.
In some embodiments, the Modal fusion is implemented by a Cross-Modal Encoder network (Cross-Modal Encoder) comprising a first fusion network and a second fusion network.
In some embodiments, the above step 1051 can be implemented as follows: calling a first fusion network to perform feature fusion on the intermediate modal features and the text modal features to obtain first multi-modal features; calling a second fusion network to perform feature fusion on the intermediate modal features and the first multi-modal features to obtain second multi-modal features; and performing feature splicing on the first multi-modal feature and the second multi-modal feature to obtain a first fusion feature.
In some embodiments, the first converged network and the second converged network have the same network structure, the first converged network is used for feature fusion of the intermediate modality features and the textual modality features, and the second converged network is used for feature fusion of the intermediate modality features and the first multimodal features.
In step 1052, modality fusion is performed on the intermediate modality feature and the video modality feature to obtain a second fusion feature.
In some embodiments, the above step 1052 may be implemented as follows: calling the first fusion network to perform feature fusion on the intermediate modal features and the video modal features to obtain third multi-modal features; calling a second fusion network to perform feature fusion on the intermediate modal features and the third multimodal features to obtain fourth multimodal features; and performing feature splicing on the third multi-modal feature and the fourth multi-modal feature to obtain a second fusion feature.
In some embodiments, the first fusion network and the second fusion network have the same network structure; here, the first fusion network is used for feature fusion of the intermediate modal features and the video modal features, and the second fusion network is used for feature fusion of the intermediate modal features and the third multi-modal features.
In step 1053, the first fusion feature and the second fusion feature are modality fused to obtain a multi-modal feature of the video-text pair.
In some embodiments, the step 1053 can be implemented as follows: calling the first fusion network, and performing feature fusion on the first fusion feature and the second fusion feature to obtain a fifth multi-modal feature; calling a second fusion network to perform feature fusion on the intermediate modal features and the fifth multimodal features to obtain sixth multimodal features; and performing feature splicing on the fifth multi-modal feature and the sixth multi-modal feature to obtain the multi-modal feature of the video-text pair.
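A structural sketch of steps 1051 to 1053 is given below. The cross-modal encoder's internal structure is not fixed by the text, so cross-attention blocks are used here only as an assumed realisation of "feature fusion", features are treated as token sequences of shape (batch, length, dim), and "splicing" is assumed to concatenate along the token dimension; all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    # One fusion network; cross-attention is only an assumed realisation of "feature fusion".
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query, context, context)   # query attends to context
        return self.norm(query + fused)

class CrossModalEncoder(nn.Module):
    # First and second fusion networks with the same structure (steps 1051-1053).
    def __init__(self, dim: int):
        super().__init__()
        self.first_fusion = FusionNetwork(dim)
        self.second_fusion = FusionNetwork(dim)

    def fuse(self, inter: torch.Tensor, modal: torch.Tensor) -> torch.Tensor:
        m_a = self.first_fusion(modal, inter)   # first / third multi-modal feature
        m_b = self.second_fusion(m_a, inter)    # second / fourth multi-modal feature
        return torch.cat([m_a, m_b], dim=1)     # spliced first / second fusion feature

    def forward(self, inter, text, video):
        f1 = self.fuse(inter, text)             # step 1051: first fusion feature
        f2 = self.fuse(inter, video)            # step 1052: second fusion feature
        m5 = self.first_fusion(f1, f2)          # step 1053: fifth multi-modal feature
        m6 = self.second_fusion(m5, inter)      # sixth multi-modal feature
        return torch.cat([m5, m6], dim=1)       # multi-modal feature of the video-text pair

# Example shapes: bridge (1, 4, 256), text tokens (1, 20, 256), video frames (1, 8, 256).
encoder = CrossModalEncoder(256)
multi_modal = encoder(torch.randn(1, 4, 256), torch.randn(1, 20, 256), torch.randn(1, 8, 256))
```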
In some embodiments, the video-text pairs comprise a plurality of video frame-text pairs, and the multi-modal features of the video-text pairs comprise frame multi-modal features of each video frame-text pair.
In some embodiments, after step 105 above, the cover frame of the video-text pair may be determined by: performing cover prediction on each video frame-text pair in the video-text pairs based on the frame multi-modal characteristics of each video frame-text pair to obtain the prediction probability of each video frame-text pair as a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as a cover frame-text pair.
In some embodiments, performing cover prediction on each video frame-text pair based on its frame multi-modal features to obtain the prediction probability that the pair is a cover frame-text pair may be implemented as follows: calling a fully connected layer to perform score prediction on the frame multi-modal features of each video frame-text pair, obtaining a prediction score of each video frame-text pair being a cover frame-text pair; and normalizing the prediction scores to obtain the prediction probability of each video frame-text pair being a cover frame-text pair.
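A minimal sketch of this cover-frame scoring step (a single fully connected layer followed by Softmax normalization; shapes and names are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CoverPredictor(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, 1)   # fully connected scoring layer

    def forward(self, frame_feats: torch.Tensor) -> int:
        # frame_feats: (num_frames, dim) frame multi-modal features of the video frame-text pairs.
        scores = self.fc(frame_feats).squeeze(-1)    # prediction score per video frame-text pair
        probs = torch.softmax(scores, dim=0)         # normalised prediction probabilities
        return int(probs.argmax())                   # index of the cover frame-text pair

cover_index = CoverPredictor(256)(torch.randn(8, 256))
```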
In some embodiments, after step 105 above, the answer to a video question may be determined as follows: obtaining a question for the video-text pair; calling a question-answer model to perform score calculation on the question based on the multi-modal features of the video-text pair, obtaining a probability value for each candidate answer to the question; and determining the answer corresponding to the maximum probability value as the answer to the obtained question of the video-text pair.
In some embodiments, a question-answer model is used to predict the probability that an acquired question corresponds to each answer.
In some embodiments, after step 105 above, the title of the video-text pair may be determined by: based on the frame multi-modal characteristics of each video frame-text pair, performing title prediction on each video frame-text pair in the video-text pair to obtain the prediction probability of each video frame-text pair as a title frame-text pair; and determining the text in the video frame-text pair corresponding to the maximum value in the prediction probability as the title of the video-text pair.
Thus, the video modal characteristics and the text modal characteristics in the video-text pair are spliced to obtain splicing characteristics, the splicing characteristics are subjected to linear mapping for multiple times to obtain multiple linear mapping characteristics, the intermediate modal characteristics are determined by combining the multiple linear mapping characteristics, and the multi-modal characteristics of the video-text pair are determined by combining the intermediate modal characteristics, the text modal characteristics and the video modal characteristics. In the process of generating the multi-modal characteristics, the intermediate modal characteristics and the single modal characteristics (namely the text modal characteristics and the video modal characteristics) are combined, so that the cross-modal characterization performance of the multi-modal characteristics is effectively improved while the single-modal characterization performance of each modal characteristic is kept to the maximum extent in the generated multi-modal characteristics.
In the following, an exemplary application of the embodiments of the present application in an application scenario of an actual video-text pair will be described.
In the field of Video-Text Pre-training, related technologies generate effective multi-modal representations (embedding) for video-text pairs by building cross-modal learning (Cross-Modal Learning) structures based on video-text interaction, as a training means for subsequent downstream tasks such as video-text retrieval (Video/Text Retrieval), video title generation (Video Captioning), video question answering (Video Question Answering), and the like.
The embodiment of the application addresses the difficulty, in cross-modal learning, of effectively balancing the independent modalities of a video-text pair against their fusion, and proposes a learnable bridge modality (namely the intermediate modality described above) as an intermediate medium for video-text interaction. This modality can effectively learn cross-modal information while effectively retaining the characterization capability of each individual modality.
In order to implement a more effective cross-modal learning manner, the embodiment of the present application proposes a concept of a "bridge modality", see fig. 4A, where the bridge modality representation is generated based on the content of the video modality and the text modality and the past sample representation in the cache, and in the training process, the video modality and the text modality do not directly interact with each other, but interact with the bridge modality respectively, so as to learn cross-modal information. The method not only can effectively learn the cross-modal interaction information, but also can keep the single-modal characterization capability of the video modality and the text modality.
In some embodiments, referring to fig. 4B, fig. 4B is a schematic diagram of a method for obtaining multi-modal features provided by an embodiment of the present application.
Firstly, a video-text pair is obtained, and the video-text pair is decomposed to obtain a video of the video-text pair and a text of the video-text pair.
And then, performing modal feature extraction on the video of the video-text pair to obtain video modal features, and performing modal feature extraction on the text of the video-text pair to obtain text modal features.
By way of example, modal feature extraction for the video of a video-text pair may be implemented as follows: decomposing the video of the video-text pair to obtain a video frame sequence; and encoding the video frame sequence to obtain the video modal features. The encoding of the video frame sequence may be implemented by a joint scaling network (EfficientNet), a video coding network (Video Swin), a three-dimensional convolutional network (C3D), or other encoding networks.
By way of example, modal feature extraction for the text of a video-text pair may be implemented as follows: decomposing the text of the video-text pair to obtain a word sequence; and encoding the word sequence to obtain the text modal features. The encoding of the word sequence may be implemented by a text recurrent convolutional network (Text-RCNN), a bidirectional encoder network (BERT), or other encoding networks.
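A sketch of the two single-modal encoders (the small placeholder backbones below stand in for EfficientNet / Video Swin / C3D on the video side and Text-RCNN / BERT on the text side; all module names, vocabulary size and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    # Placeholder frame encoder; in practice EfficientNet, Video Swin, C3D, etc.
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) -> video modal features v of shape (F, D)
        return self.backbone(frames)

class TextEncoder(nn.Module):
    # Placeholder word-sequence encoder; in practice Text-RCNN, BERT, etc.
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (1, seq_len) -> text modal features t of shape (1, L, D)
        out, _ = self.rnn(self.embed(token_ids))
        return out

video_feats = VideoEncoder()(torch.randn(8, 3, 224, 224))        # v: (F, D) = (8, 256)
text_feats = TextEncoder()(torch.randint(0, 30000, (1, 20)))     # t: (1, L, D) = (1, 20, 256)
```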
Referring to fig. 4B, the intermediate modal features are generated based on the video modal features and the text modal features. In the process of modal fusion, the video modal features and the text modal features no longer interact directly; instead, each interacts with the generated intermediate modal features. The dimension of the text modal features is set to t ∈ R^{L×D}, the dimension of the video modal features is v ∈ R^{F×D}, and the dimension of the intermediate modal features is b ∈ R^{B×D}, where D characterizes the feature length (identical for the text, video and intermediate modal features), and L, F and B respectively characterize the number of features of the corresponding modality.
Referring to fig. 4B, the intermediate modal features may be generated by means of a memory buffer: a memory buffer (Memory bank) of length N is established, M = [M_1, M_2, …, M_N] ∈ R^{N×D}, whose initial content is empty, where N characterizes the capacity of the buffer. Here, each M_j can be understood as a token of the bridge modality, and the bridge modality of a specific video-text input sample is formed by combining these tokens with certain weights.
The video modal features v ∈ R^{F×D} and the text modal features t ∈ R^{L×D} are spliced (via their D-dimensional base representations v_base and t_base) to obtain the splicing feature V = [v_base, t_base] ∈ R^{2D}. A series of learnable linear mapping layers g_i (typically multi-layer perceptrons, MLP) is provided, each of which maps the splicing feature V = [v_base, t_base] from the 2D dimension to the D dimension, yielding a candidate intermediate modal feature b̃_i = g_i(V). Here the subscript i = 1, …, B, indicating that B intermediate modal features will be generated. Each b̃_i is compared with M = [M_1, M_2, …, M_N] in the buffer by computing cosine similarities; after all cosine similarities are collected, a normalization calculation (Softmax) is performed to obtain the weight corresponding to each M_j. The final intermediate modal feature is then the weighted sum of all M_j with their weights.
As an example, the expression of the cosine similarity may be:
s_{i,j} = S_c(b̃_i, M_j)    (9)
wherein s_{i,j} characterizes the similarity between the i-th candidate intermediate modal feature and the j-th feature in the buffer, S_c characterizes the cosine similarity function, b̃_i characterizes the i-th candidate intermediate modal feature, and M_j characterizes the average feature stored at the j-th position of the buffer.
As an example, the expression for the normalization calculation may be:
p_{i,j} = Softmax_j(s_{i,j})    (10)
wherein Softmax_j characterizes the normalization function applied over j, s_{i,j} characterizes the similarity between the i-th candidate intermediate modal feature and the j-th feature in the buffer, and p_{i,j} characterizes the weight corresponding to M_j.
As an example, the expression for the intermediate modality feature may be:
b = pM    (11)
wherein b ∈ R^{B×D} characterizes the intermediate modal features, p characterizes the weight matrix formed by the p_{i,j}, and M characterizes the features stored in the memory buffer; that is, the i-th intermediate modal feature is b_i = ∑_j p_{i,j} M_j.
When the memory buffer is not yet full, each candidate intermediate modal feature b̃_i is adopted directly as the intermediate modal feature, and at the same time the average value of the b̃_i is written into the memory buffer; in this way the memory buffer M = [M_1, M_2, …, M_N] is gradually established, and thereby the intermediate modal features are established.
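Putting the memory-buffer procedure together, the following sketch follows equations (9) to (11) and the buffer-filling rule above (the MLP form of the mapping layers g_i, the mean pooling used to obtain v_base and t_base, and all names are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeModality(nn.Module):
    # Sketch of the memory-bank based bridge modality described above; B, N, D follow the text.
    def __init__(self, dim: int, num_bridge: int, bank_size: int):
        super().__init__()
        self.mappings = nn.ModuleList(                      # B learnable mappings g_i: 2D -> D
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_bridge)])
        self.register_buffer("bank", torch.zeros(bank_size, dim))   # memory buffer M
        self.register_buffer("filled", torch.tensor(0))             # number of used slots

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: (F, D) video modal features, t: (L, D) text modal features.
        splice = torch.cat([v.mean(dim=0), t.mean(dim=0)])           # V = [v_base, t_base] in R^{2D}
        cand = torch.stack([g(splice) for g in self.mappings])       # (B, D) candidate bridge features
        if self.filled < self.bank.size(0):                          # buffer not yet full
            self.bank[self.filled] = cand.mean(dim=0).detach()       # write the average value
            self.filled += 1
            return cand                                              # use the candidates directly
        sims = F.cosine_similarity(cand.unsqueeze(1), self.bank.unsqueeze(0), dim=-1)  # eq. (9)
        weights = torch.softmax(sims, dim=1)                         # eq. (10)
        return weights @ self.bank                                   # eq. (11): b = pM

bridge = BridgeModality(dim=256, num_bridge=4, bank_size=64)
b = bridge(torch.randn(8, 256), torch.randn(20, 256))                # (4, 256) intermediate modal features
```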
In some embodiments, referring to fig. 4C, fig. 4C is a schematic diagram of a method for obtaining multi-modal features provided by an embodiment of the present application. And performing linear mapping on the text modal characteristics and the video modal characteristics to obtain a plurality of fusion modal characteristics, writing the average value of the fusion modal characteristics into a data buffer area, and performing normalization calculation on the fusion modal characteristics and the average value in the data buffer area to obtain intermediate modal characteristics.
The bridge modality (i.e., the intermediate modal features described above) established in the embodiment of the application serves as an intermediate medium: it interacts with the video modal features and the text modal features respectively to learn effective cross-modal information, while avoiding direct interaction between the video modal features and the text modal features, so that the individual characterization capability of each modality is retained to the maximum extent. The embodiment of the application can thus prompt the multi-modal pre-training model to learn more effective multi-modal representations (embedding), bringing better results to downstream tasks (such as video-text retrieval, video title generation, video question answering and the like).
In the embodiments of the present application, the single-modal encoder may use any model that meets the requirements, including but not limited to the Text-RCNN, BERT, C3D, EfficientNet and Video Swin mentioned above. The bridge modality may be constructed in multiple ways, of which the memory-module-based manner described above is one of the preferable generation manners. In practical applications, the method is not limited to the exemplified text modal features and video modal features, and may also be extended to more modalities (such as audio modal features) as needed.
It is understood that, in the embodiments of the present application, data related to video-text pairs and the like needs to be approved or consented to by the user when the embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
Continuing with the exemplary structure of the multi-modal feature retrieving device 255 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the multi-modal feature retrieving device 255 of the memory 250 may include: the feature extraction module 2551 is configured to perform modal feature extraction on a video in the video-text pair to obtain a video modal feature, and perform modal feature extraction on a text in the video-text pair to obtain a text modal feature; a splicing module 2552, configured to splice the video modal features and the text modal features to obtain splicing features; the linear mapping module 2553 is configured to perform linear mapping on the splicing features at least twice to obtain at least two linear mapping features; a determining module 2554 for determining intermediate modal characteristics in combination with the at least two linear mapping characteristics; and the modal fusion module 2555 is configured to perform modal fusion on the intermediate modal feature, the text modal feature and the video modal feature to obtain a multi-modal feature of the video-text pair.
In some embodiments, the linear mapping module 2553 is further configured to invoke at least two linear mapping networks, and perform linear mapping on the splicing features respectively to obtain at least two linear mapping features, where the linear mapping features and the linear mapping networks are in a one-to-one correspondence; the feature dimensions of each linear mapping feature are the same, and feature elements contained in different linear mapping features are different.
In some embodiments, the determining module 2554 is further configured to obtain a reference feature group including at least two reference features; comparing each linear mapping characteristic with each reference characteristic in the reference characteristic group respectively to obtain the similarity of each linear mapping characteristic and each reference characteristic in the reference characteristic group; determining the weight of each linear mapping feature based on the similarity of each linear mapping feature and each reference feature in the reference feature group; and weighting and summing at least two linear mapping characteristics based on the weight of each linear mapping characteristic to obtain the intermediate modal characteristic.
In some embodiments, the number of reference features in the reference feature set is N, the number of linear mapping features is M, the M linear mapping features exist as a sequence of linear mapping features, M and N are both positive integers greater than or equal to 2; the determining module 2554 is further configured to obtain a feature mean value of the first i linear mapping features in the linear mapping feature sequence, and traverse i to obtain M feature mean values; when M is smaller than or equal to N, determining the M feature mean values as reference features, and constructing a reference feature group comprising M reference features; when M is larger than N, selecting N from the M feature mean values as reference features, and constructing a reference feature group comprising N reference features; wherein i is a positive integer less than or equal to M.
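A minimal sketch of how the reference feature group might be built from the linear mapping feature sequence (following the prefix feature means described here and, when M exceeds N, the "last N" selection described by the obtaining module below; names are illustrative):

```python
import torch

def build_reference_group(lin_feats: torch.Tensor, bank_size: int) -> torch.Tensor:
    # lin_feats: (M, D) sequence of linear mapping features; bank_size: N storage bits.
    prefix_sums = torch.cumsum(lin_feats, dim=0)                    # running sums over the sequence
    counts = torch.arange(1, lin_feats.size(0) + 1).unsqueeze(1)    # 1, 2, ..., M
    prefix_means = prefix_sums / counts                             # mean of the first i features, i = 1..M
    if prefix_means.size(0) <= bank_size:                           # M <= N: all M means become references
        return prefix_means
    return prefix_means[-bank_size:]                                # M > N: keep the last N feature means

reference_group = build_reference_group(torch.randn(10, 256), bank_size=8)   # -> (8, 256)
```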
In some embodiments, the apparatus for acquiring multi-modal characteristics further includes: the acquisition module is used for acquiring a memory cache region, and the memory cache region comprises N storage bits; and sequentially storing all the reference characteristics in the reference characteristic group into a memory cache area, wherein the storage bits in the memory cache area correspond to the reference characteristics one by one.
In some embodiments, the obtaining module is further configured to sort the M feature mean values according to the sequence of obtaining the feature mean values, so as to obtain a feature mean value sequence; and selecting N characteristic mean values as reference characteristics from the last characteristic mean value in the characteristic mean value sequence.
In some embodiments, the determining module 2554 is further configured to perform the following processing for each linear mapping feature: determining similarity indexes corresponding to the similarities based on the similarities of the linear mapping characteristics; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the similarity index sum value as the similarity probability corresponding to each similarity index; and determining the maximum value in the similar probabilities as the weight of the linear mapping characteristic.
In some embodiments, the determining module 2554 is further configured to obtain a reference feature group including at least two reference features; comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain the similarity of each reference feature in the reference feature group and each linear mapping feature; determining the weight of each reference feature in the reference feature group based on the similarity of each reference feature in the reference feature group and each linear mapping feature; and carrying out weighted summation on at least two reference characteristics based on the weight of each reference characteristic to obtain the intermediate modal characteristics.
In some embodiments, the determining module 2554 is further configured to perform the following processing for each reference feature in the reference feature group: determining similarity indexes corresponding to the similarities based on the similarities of the reference features; summing the similarity indexes to obtain a similarity index sum value; determining the ratio of each similarity index to the similarity index sum value as the similarity probability corresponding to each similarity index; and determining the maximum value of the similarity probability as the weight of the reference feature.
In some embodiments, the feature extraction module 2551 is further configured to perform modal feature extraction on each video frame in the video, to obtain a frame modal feature of each video frame; and performing feature fusion on the modal features of the frames to obtain the modal features of the video.
In some embodiments, the modal fusion module 2555 is further configured to perform modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature; performing modal fusion on the intermediate modal characteristics and the video modal characteristics to obtain second fusion characteristics; and performing modal fusion on the first fusion characteristic and the second fusion characteristic to obtain a multi-modal characteristic of the video-text pair.
In some embodiments, the modality fusion is achieved by a cross-modality encoding network, the cross-modality encoding network including a first fusion network and a second fusion network; the modal fusion module 2555 is further configured to invoke a first fusion network, perform feature fusion on the intermediate modal feature and the text modal feature, and obtain a first multi-modal feature; calling a second fusion network to perform feature fusion on the intermediate modal features and the first multi-modal features to obtain second multi-modal features; and performing feature splicing on the first multi-modal feature and the second multi-modal feature to obtain a first fusion feature.
In some embodiments, the video-text pairs comprise a plurality of video frame-text pairs, the multi-modal features of the video-text pairs comprising frame multi-modal features of each video frame-text pair; the device for acquiring multi-modal features further comprises: the cover determining module is used for performing cover prediction on each video frame-text pair in the video-text pairs based on the frame multi-mode characteristics of each video frame-text pair to obtain the prediction probability of each video frame-text pair as a cover frame-text pair; and determining the video frame-text pair corresponding to the maximum value in the prediction probability as a cover frame-text pair.
Embodiments of the present application provide a computer program product comprising a computer program or computer executable instructions stored in a computer readable storage medium. The processor of the computer device reads the computer-executable instructions from the computer-readable storage medium, and the processor executes the computer-executable instructions, so that the computer device executes the method for acquiring the multi-modal features described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, cause the processor to perform a method for obtaining multi-modal features provided by embodiments of the present application, for example, the method for obtaining multi-modal features as shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
To sum up, the embodiment of the application has the following beneficial effects:
(1) The method comprises the steps of splicing video modal characteristics and text modal characteristics in a video-text pair to obtain splicing characteristics, carrying out linear mapping on the splicing characteristics for multiple times to obtain multiple linear mapping characteristics, determining intermediate modal characteristics by combining the multiple linear mapping characteristics, and determining multi-modal characteristics of the video-text pair by combining the intermediate modal characteristics, the text modal characteristics and the video modal characteristics. In the process of generating the multi-modal characteristics, the intermediate modal characteristics and the single modal characteristics (namely the text modal characteristics and the video modal characteristics) are combined, so that the cross-modal characterization performance of the multi-modal characteristics is effectively improved while the single-modal characterization performance of each modal characteristic is kept to the maximum extent in the generated multi-modal characteristics.
(2) The method comprises the steps of extracting modal characteristics of a video and a text in a video-text pair respectively to obtain text modal characteristics and video modal characteristics correspondingly, so that the multi-modal characteristics corresponding to the video-text pair can be accurately determined through the text modal characteristics and the video modal characteristics subsequently, and the multi-modal characteristics determined subsequently can accurately reflect the video and the text in the video-text pair due to the fact that the text modal characteristics and the video modal characteristics have the characterization capabilities of the text and the video which correspond to each other.
(3) The video modal characteristics and the text modal characteristics are spliced to obtain the splicing characteristics, so that the information of the video modal characteristics and the information of the text modal characteristics are effectively fused by the splicing characteristics, the subsequent determination of the intermediate modal characteristics is facilitated, the text modal characteristics and the video modal characteristics are effectively fused by the intermediate modal characteristics, and the subsequent determination of the multi-modal characteristics of the video-text pair can effectively represent the information of videos and texts.
(4) The dimension adjustment of the splicing features is realized by performing linear mapping on the splicing features to obtain linear mapping features, so that the dimension of the obtained linear mapping features is consistent with the dimension of the text modal features and the dimension of the video modal features. The linear mapping feature corresponding to each linear mapping is obtained by performing linear mapping on the splicing features multiple times, which ensures the feature diversity of the obtained linear mapping features and effectively guarantees the accuracy of the subsequently determined intermediate modal features.
(5) The intermediate modal features are determined by combining the generated at least two linear mapping features; because the feature elements contained in different linear mapping features differ, feature diversity is effectively ensured, so that the determined intermediate modal features can accurately represent the cross-modal interaction information of the video modal features and the text modal features.
(6) The quantity of reference features can be set according to the capacity of the memory buffer, so that each reference feature can be stored in the memory buffer; this makes it convenient to subsequently compare the linear mapping features with the reference features in the memory buffer and to read the reference features from the memory buffer, and the intermediate modal features are then determined using the reference features, guaranteeing the accuracy of the determined intermediate modal features.
(7) The weight corresponding to each linear mapping feature is determined, and the linear mapping features are weighted and summed based on these weights to obtain the intermediate modal feature; because each linear mapping feature fuses the video modal feature and the text modal feature from a different angle, weighting and summing them according to the weights allows the obtained intermediate modal feature to accurately reflect the information of the video modal feature and the text modal feature.
(8) The intermediate modal features, the text modal features and the video modal features are subjected to modal fusion to obtain the multi-modal features of the video-text pair; because the intermediate modal features participate in the modal fusion process, the cross-modal characterization capability of the generated multi-modal features can be effectively improved while the respective single-modal characterization capabilities of the video modal features and the text modal features in the video-text pair are effectively retained.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for obtaining multi-modal features, the method comprising:
performing modal feature extraction on a video in a video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature;
splicing the video modal characteristics and the text modal characteristics to obtain spliced characteristics;
performing linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics;
determining an intermediate modal characteristic in combination with the at least two linear mapping characteristics;
and carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
2. The method of claim 1, wherein the linearly mapping the stitching feature at least twice to obtain at least two linearly mapped features comprises:
calling at least two linear mapping networks, and respectively performing linear mapping on the splicing characteristics to obtain at least two linear mapping characteristics, wherein the linear mapping characteristics and the linear mapping networks are in one-to-one correspondence;
the feature dimensions of each linear mapping feature are the same, and feature elements contained in different linear mapping features are different.
3. The method according to claim 1, wherein said determining intermediate modal characteristics in combination with said at least two linear mapping characteristics comprises:
acquiring a reference feature group comprising at least two reference features;
comparing each linear mapping feature with each reference feature in the reference feature group respectively to obtain the similarity of each linear mapping feature and each reference feature in the reference feature group;
determining the weight of each linear mapping feature based on the similarity of each linear mapping feature and each reference feature in the reference feature group;
and performing weighted summation on the at least two linear mapping characteristics based on the weight of each linear mapping characteristic to obtain the intermediate modal characteristic.
4. The method of claim 3, wherein the number of reference features in the reference feature set is N, the number of linear mapping features is M, M of the linear mapping features exist in a linear mapping feature sequence, and M and N are both positive integers greater than or equal to 2;
obtaining a reference feature set comprising at least two reference features, comprising:
acquiring the feature mean values of the first i linear mapping features in the linear mapping feature sequence, and traversing the i to obtain M feature mean values;
when the M is smaller than or equal to the N, determining the M feature mean values as the reference features, and constructing a reference feature group comprising the M reference features;
when M is larger than N, selecting N from the M feature mean values as the reference features, and constructing a reference feature group comprising the N reference features;
wherein i is a positive integer less than or equal to M.
5. The method of claim 4, further comprising:
acquiring a memory cache area, wherein the memory cache area comprises N storage bits;
and sequentially storing each reference feature in the reference feature group into the memory cache area, wherein the storage bits in the memory cache area correspond to the reference features one to one.
6. The method according to claim 4, wherein said selecting N of the M feature means as the reference features comprises:
sequencing the M characteristic mean values according to the sequence of obtaining the characteristic mean values to obtain a characteristic mean value sequence;
and selecting N characteristic mean values as the reference characteristics from the last characteristic mean value in the characteristic mean value sequence.
7. The method of claim 3, wherein determining the weight of each of the linearly mapped features based on the similarity of each of the linearly mapped features to each of the reference features in the reference feature set comprises:
performing the following processing for each linear mapping feature respectively:
determining similarity indexes corresponding to the similarity degrees respectively based on the similarity degrees of the linear mapping characteristics; summing the similarity indexes to obtain a similarity index sum value;
determining the ratio of each similarity index to the similarity index sum value as a similarity probability corresponding to each similarity index;
and determining the maximum value in the similarity probability as the weight of the linear mapping characteristic.
8. The method according to claim 1, wherein said determining intermediate modal characteristics in combination with said at least two linear mapping characteristics comprises:
acquiring a reference feature group comprising at least two reference features;
comparing each reference feature in the reference feature group with each linear mapping feature respectively to obtain the similarity between each reference feature in the reference feature group and each linear mapping feature;
determining the weight of each reference feature in the reference feature group based on the similarity of each reference feature in the reference feature group and each linear mapping feature;
and carrying out weighted summation on the at least two reference features based on the weight of each reference feature to obtain the intermediate modal feature.
9. The method of claim 8, wherein determining the weight of each of the reference features in the reference feature set based on the similarity of each of the reference features in the reference feature set to each of the linear mapping features comprises:
performing the following processing for each of the reference features in the reference feature group:
determining similarity indexes corresponding to the similarity degrees respectively based on the similarity degrees of the reference features;
summing the similarity indexes to obtain a similarity index sum value;
determining the ratio of each similarity index to the similarity index sum value as a similarity probability corresponding to each similarity index;
and determining the maximum value of the similarity probability as the weight of the reference feature.
10. The method according to claim 1, wherein performing modal feature extraction on the video in the video-text pair to obtain video modal features comprises:
modal feature extraction is carried out on each video frame in the video respectively to obtain the frame modal feature of each video frame;
and performing feature fusion on the frame modal features to obtain the video modal features.
11. The method according to claim 1, wherein the performing modal fusion on the intermediate modal feature, the text modal feature, and the video modal feature to obtain a multi-modal feature of the video-text pair comprises:
performing modal fusion on the intermediate modal characteristics and the text modal characteristics to obtain first fusion characteristics;
performing modal fusion on the intermediate modal characteristics and the video modal characteristics to obtain second fusion characteristics;
and performing modal fusion on the first fusion characteristic and the second fusion characteristic to obtain a multi-modal characteristic of the video-text pair.
12. The method according to claim 11, wherein the modal fusion is achieved through a cross-modal encoding network, the cross-modal encoding network comprising a first fusion network and a second fusion network;
the performing modal fusion on the intermediate modal feature and the text modal feature to obtain a first fusion feature includes:
calling the first fusion network, and performing feature fusion on the intermediate modal features and the text modal features to obtain first multi-modal features;
calling the second fusion network, and performing feature fusion on the intermediate modal features and the first multi-modal features to obtain second multi-modal features;
and performing feature splicing on the first multi-modal feature and the second multi-modal feature to obtain the first fusion feature.
13. The method of claim 1, wherein the video-text pairs comprise a plurality of video frame-text pairs, and wherein the multi-modal features of the video-text pairs comprise frame multi-modal features of each of the video frame-text pairs; after the modality fusion is performed on the intermediate modality feature, the text modality feature and the video modality feature to obtain the multi-modality feature of the video-text pair, the method further includes:
performing cover prediction on each video frame-text pair in the video-text pairs based on frame multi-modal characteristics of each video frame-text pair to obtain prediction probability of each video frame-text pair being a cover frame-text pair;
and determining the video frame-text pair corresponding to the maximum value in the prediction probability as the cover frame-text pair.
14. An apparatus for obtaining multi-modal features, the apparatus comprising:
the feature extraction module is used for performing modal feature extraction on a video in a video-text pair to obtain a video modal feature, and performing modal feature extraction on a text in the video-text pair to obtain a text modal feature;
the splicing module is used for splicing the video modal characteristics and the text modal characteristics to obtain splicing characteristics;
the linear mapping module is used for performing linear mapping on the splicing characteristics at least twice to obtain at least two linear mapping characteristics;
a determination module for determining intermediate modal characteristics in combination with the at least two linear mapping characteristics;
and the modal fusion module is used for carrying out modal fusion on the intermediate modal characteristics, the text modal characteristics and the video modal characteristics to obtain the multi-modal characteristics of the video-text pair.
15. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the method of obtaining multimodal features of any of claims 1 to 13 when executing executable instructions or computer programs stored in the memory.
CN202210994209.9A 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment Active CN115392365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210994209.9A CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210994209.9A CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115392365A true CN115392365A (en) 2022-11-25
CN115392365B CN115392365B (en) 2024-04-26

Family

ID=84121239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210994209.9A Active CN115392365B (en) 2022-08-18 2022-08-18 Multi-mode feature acquisition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115392365B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821430A (en) * 2023-04-18 2023-09-29 上海百秋智尚网络服务有限公司 Method and system for realizing customer service matching by utilizing multi-mode algorithm

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN112069884A (en) * 2020-07-28 2020-12-11 中国传媒大学 Violent video classification method, system and storage medium
US20210201147A1 (en) * 2018-11-28 2021-07-01 Tencent Technology (Shenzhen) Company Limited Model training method, machine translation method, computer device, and storage medium
CN113392270A (en) * 2020-10-30 2021-09-14 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN113569042A (en) * 2021-01-26 2021-10-29 腾讯科技(深圳)有限公司 Text information classification method and device, computer equipment and storage medium
CN113626719A (en) * 2021-10-12 2021-11-09 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment, storage medium and computer program product
WO2021238586A1 (en) * 2020-05-27 2021-12-02 华为技术有限公司 Training method and apparatus, device, and computer readable storage medium
CN113762052A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Video cover extraction method, device, equipment and computer readable storage medium
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN114332679A (en) * 2021-12-07 2022-04-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment, storage medium and computer program product
CN114373443A (en) * 2022-01-14 2022-04-19 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114419515A (en) * 2022-01-26 2022-04-29 腾讯科技(深圳)有限公司 Video processing method, machine learning model training method, related device and equipment


Also Published As

Publication number Publication date
CN115392365B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN110472002B (en) Text similarity obtaining method and device
CN113762052A (en) Video cover extraction method, device, equipment and computer readable storage medium
CN116310667B (en) Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN113407663B (en) Image-text content quality identification method and device based on artificial intelligence
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN112015896A (en) Emotion classification method and device based on artificial intelligence
CN111783429B (en) Information processing method, information processing device, electronic equipment and storage medium
CN111291221B (en) Method and device for generating semantic description for data source and electronic device
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN115169333A (en) Text entity identification method, device, equipment, storage medium and program product
CN115392365B (en) Multi-mode feature acquisition method and device and electronic equipment
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
CN114529761A (en) Video classification method, device, equipment, medium and product based on classification model
CN116431827A (en) Information processing method, information processing device, storage medium and computer equipment
CN115145980B (en) Dialogue reply generation method and device, electronic equipment and storage medium
CN114333069B (en) Object posture processing method, device, equipment and storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN112052680B (en) Question generation method, device, equipment and storage medium
CN116230146A (en) Data processing method, training method of ICD (ICD coding) model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant