CN114612826A - Video and text similarity determination method and device, electronic equipment and storage medium


Info

Publication number
CN114612826A
CN114612826A
Authority
CN
China
Prior art keywords
video
text
information
local
global
Prior art date
Legal status
Pending
Application number
CN202210234257.8A
Other languages
Chinese (zh)
Inventor
舒畅
陈又新
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210234257.8A priority Critical patent/CN114612826A/en
Priority to PCT/CN2022/090656 priority patent/WO2023168818A1/en
Publication of CN114612826A publication Critical patent/CN114612826A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention belongs to the field of artificial intelligence, and provides a method and a device for determining similarity between a video and a text, an electronic device and a storage medium. The method comprises the following steps: acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information; inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information; respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features; inputting the global features and the local features as common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing; and determining the similarity between the video and the text information according to the video features and the text features. The video and the text are thereby converted into the same contrast space, and similarity calculation is performed between the two different kinds of objects, so that a target video can be obtained by text matching.

Description

Video and text similarity determination method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for determining similarity between a video and a text, electronic equipment and a storage medium.
Background
Image videos and text information do not belong to the same expression space, which makes similarity comparison between them difficult; for example, it is hard to search for a video by entering text. The existing approach is to extract video features with a video model such as a 3D ImageNet model or the like, extract text features with BERT (Bidirectional Encoder Representation from Transformers), and then compute cosine similarity, evaluating how similar the video features and text features are by the cosine of the angle between them. However, this approach lacks interpretability and is scientifically unsound, so a new way of calculating video and text similarity needs to be provided.
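For illustration, the prior-art pipeline criticised above can be sketched as follows; the pretrained models, dimensions and function names are assumptions for exposition, not part of the invention.

```python
# Hedged sketch of the existing approach: separate video and text encoders,
# then cosine similarity between the two feature vectors. The 3D video model
# is stubbed out; the BERT usage assumes the Hugging Face transformers library.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def text_feature(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[:, 0]                             # [CLS] vector, (1, 768)

def video_feature(frames: torch.Tensor) -> torch.Tensor:
    # Placeholder for a 3D video model: average the per-frame features and
    # project them to the text dimension, purely for illustration.
    proj = torch.nn.Linear(frames.shape[-1], 768)
    return proj(frames.mean(dim=0)).unsqueeze(0)    # (1, 768)

sim = F.cosine_similarity(video_feature(torch.rand(16, 512)),   # 16 frame features
                          text_feature("a person rides a bicycle in the park"))
```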
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a method, a device, electronic equipment and a storage medium for determining similarity of videos and texts, which can convert the videos and the texts into the same contrast space and realize similarity calculation between two different things.
In a first aspect, an embodiment of the present invention provides a method for determining similarity between a video and a text, including:
acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global characteristics and local characteristics;
the global features and the local features are used as common input and input into a Contextual Transformer model, and video features and text features are obtained through feature splicing processing, wherein the video features correspond to the video, and the text features correspond to the text information;
and determining the similarity between the video and the text information according to the video characteristics and the text characteristics.
In some embodiments, the encoding the video and the text information to obtain encoding characteristic information includes:
segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
respectively coding the video segments and the text segments to obtain video local coding information and text local coding information;
and respectively coding the video and the text information to obtain video global coding information and text global coding information.
In some embodiments, said segmenting said video and said textual information comprises:
the video is cut into N video segments according to a preset segmentation mode;
and extracting a plurality of text sentences in each video segment as text segments corresponding to the video segments.
In some embodiments, the separately encoding the video segment and the text segment to obtain video partial coding information and text partial coding information includes:
extracting image frames from the video clips, and coding the image frames through a video coder to obtain local video coding information corresponding to the video clips;
and inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain text local encoding information corresponding to the text segment.
In some embodiments, the separately encoding the video and the text information to obtain video global encoding information and text global encoding information includes:
inputting the video into a video encoder for encoding processing to obtain video global encoding information;
and inputting the text information into a text encoder for encoding to obtain text global encoding information.
In some embodiments, the improved T-Transformer model includes several attention networks, each of which consists of one DMAN, one SAN and one FFN stacked in sequence.
In some embodiments, the attention function A_M(Q, K, V) of the DMAN is defined as follows:

A_M(Q, K, V) = S_M(Q, K)V

S_M(Q, K)_{ij} = M_{ij}·exp(Q_iK_j^T/√d_k) / Σ_l M_{il}·exp(Q_iK_l^T/√d_k)

where Q represents the query, K represents the key, V represents the value, Q, K and V have consistent vector dimensions, M represents a dynamic mask matrix, S represents a softmax function, i indexes the i-th query in Q, j indexes the j-th key in K, and d_k represents the vector dimension of the K vector.
In some embodiments, the improved T-Transformer model includes a video T-Transformer model and a text T-Transformer model, parameters of the video T-Transformer model and the text T-Transformer model are independent of each other, parameters are shared among the plurality of video T-Transformer models, and parameters are shared among the plurality of text T-Transformer models.
In some embodiments, the Attention-FA module includes a global processing module and a local processing module; the step of inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features includes:
inputting the global information into the global processing module to obtain global features, wherein the global features comprise video global features and text global features;
and inputting the local information into the local processing module to obtain local features, wherein the local features comprise video local features and text local features.
In some embodiments, the global processing module comprises a video global Attention-FA module and a text global Attention-FA module, and the local processing module comprises N video local Attention-FA modules and N text local Attention-FA modules, wherein N is a positive integer.
In some embodiments, the inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining the video features and the text features through a feature splicing process includes:

inputting the local features serving as Local Context into a preset Transformer model, and performing a maximum pooling operation on the output results to obtain the local feature vector F_local;

inputting the global features serving as Global Context into a preset Transformer model to obtain the global feature vector F_cross;

performing feature splicing on F_local and F_cross to obtain video features and text features.
In some embodiments, further comprising:
constructing a loss function according to the video features and the text features output by the Contextual Transformer model;
optimizing the improved T-Transformer model using the loss function, the loss function being represented as follows:
L(P, N, α) = max(0, α + D(x, y) − D(x′, y)) + max(0, α + D(x, y) − D(x, y′))

D(x, y) = 1 − x^T y / (‖x‖‖y‖)

where x represents the video feature output by the Contextual Transformer model, y represents the text feature output by the Contextual Transformer model, x′ and y′ represent negative samples of x and y, and α is a constant parameter.
In some embodiments, further comprising:
acquiring a text to be retrieved;
and inputting the text to be retrieved into the optimized improved T-Transformer model to obtain a target video matched with the text to be retrieved.
In some embodiments, before acquiring the video and the corresponding text information, the method further includes:
marking the video, and taking the marked video as the video for training;
and marking text information according to the marking mode of the video, and taking the marked text information as text information for training.
In a second aspect, an embodiment of the present invention provides an apparatus for determining similarity between a video and a text, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video and corresponding text information and coding the video and the text information to obtain coding characteristic information, and the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
the first processing unit is used for inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
the second processing unit is used for respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global characteristics and local characteristics;
the context processing unit is used for inputting the global features and the local features into a Contextual Transformer model as common input, and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video, and the text features correspond to the text information;
and the similarity calculation unit is used for determining the similarity between the video and the text information according to the video characteristics and the text characteristics.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the video and text similarity determination method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program for executing the video and text similarity determination method according to the first aspect.
The embodiment of the invention at least has the following beneficial effects: acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information; inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN; respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features; inputting the global features and the local features as common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video, and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features. In this manner, the video and the text can be converted into the same contrast space and similarity calculation can be performed between two different kinds of objects; for example, even if a video has no label, title or the like, it can still be retrieved by entering text.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
Fig. 1 is a flowchart of a method for determining similarity between video and text according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an encoding process provided by another embodiment of the present invention;
FIG. 3 is a flow chart for partitioning video and text information provided by another embodiment of the present invention;
FIG. 4 is a flow diagram of a process for encoding video segments and text segments according to another embodiment of the present invention;
FIG. 5 is a flow chart of a process for encoding video and text information according to another embodiment of the present invention;
FIG. 6 is a flow diagram of a model connection provided by another embodiment of the present invention;
FIG. 7 is a flow chart for obtaining global and local features provided by another embodiment of the present invention;
FIG. 8 is a block diagram of video features and text features obtained by feature stitching according to another embodiment of the present invention;
FIG. 9 is a flow diagram of optimizing an improved T-Transformer model provided by another embodiment of the present invention;
FIG. 10 is a flow chart of labeling a sample for training provided by another embodiment of the present invention;
fig. 11 is a block diagram of a similarity determination apparatus according to another embodiment of the present invention;
fig. 12 is a device diagram of an electronic apparatus according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The invention provides a method and a device for determining similarity between a video and a text, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information; inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a Dynamic Mask Attention Network (DMAN), a Self-Attention Network (SAN) and a Feedforward Neural Network (FFN); respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features; inputting the global features and the local features as common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video, and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features. In this manner, the video and the text can be converted into the same contrast space and similarity calculation can be performed between two different kinds of objects; for example, even if a video has no label, title or the like, it can still be retrieved by entering text.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction devices, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The terminal mentioned in the embodiment of the present invention may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted computer, a smart home, a wearable electronic device, a VR (Virtual Reality)/AR (Augmented Reality) device, and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform, and the like.
It should be noted that the data in the embodiment of the present invention may be stored in a server, and the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, a content distribution network, and a big data and artificial intelligence platform.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between a person and a computer using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
As shown in fig. 1, fig. 1 is a flowchart of a video and text similarity determining method according to an embodiment of the present invention, where the video and text similarity determining method includes, but is not limited to, the following steps:
step S110, acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
step S120, inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
step S130, inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global characteristics and local characteristics;
step S140, inputting the global features and the local features into a Contextual Transformer model as common input, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video, and the text features correspond to the text information;
and S150, determining the similarity between the video and the text information according to the video characteristics and the text characteristics.
The model idea of the method considers both global information and local information of the video and the text, performs feature coding on the two kinds of information with a collaborative hierarchical Transformer model, and improves on the traditional Transformer model with two variants, a Temporal Transformer (T-Transformer) and a Contextual Transformer. The embodiment of the invention is applied to similarity judgment between video and text information: when the model is trained, the video used for training and the text information corresponding to the video are determined according to step S110; the video and the text information are coded and then pass in sequence through the T-Transformer model, the Attention-FA module and the Contextual Transformer model, which output the video features and the text features, thereby providing a reference for the similarity between the video and the text information. In some cases, the trained model can then be used to retrieve the associated video from a text query.
Specifically, referring to fig. 2, obtaining the encoding characteristic information in step S110 may be implemented by:
step S111, segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
step S112, respectively carrying out coding processing on the video segments and the text segments to obtain video local coding information and text local coding information;
and step S113, respectively coding the video and the text information to obtain video global coding information and text global coding information.

In order to obtain the video local coding information, the video global coding information, the text local coding information and the text global coding information, two kinds of processing are performed on the video and the text information. In the first, the whole video is coded to obtain the video global coding information, and the whole text information is coded to obtain the text global coding information. In the second, the video is divided into N video segments, which serve as local information and are coded respectively to obtain N pieces of video local coding information; since the text information corresponds to the video, after the video is segmented the corresponding text information is also divided according to the video segmentation mode to obtain N text segments, each consisting of several text sentences, and the N text segments are coded respectively to obtain N pieces of text local coding information.
It can be understood that the video may be divided in different ways, for example into N parts according to equal duration or equal total number of frames, or according to differences in content (e.g., dividing into N segments based on content divisions such as the beginning, first section, second section, third section and end of a film); accordingly, after the video is segmented, the text sentences corresponding to each video segment are determined and treated as a text segment. Therefore, as shown in fig. 3, the step of determining text segments after segmenting the video (step S111) can be implemented as follows:
step S1111, cutting the video into N video segments according to a preset segmentation mode;

step S1112, extracting a number of text sentences in each video segment as the text segment corresponding to that video segment.
The preset segmentation mode may refer to equal division or to division according to content differences; it can be obtained by marking after manual editing, or by automatic division based on corresponding software functions. After video clips are obtained by manual clipping, the text segments corresponding to the video clips can be further extracted; when video clips are obtained by automatic division, the corresponding text segments can be extracted directly (for example, according to the start time and end time of a video clip, the text sentences between the start time and the end time are extracted from the time axis of the subtitle file).
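As a concrete illustration of the automatic division just described, the following sketch splits a video into N equal-duration segments and collects the subtitle sentences whose time range falls inside each segment; the subtitle representation and names are assumptions, not the patent's prescribed format.

```python
# A minimal sketch of equal-duration segmentation with subtitle alignment.
# The (start, end, sentence) subtitle format and all names are illustrative
# assumptions, not the patent's prescribed implementation.
from typing import List, Tuple

Subtitle = Tuple[float, float, str]  # (start_sec, end_sec, sentence)

def split_video(duration: float, subtitles: List[Subtitle], n: int):
    """Return N (time_range, sentences) pairs: one text segment per video segment."""
    seg_len = duration / n
    segments = []
    for k in range(n):
        start, end = k * seg_len, (k + 1) * seg_len
        sentences = [s for (b, e, s) in subtitles if b >= start and e <= end]
        segments.append(((start, end), sentences))
    return segments

# Example: a 300-second video cut into N=3 segments
segs = split_video(300.0,
                   [(2.0, 5.5, "Opening scene."), (150.0, 154.0, "Mid-story dialogue.")],
                   3)
```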
The process of obtaining the video local coding information and the text local coding information and the process of obtaining the video global coding information and the text global coding information can be carried out separately. Referring to fig. 4, the encoding process for video clips and text segments is as follows:
step S1121, extracting image frames from the video clips, and coding the image frames through a video coder to obtain video local coding information corresponding to the video clips;
step S1122, inputting the text segment corresponding to the video segment into the text encoder for encoding, so as to obtain local text encoding information corresponding to the text segment.
Referring to fig. 5, the encoding process for video and text information is as follows:
step S1131, inputting the video into a video encoder for encoding processing to obtain video global encoding information;
step S1132, inputting the text information into a text encoder for encoding, so as to obtain global text encoding information.
For video segments and text segments, in order to obtain a video representation of the video segments, frames are first extracted from the video segments, and then encoded by a video encoder, converting the image representation into a video representation, to obtain video partial encoding information of each video segment. Similarly, in order to obtain text representation, text sentences corresponding to the text segments are input into a text encoder to be encoded, and text local encoding information of each text sentence is obtained, wherein the video encoder corresponds to a video encoding matrix, the text encoder corresponds to a text encoding matrix, and initial values of the two matrices are random.
For video and text information, in order to obtain video representation, inputting a complete video into a video encoder for encoding processing to obtain video global encoding information; and inputting the complete text information into an encoder for encoding to obtain the global text encoding information. The video encoder/text encoder that performs global encoding here and the video encoder/text encoder that performs local encoding described above may be multiplexed or may be independent of each other.
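A minimal sketch of such encoders, assuming the video encoding matrix is a learnable linear projection over per-frame features and the text encoding matrix is a learnable token embedding, both randomly initialised; dimensions and class names are illustrative only.

```python
# Sketch of randomly initialised video/text encoders as learnable matrices,
# following the note above that both encoding matrices start from random values.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, frame_dim=2048, d_model=512):
        super().__init__()
        self.proj = nn.Linear(frame_dim, d_model)       # video encoding matrix (random init)

    def forward(self, frames):                          # frames: (num_frames, frame_dim)
        return self.proj(frames)                        # (num_frames, d_model)

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # text encoding matrix (random init)

    def forward(self, token_ids):                       # token_ids: (seq_len,)
        return self.embed(token_ids)                    # (seq_len, d_model)

# Global encoding applies an encoder to the whole video/text; local encoding
# applies the same (or an independent) encoder to each of the N segments.
```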
Notably, the improved T-Transformer model includes several attention networks, each consisting of one DMAN, one SAN and one FFN stacked in sequence. Specifically, each layer of a conventional Transformer is composed of two parts, a self-attention network (SAN) and a feed-forward neural network (FFN), and most current research treats these two parts separately when making enhancements.
A feed-forward neural network is one type of artificial neural network. It adopts a unidirectional multilayer structure in which each layer contains a number of neurons; each neuron receives signals from the neurons in the previous layer and generates outputs to the next layer. The 0th layer is called the input layer, the last layer is called the output layer, and the intermediate layers are called hidden layers. There can be one hidden layer or several.
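The position-wise feed-forward block used in the stacking described below (expand to a larger dimension, apply GELU, project back) can be sketched as follows; the sizes are illustrative assumptions.

```python
# Sketch of a position-wise feed-forward block: map to a larger feature space,
# apply GELU for non-linear screening, then recover the original dimension.
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # map to a larger feature space
            nn.GELU(),                     # non-linear screening
            nn.Linear(d_hidden, d_model),  # recover the original dimension
        )

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        return self.net(x)
```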
Both the SAN and the FFN can essentially be regarded as members of a broader class of architectures, Mask Attention Networks (MANs), in which the mask matrices are static. However, a static mask limits the model's ability to capture local information. Intuitively, because the mask matrix of the FFN is an identity matrix, the FFN can only access its own information and not that of its neighbours, while in the SAN each token attends to all other tokens in the sentence, so words outside the neighbourhood can also receive significant attention scores. The SAN may therefore introduce noise during semantic modeling and overlook valid local signals.
Obviously, with a static mask matrix the model can only consider words in a specific neighborhood, which achieves a better local modeling effect but lacks flexibility. Considering that the neighborhood size should change with the query token, the embodiment of the present invention constructs a strategy for dynamically adjusting the neighborhood size through a dynamic mask matrix M^{l,i}_{t,s}, where l is the current layer, i is the current attention head, t and s correspond to the positions of the query token and the key token respectively, σ is a constant, W^l and the related parameters are learnable matrix variables, and H^l represents the collection of individual attention heads in the multi-head attention mechanism.
Regarding the stacking mode, MANs can take various structures; the embodiment of the invention models by stacking DMAN, SAN and FFN in sequence. The Dynamic Mask Attention Network (DMAN) sits in the SAN part and uses a dynamic mask matrix to dynamically adjust the neighborhood size at each feature position, which better models local information; the FFN maps the attention result at each position into a higher-dimensional feature space, performs non-linear screening with the GELU (Gaussian Error Linear Units) activation function, and finally restores the original dimension, improving how each position attends to its own feature information. Specifically, the encoded local coding information is input into the SAN module to obtain a weighted feature vector Z; the attention function A_M(Q, K, V) of the MAN is defined as follows:

A_M(Q, K, V) = S_M(Q, K)V

S_M(Q, K)_{ij} = M_{ij}·exp(Q_iK_j^T/√d_k) / Σ_l M_{il}·exp(Q_iK_l^T/√d_k)

where Q represents the query, K represents the key, V represents the value, Q, K and V have consistent vector dimensions, M represents a dynamic mask matrix, S represents a softmax function, i indexes the i-th query in Q, j indexes the j-th key in K, and d_k represents the vector dimension of the K vector.

Thus, a set of attention weight values can be obtained according to S_M(Q, K); a fixed mask matrix of all ones in the mask attention network corresponds to a special case of S_M(Q, K).
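A small sketch of the masked attention A_M(Q, K, V) = S_M(Q, K)V defined above, with the mask applied inside the softmax normalisation; tensor shapes and the way the dynamic mask M would be produced are assumptions for illustration.

```python
# Sketch of A_M(Q, K, V) = S_M(Q, K) V with the mask M folded into the
# softmax normalisation.
import math
import torch

def masked_attention(Q, K, V, M):
    """Q, K, V: (seq_len, d_k); M: (seq_len, seq_len) mask with entries in [0, 1]."""
    d_k = Q.size(-1)
    scores = torch.exp(Q @ K.transpose(-2, -1) / math.sqrt(d_k))  # exp(QK^T / sqrt(d_k))
    masked = M * scores                                           # apply the mask element-wise
    S_M = masked / masked.sum(dim=-1, keepdim=True)               # row-wise normalisation
    return S_M @ V                                                # A_M(Q, K, V)

# With M all ones this reduces to ordinary softmax self-attention (the SAN case);
# with M equal to the identity each position attends only to itself (the FFN-like case).
Q = K = V = torch.rand(6, 64)
out = masked_attention(Q, K, V, torch.ones(6, 6))
```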
The model structure obtained by this construction is shown in fig. 6. The model is mainly divided into two blocks, a video input module (the upper half of fig. 6) and a text input module (the lower half of fig. 6). The videos and texts in the training set correspond one to one and the data are labeled, so they can be regarded as information in the same space; model training can therefore express the video and the text uniformly in that same space.
The improved T-Transformer model comprises a video T-Transformer model and a text T-Transformer model, parameters of the video T-Transformer model and parameters of the text T-Transformer model are independent of each other, parameters of a plurality of video T-Transformer models are shared, and parameters of a plurality of text T-Transformer models are shared. As can be seen from fig. 6, there are N +1 video T-Transformer models, where 1 video T-Transformer model is used to receive global video coding information, the remaining N video T-Transformer models are used to receive local video coding information, and these N +1 models output processed information respectively, and similarly, there are N +1 text T-Transformer models, where 1 text T-Transformer model is used to receive global text coding information, the remaining N text T-Transformer models are used to receive local text coding information, and these N +1 models output processed information respectively.
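The N+1 arrangement with parameter sharing can be sketched as follows for the video branch: one T-Transformer instance for the global stream and one shared instance reused for all N local segments. The `TTransformer` class here is a stand-in for the improved DMAN/SAN/FFN stack, not the patent's exact module.

```python
# Sketch of the N+1 arrangement: a dedicated global model plus a single local
# model whose parameters are shared across the N segments of one modality.
import torch.nn as nn

class TTransformer(nn.Module):              # placeholder for the DMAN -> SAN -> FFN stack
    def __init__(self, d_model=512):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        return self.layer(x)

class VideoBranch(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.global_model = TTransformer(d_model)   # receives the global coding information
        self.local_model = TTransformer(d_model)    # shared across the N local segments

    def forward(self, global_codes, local_codes):   # local_codes: list of N tensors
        g = self.global_model(global_codes)
        locals_out = [self.local_model(seg) for seg in local_codes]
        return g, locals_out

# A TextBranch would mirror this structure with its own, independent parameters.
```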
The outputs of the T-Transformer models are connected as inputs to Attention-FA (Attention-aware Feature Aggregation) modules; as shown in fig. 6, each T-Transformer model is connected to one Attention-FA module, where the Attention-FA module includes a global processing module and a local processing module. Referring to fig. 7, the aforementioned step S130 includes, but is not limited to, the following steps:
step S131, inputting the global information into a global processing module to obtain global features, wherein the global features comprise video global features and text global features;
step S132, inputting the local information into a local processing module to obtain local features, wherein the local features comprise video local features and text local features.
The video T-Transformer model used for processing the global video coding information and the text T-Transformer model used for processing the global text coding information are respectively connected with the global processing module, and the video T-Transformer model used for processing the local video coding information and the text T-Transformer model used for processing the local text coding information are respectively connected with the local processing module. And obtaining corresponding video global features, text global features, video local features and text local features through the processing of the Attention-FA module.
For the aspect of video processing, the Attention-FA module is used for performing Attention processing on a video and a video clip respectively, and the specific method is as follows:
Two random learnable matrices W1 and W2 are generated, together with the corresponding bias terms b1 and b2.

The output of the T-Transformer model is denoted K = x, where one part of x represents the video portion and the other part represents the text portion. Combining with the GELU formula, the corresponding matrices for the video and the text are calculated respectively:

Q = GELU(W1K^T + b1), K = x

A = softmax(W2Q + b2)^T

GELU is the Gaussian Error Linear Units activation function, and K^T is the transposed matrix of K. K^T is multiplied by the learnable matrix W1 and the bias b1 is added; the result is fed through the GELU activation function to compute the Q matrix. The Q matrix is then multiplied by the learnable matrix W2, the corresponding bias b2 is added, and the attention weight value A is obtained through the softmax function.
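A sketch of this Attention-FA aggregation, with Q = GELU(W1K^T + b1) and A = softmax(W2Q + b2)^T followed by a weighted sum over the T-Transformer outputs; the shapes, the inner dimension and the final pooling step are assumptions.

```python
# Sketch of Attention-FA: compute attention weights from the T-Transformer
# output and use them to pool it into a single feature vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFA(nn.Module):
    def __init__(self, d_model=512, d_attn=128):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_attn)   # learnable W1 with bias b1
        self.W2 = nn.Linear(d_attn, 1)         # learnable W2 with bias b2

    def forward(self, K):                      # K: (seq_len, d_model) T-Transformer output
        Q = F.gelu(self.W1(K))                 # (seq_len, d_attn)
        A = torch.softmax(self.W2(Q), dim=0).transpose(0, 1)  # (1, seq_len) attention weights
        return A @ K                           # (1, d_model) aggregated feature
```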
After the processing of the Attention-FA modules, the video local features can be expressed as {f_1^v, …, f_n^v}, where n is the number of segments; the video global feature can be expressed as g_v; the text local features can be expressed as {f_1^p, …, f_m^p}, where m is the number of sentences; and the text global feature can be expressed as g_p. The video global feature, the text global feature, the video local features and the text local features are input into the Contextual Transformer model for feature splicing.
It can be understood that the global processing module includes a video global Attention-FA module and a text global Attention-FA module, and the local processing module includes N video local Attention-FA modules and N text local Attention-FA modules, where N is a positive integer; the connections between these modules are shown in fig. 6. The essence of the attention mechanism is an addressing process: given a task-related query vector q, the attention distribution over the keys is computed and applied to the values to obtain the attention value. This is how the attention mechanism relieves the complexity of a neural network model: instead of feeding all N pieces of input information into the neural network for calculation, only the information in X that is relevant to the task is selected and fed into the network.
In step S140, the global features and the local features are used as common input to the Contextual Transformer model, and the video features and the text features are obtained through feature splicing processing, as shown in fig. 8, which may specifically include the following steps:
step S141, inputting the Local features into a preset Transformer model as Local Context, and performing maximum pooling operation on the output result to obtain a Local feature vector Flocal
Step S142, inputting the Global features as Global Context into a preset transform model to obtain Global feature vectors Fcross
Step S143, for FlocalAnd FcrossAnd performing feature splicing to obtain video features and text features.
In order to further enhance the Transformer's ability to capture different text features and video features, the global features of the video and the text information are input into a preset Transformer model of traditional structure, which consists of a multi-head self-attention network and a feed-forward neural network and adopts the shortcut structure of a residual network to alleviate the degradation problem in deep learning. The steps are as follows:
the video global feature g_v is input into the preset Transformer model to obtain a video global feature vector; the text global feature g_p is input into the preset Transformer model to obtain a text global feature vector; the video global feature vector and the text global feature vector serve as the global feature vector F_cross.
Before the local information is input into the preset Transformer model, an additional Positional Encoding vector is added at the model input; this vector encodes the position of the current information, or the distance between different words in a sentence, so that the constructed model can better account for order in the input sequence.
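The patent does not fix a particular form of Positional Encoding; assuming the standard sinusoidal encoding of the original Transformer, it can be sketched as follows.

```python
# Sketch of sinusoidal positional encoding (an illustrative assumption).
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)                    # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                                      # added to the inputs

# x_with_position = x + positional_encoding(x.size(0), x.size(1))
```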
Then, the video local features {f_1^v, …, f_n^v} are respectively input into the preset Transformer model, and the output results are max-pooled (Max Pooling, taking the maximum value in each dimension) to obtain a video local feature vector; the text local features {f_1^p, …, f_m^p} are respectively input into the preset Transformer model, and the output results are max-pooled to obtain a text local feature vector; the video local feature vector and the text local feature vector serve as the local feature vector F_local.

F_local and F_cross are then spliced to obtain the video feature and the text feature, denoted υ and δ respectively.
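A sketch of this splicing step under simplifying assumptions: each local output is max-pooled, the pooled segment vectors are combined into F_local, the global output is reduced to F_cross, and the two are concatenated. How the N pooled vectors are merged and how F_cross is reduced to a single vector are assumptions here.

```python
# Sketch of max pooling plus feature splicing of F_local and F_cross.
import torch

def fuse_features(local_outputs, global_output):
    """local_outputs: list of (seq_len_k, d_model) tensors; global_output: (seq_len, d_model)."""
    pooled = [seg.max(dim=0).values for seg in local_outputs]   # max pooling per segment
    f_local = torch.stack(pooled).max(dim=0).values             # (d_model,) local feature vector
    f_cross = global_output.max(dim=0).values                   # (d_model,) global feature vector
    return torch.cat([f_local, f_cross], dim=-1)                # spliced feature, (2 * d_model,)

video_feature = fuse_features([torch.rand(8, 512) for _ in range(3)], torch.rand(20, 512))
```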
Referring to fig. 9, after obtaining the video feature v and the text feature δ, an improved T-Transformer model may be optimized according to the output result, specifically:
step S161, constructing a loss function according to the video features and the text features output by the Contextual Transformer model;
step S162, optimizing the improved T-Transformer model by using a loss function, wherein the loss function is expressed as follows:
L(P, N, α) = max(0, α + D(x, y) − D(x′, y)) + max(0, α + D(x, y) − D(x, y′))

D(x, y) = 1 − x^T y / (‖x‖‖y‖)

where x represents the video feature output by the Contextual Transformer model, y represents the text feature output by the Contextual Transformer model, and x′ and y′ represent negative samples of x and y. A negative sample pair, denoted N in L(P, N, α) and written (x′, y) or (x, y′), indicates that the video and the text in the current sample do not correspond to each other, i.e. the video or the text comes from another sample; a positive sample pair, denoted P and written (x, y), indicates that the video and the text in the current sample correspond to each other. α is a constant parameter serving as a conversion factor: if its value is set too large, the model's learning parameters do not converge, and if it is set too small, learning is slow; in this embodiment α is 0.2.
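The loss above can be sketched as a bidirectional ranking loss with D(x, y) = 1 − cos(x, y); drawing the negatives x′ and y′ from other samples in the batch is an assumption for illustration.

```python
# Sketch of the bidirectional ranking loss with cosine distance.
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def ranking_loss(x, y, x_neg, y_neg, alpha=0.2):
    """x, y: matched video/text features; x_neg, y_neg: features from other samples."""
    d_pos = cosine_distance(x, y)
    loss = torch.clamp(alpha + d_pos - cosine_distance(x_neg, y), min=0.0) \
         + torch.clamp(alpha + d_pos - cosine_distance(x, y_neg), min=0.0)
    return loss.mean()

loss = ranking_loss(torch.rand(4, 1024), torch.rand(4, 1024),
                    torch.rand(4, 1024), torch.rand(4, 1024))
```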
After the model has been trained on the manually labeled data, similarity calculation can be performed directly with the video feature υ and the text feature δ output by the model, for example using cosine similarity, so that the similarity between a video and a text can be compared and operations such as similarity retrieval can be performed. For example, if a video has no text or label at all, only the video itself, the corresponding video can still be retrieved by entering text, as sketched after the steps below. Namely:
acquiring a text to be retrieved;
and inputting the text to be retrieved into the optimized improved T-Transformer model to obtain a target video matched with the text to be retrieved.
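A sketch of this retrieval step, assuming the video features of an unlabeled gallery have been precomputed with the trained model and the query text has been mapped to the same space; the helper names are illustrative.

```python
# Sketch of text-to-video retrieval: rank precomputed video features by cosine
# similarity to the query text feature and return the top-k indices.
import torch
import torch.nn.functional as F

def retrieve(text_feature: torch.Tensor, video_features: torch.Tensor, top_k: int = 5):
    """text_feature: (d,); video_features: (num_videos, d); returns top-k video indices."""
    sims = F.cosine_similarity(text_feature.unsqueeze(0), video_features, dim=-1)
    return torch.topk(sims, k=min(top_k, video_features.size(0))).indices

# Example with random stand-in features for a gallery of 100 unlabeled videos
ranked = retrieve(torch.rand(1024), torch.rand(100, 1024))
```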
Referring to fig. 10, before acquiring the video and the corresponding text information, the following steps may be further included:
step S171, labeling the video, and using the labeled video as the video for training;
and step S172, marking the text information according to the marking mode of the video, and taking the marked text information as the text information for training.
Through the steps, the video and the text can be converted into the same contrast space, and similarity calculation is carried out on two different objects. Even if the video does not have any label or title and the like, the video can be searched by inputting the text, so that the corresponding target video is obtained through matching.
In addition, referring to fig. 11, an embodiment of the present invention provides a video and text similarity determining apparatus, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video and corresponding text information and coding the video and the text information to obtain coding characteristic information, and the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
the first processing unit is used for inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
the second processing unit is used for respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global characteristics and local characteristics;
the context processing unit is used for inputting the global features and the local features into a Contextual Transformer model as common input, and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video, and the text features correspond to the text information;
and the similarity calculation unit is used for determining the similarity between the video and the text information according to the video characteristics and the text characteristics.
In addition, referring to fig. 12, an embodiment of the present invention also provides an electronic device 2000 including: a memory 2002, a processor 2001, and a computer program stored on the memory 2002 and executable on the processor 2001.
The processor 2001 and memory 2002 may be connected by a bus or other means.
Non-transitory software programs and instructions necessary to implement the video and text similarity determination method of the above-described embodiment are stored in the memory 2002, and when executed by the processor 2001, the video and text similarity determination method applied to the apparatus in the above-described embodiment is performed, for example, the method steps S110 to S140 in fig. 1, S111 to S112 in fig. 2, S1111 to S1112 in fig. 3, S1121 to S1122 in fig. 4, S1131 to S1132 in fig. 5, S131 to S132 in fig. 7, S141 to S143 in fig. 8, S161 to S162 in fig. 9, and S171 to S172 in fig. 10 described above are performed.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Further, an embodiment of the present invention also provides a computer-readable storage medium storing a computer program, the computer program is executed by a processor or controller, e.g., by a processor in an embodiment of the electronic device described above, the processor may be caused to execute the video and text similarity determination method in the above embodiment, for example, to execute the above-described method steps S110 to S140 in fig. 1, method steps S111 to S112 in fig. 2, method steps S1111 to S1112 in fig. 3, method steps S1121 to S1122 in fig. 4, method steps S1131 to S1132 in fig. 5, method steps S131 to S132 in fig. 7, method steps S141 to S143 in fig. 8, method steps S161 to S162 in fig. 9, and method steps S171 to S172 in fig. 10. It will be understood by those of ordinary skill in the art that all or some of the steps, means, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable storage media, which may include computer storage media (or non-transitory storage media) and communication storage media (or transitory storage media). The term computer storage media includes volatile and nonvolatile, removable and non-removable storage media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other storage medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication storage media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery storage media as is well known to those of ordinary skill in the art.
The application is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor devices, microprocessor-based devices, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above devices or equipment, and the like. The application may be described in the general context of computer programs, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more programs for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based apparatus that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A video and text similarity determination method, characterized by comprising the following steps:
acquiring a video and corresponding text information, and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features;
inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video and the text features correspond to the text information;
and determining the similarity between the video and the text information according to the video features and the text features.
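For illustration only (and not as part of the claims), the following is a minimal sketch of the flow recited in claim 1, assuming a PyTorch implementation: a plain nn.TransformerEncoder stands in for the improved T-Transformer (the DMAN/SAN/FFN stack), a simple attention-pooling layer stands in for the Attention-FA module, a one-layer encoder stands in for the Contextual Transformer model, and cosine similarity is one possible choice of similarity measure. All class names, dimensions and hyperparameters are assumptions of this sketch, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared embedding width (assumed)

class AttentionPool(nn.Module):
    """Stand-in for the Attention-FA module: attention-weighted pooling over tokens."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, tokens, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention weights over tokens
        return (w * x).sum(dim=1)                # (batch, dim)

class VideoTextSimilarity(nn.Module):
    def __init__(self, dim=D):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.t_transformer = nn.TransformerEncoder(layer, num_layers=2)  # stands in for the DMAN+SAN+FFN stack
        self.attn_fa_global = AttentionPool(dim)
        self.attn_fa_local = AttentionPool(dim)
        self.contextual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1)
        self.proj = nn.Linear(2 * dim, dim)       # projection after feature splicing

    def encode(self, global_tokens, local_tokens):
        g = self.t_transformer(global_tokens)     # global information
        l = self.t_transformer(local_tokens)      # local information
        g_feat = self.attn_fa_global(g)           # global feature (batch, dim)
        l_feat = self.attn_fa_local(l)            # local feature  (batch, dim)
        ctx = self.contextual(torch.stack([g_feat, l_feat], dim=1))
        spliced = torch.cat([ctx[:, 0], ctx[:, 1]], dim=-1)   # feature splicing
        return self.proj(spliced)

    def forward(self, video_global, video_local, text_global, text_local):
        v = self.encode(video_global, video_local)             # video feature
        t = self.encode(text_global, text_local)               # text feature
        return F.cosine_similarity(v, t, dim=-1)               # similarity in a shared space

# toy usage with random "encoded" inputs
model = VideoTextSimilarity()
sim = model(torch.randn(2, 16, D), torch.randn(2, 64, D),
            torch.randn(2, 16, D), torch.randn(2, 64, D))
print(sim.shape)  # torch.Size([2])
```

In this sketch the video and text branches share weights for brevity; separate per-modality branches would follow the embodiment more closely.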
2. The video and text similarity determination method according to claim 1, wherein the step of coding the video and the text information to obtain the coding characteristic information comprises:
segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
respectively coding the video segments and the text segments to obtain video local coding information and text local coding information;
and respectively coding the video and the text information to obtain video global coding information and text global coding information.
3. The video and text similarity determination method according to claim 2, wherein the step of segmenting the video and the text information comprises:
segmenting the video into N video segments according to a preset segmentation mode;
and extracting a plurality of text sentences in each video segment to serve as text segments corresponding to the video segments.
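As a purely illustrative, non-limiting aid, one possible reading of this segmentation step is sketched below in Python, assuming the text information arrives as timestamped sentences (for example subtitles) and that the "preset segmentation mode" is an equal-duration split; both assumptions are choices of this sketch.

```python
from typing import List, Tuple

def segment(duration: float, sentences: List[Tuple[float, str]], n: int):
    """Cut a video of `duration` seconds into n equal clips and attach to each
    clip the sentences whose start time falls inside it."""
    clip_len = duration / n
    clips = [(i * clip_len, (i + 1) * clip_len) for i in range(n)]
    texts = [[] for _ in range(n)]
    for start, sent in sentences:
        idx = min(int(start // clip_len), n - 1)   # clip index for this sentence
        texts[idx].append(sent)
    return clips, texts

clips, texts = segment(
    60.0,
    [(2.0, "a man opens a door"), (31.5, "he sits down"), (55.0, "he reads a book")],
    n=3)
print(clips)   # [(0.0, 20.0), (20.0, 40.0), (40.0, 60.0)]
print(texts)   # [['a man opens a door'], ['he sits down'], ['he reads a book']]
```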
4. The video and text similarity determination method according to claim 2, wherein the step of respectively coding the video segments and the text segments to obtain the video local coding information and the text local coding information comprises:
extracting image frames from the video segments, and coding the image frames through a video encoder to obtain video local coding information corresponding to the video segments;
and inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain text local encoding information corresponding to the text segment.
5. The video and text similarity determination method according to claim 2, wherein the step of respectively coding the video and the text information to obtain the video global coding information and the text global coding information comprises:
inputting the video into a video encoder for coding processing to obtain the video global coding information;
and inputting the text information into a text encoder for coding processing to obtain the text global coding information.
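The following sketch, again for illustration only, shows how a video encoder and a text encoder could produce the local and global coding information of claims 4 and 5; the frame encoder (a linear projection over downsampled frames) and the sentence encoder (an embedding bag) are simple stand-ins and not the specific encoders of the embodiment, and all shapes are assumed.

```python
import torch
import torch.nn as nn

D = 512

class FrameEncoder(nn.Module):
    """Stand-in video encoder: projects flattened, downsampled frames to D dims."""
    def __init__(self, in_dim=3 * 64 * 64, dim=D):
        super().__init__()
        self.fc = nn.Linear(in_dim, dim)

    def forward(self, frames):                 # frames: (num_frames, 3, 64, 64)
        return self.fc(frames.flatten(1))      # (num_frames, D)

class SentenceEncoder(nn.Module):
    """Stand-in text encoder: averages token embeddings per sentence."""
    def __init__(self, vocab=30000, dim=D):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim, mode="mean")

    def forward(self, token_ids):              # token_ids: (num_sentences, seq_len)
        return self.emb(token_ids)             # (num_sentences, D)

video_enc, text_enc = FrameEncoder(), SentenceEncoder()

# local coding information: encode the frames / sentences of one clip
clip_frames = torch.randn(8, 3, 64, 64)          # 8 frames sampled from a clip
clip_tokens = torch.randint(0, 30000, (2, 12))   # 2 sentences for that clip
video_local = video_enc(clip_frames)             # (8, D)
text_local = text_enc(clip_tokens)               # (2, D)

# global coding information: encode frames / sentences of the whole video and text
video_global = video_enc(torch.randn(32, 3, 64, 64))
text_global = text_enc(torch.randint(0, 30000, (6, 12)))
```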
6. The video and text similarity determination method according to claim 1, wherein the Attention-FA module comprises a global processing module and a local processing module; the step of inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features includes:
inputting the global information into the global processing module to obtain global features, wherein the global features comprise video global features and text global features;
and inputting the local information into the local processing module to obtain local features, wherein the local features comprise video local features and text local features.
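A compact, non-limiting sketch of the global/local processing split in claim 6 follows; the attention-weighted pooling used here is only a stand-in for the Attention-FA module, and all tensor shapes and the shared use of one module per granularity are assumptions of the sketch.

```python
import torch
import torch.nn as nn

D = 512

class AttentionFA(nn.Module):
    """Stand-in: scores each token and returns the attention-weighted sum."""
    def __init__(self, dim=D):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, tokens, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)                # (batch, dim)

global_module, local_module = AttentionFA(), AttentionFA()

video_global_info = torch.randn(2, 16, D)   # outputs of the improved T-Transformer
text_global_info = torch.randn(2, 16, D)
video_local_info = torch.randn(2, 64, D)
text_local_info = torch.randn(2, 64, D)

# global features = {video global feature, text global feature}
video_global_feat = global_module(video_global_info)
text_global_feat = global_module(text_global_info)
# local features = {video local feature, text local feature}
video_local_feat = local_module(video_local_info)
text_local_feat = local_module(text_local_info)
```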
7. The video and text similarity determination method according to claim 1, wherein the step of inputting the global features and the local features as a common input into a Contextual Transformer model to obtain video features and text features through feature splicing processing comprises:
inputting the local features into a preset Transformer model as a Local Context, and performing a maximum pooling operation on the output result to obtain a local feature vector F_local;
inputting the global features into the preset Transformer model as a Global Context to obtain a global feature vector F_cross;
and performing feature splicing on the F_local and the F_cross to obtain video features and text features.
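For illustration, the feature splicing of claim 7 might look as follows in PyTorch, with a plain TransformerEncoder standing in for the preset Contextual Transformer model; the max over the token dimension implements the maximum pooling, and the closing cosine similarity (corresponding to the similarity determination of claim 1) is an assumption of this sketch, not a claimed formula.

```python
import torch
import torch.nn as nn

D = 512
ctx_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=1)

def splice(local_feats, global_feats):
    """local_feats: (batch, n_clips, D) as Local Context; global_feats: (batch, 1, D) as Global Context."""
    # local features through the Transformer, then max pooling -> F_local
    f_local = ctx_transformer(local_feats).max(dim=1).values   # (batch, D)
    # global feature through the Transformer -> F_cross
    f_cross = ctx_transformer(global_feats).squeeze(1)         # (batch, D)
    return torch.cat([f_local, f_cross], dim=-1)               # spliced feature (batch, 2*D)

video_feature = splice(torch.randn(2, 4, D), torch.randn(2, 1, D))
text_feature = splice(torch.randn(2, 4, D), torch.randn(2, 1, D))
similarity = torch.cosine_similarity(video_feature, text_feature, dim=-1)
print(similarity.shape)  # torch.Size([2])
```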
8. A video and text similarity determination apparatus, comprising:
an acquisition unit, used for acquiring a video and corresponding text information and coding the video and the text information to obtain coding characteristic information, wherein the coding characteristic information comprises video local coding information, video global coding information, text local coding information and text global coding information;
a first processing unit, used for inputting the coding characteristic information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN and a feedforward neural network FFN;
a second processing unit, used for respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features;
a context processing unit, used for inputting the global features and the local features as a common input into a Contextual Transformer model and obtaining video features and text features through feature splicing processing, wherein the video features correspond to the video and the text features correspond to the text information;
and a similarity calculation unit, used for determining the similarity between the video and the text information according to the video features and the text features.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the video and text similarity determination method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video and text similarity determination method according to any one of claims 1 to 7.
CN202210234257.8A 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium Pending CN114612826A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210234257.8A CN114612826A (en) 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium
PCT/CN2022/090656 WO2023168818A1 (en) 2022-03-09 2022-04-29 Method and apparatus for determining similarity between video and text, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210234257.8A CN114612826A (en) 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114612826A (en) 2022-06-10

Family

ID=81862202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210234257.8A Pending CN114612826A (en) 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114612826A (en)
WO (1) WO2023168818A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200304822A1 (en) * 2018-03-05 2020-09-24 Tencent Technology (Shenzhen) Company Limited Video processing method and apparatus, video retrieval method and apparatus, storage medium, and server
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
US20210109966A1 (en) * 2019-10-15 2021-04-15 Adobe Inc. Video retrieval using temporal visual content
US20210209155A1 (en) * 2020-01-08 2021-07-08 Baidu Online Network Technology (Beijing) Co., Ltd. Method And Apparatus For Retrieving Video, Device And Medium
CN113779361A (en) * 2021-08-27 2021-12-10 华中科技大学 Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
CN114037945A (en) * 2021-12-10 2022-02-11 浙江工商大学 Cross-modal retrieval method based on multi-granularity feature interaction
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102542788B1 (en) * 2018-01-08 2023-06-14 삼성전자주식회사 Electronic apparatus, method for controlling thereof, and computer program product thereof
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113784199B (en) * 2021-09-10 2022-09-13 中国科学院计算技术研究所 System, method, storage medium and electronic device for generating video description text
CN114048351A (en) * 2021-11-08 2022-02-15 湖南大学 Cross-modal text-video retrieval method based on space-time relationship enhancement

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200304822A1 (en) * 2018-03-05 2020-09-24 Tencent Technology (Shenzhen) Company Limited Video processing method and apparatus, video retrieval method and apparatus, storage medium, and server
US20210109966A1 (en) * 2019-10-15 2021-04-15 Adobe Inc. Video retrieval using temporal visual content
US20210209155A1 (en) * 2020-01-08 2021-07-08 Baidu Online Network Technology (Beijing) Co., Ltd. Method And Apparatus For Retrieving Video, Device And Medium
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113779361A (en) * 2021-08-27 2021-12-10 华中科技大学 Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN114037945A (en) * 2021-12-10 2022-02-11 浙江工商大学 Cross-modal retrieval method based on multi-granularity feature interaction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Simon Ging et al., "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", arXiv:2011.00597v1, 30 November 2021 (2021-11-30), pages 1-27 *
Yehao Li et al., "Contextual Transformer Networks for Visual Recognition", IEEE, 1 April 2022 (2022-04-01), pages 1489-1500 *
Zhihao Fan et al., "Mask Attention Networks: Rethinking and Strengthen Transformer", arXiv:2103.13597v1, 31 March 2021 (2021-03-31), pages 1-10 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium

Also Published As

Publication number Publication date
WO2023168818A1 (en) 2023-09-14

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
Han et al. Memory-augmented dense predictive coding for video representation learning
Cao et al. Deep neural networks for learning graph representations
Surís et al. Cross-modal embeddings for video and audio retrieval
CN106933804B (en) Structured information extraction method based on deep learning
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN111931061B (en) Label mapping method and device, computer equipment and storage medium
CN112182166A (en) Text matching method and device, electronic equipment and storage medium
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN113327279A (en) Point cloud data processing method and device, computer equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN115424013A (en) Model training method, image processing apparatus, and medium
Chauhan et al. Analysis of Intelligent movie recommender system from facial expression
CN114519397A (en) Entity link model training method, device and equipment based on comparative learning
Liu et al. Resume parsing based on multi-label classification using neural network models
CN112132075B (en) Method and medium for processing image-text content
CN114612826A (en) Video and text similarity determination method and device, electronic equipment and storage medium
Cornia et al. Towards cycle-consistent models for text and image retrieval
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN115359486A (en) Method and system for determining custom information in document image
CN113569867A (en) Image processing method and device, computer equipment and storage medium
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Zulkernine et al. A hands-on tutorial on deep learning for object and pattern recognition
Tran WaveTransformer: An Architecture For Automated Audio Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination