WO2023168818A1 - Method and apparatus for determining similarity between video and text, electronic device, and storage medium - Google Patents

Method and apparatus for determining similarity between video and text, electronic device, and storage medium

Info

Publication number
WO2023168818A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
information
features
global
Prior art date
Application number
PCT/CN2022/090656
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023168818A1 publication Critical patent/WO2023168818A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
  • Images/videos and text information do not belong to the same expression space, so it is difficult to compare their similarity; for example, it is difficult to retrieve a video by inputting text.
  • Embodiments of the present application provide a method for determining the similarity between a video and text, including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide an apparatus for determining the similarity between a video and text, including:
  • an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements a method for determining the similarity between a video and text, the method including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program being used to execute a method for determining the similarity between a video and text, the method including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • The embodiments of the present application have at least the following beneficial effects: a video and corresponding text information are obtained, and the video and the text information are encoded to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; the coding feature information is input into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN; the global information and the local information are input into the corresponding Attention-FA modules respectively to obtain global features and local features; the global features and the local features are used as a common input into the Contextual Transformer model, and video features and text features are obtained through feature splicing, where the video features correspond to the video and the text features correspond to the text information; the similarity between the video and the text information is determined according to the video features and the text features.
  • In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated. For example, even if a video has no tags or title, it can still be retrieved by entering text.
  • Figure 1 is a flow chart of a method for determining video and text similarity provided by an embodiment of the present application
  • Figure 2 is a flow chart of encoding processing provided by another embodiment of the present application.
  • Figure 3 is a flow chart for dividing video and text information provided by another embodiment of the present application.
  • Figure 4 is a flow chart of encoding video clips and text segments provided by another embodiment of the present application.
  • Figure 5 is a flow chart of encoding video and text information provided by another embodiment of the present application.
  • Figure 6 is a flow chart of model connection relationships provided by another embodiment of the present application.
  • Figure 7 is a flow chart for obtaining global features and local features provided by another embodiment of the present application.
  • Figure 8 is a structural diagram of video features and text features obtained through feature splicing provided by another embodiment of the present application.
  • Figure 9 is a flow chart of optimizing the improved T-Transformer model provided by another embodiment of the present application.
  • Figure 10 is a flow chart for labeling training samples provided by another embodiment of the present application.
  • Figure 11 is a structural diagram of a similarity determination device provided by another embodiment of the present application.
  • Figure 12 is a device diagram of an electronic device provided by another embodiment of the present application.
  • This application provides a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
  • The method includes: obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the Dynamic Mask Attention Network (DMAN), the Self-Attention Network (SAN), and the Feedforward Neural Network (FFN); inputting the global information and the local information into the corresponding Attention-FA modules respectively to obtain global features and local features; inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features.
  • In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated. For example, even if a video has no tags or title, it can still be retrieved by entering text.
  • The embodiments of this application can obtain and process relevant data based on artificial intelligence (AI) technology.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • The terminal mentioned in the embodiments of this application may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted computer, a smart home device, a wearable electronic device, a VR (Virtual Reality) or AR (Augmented Reality) device, and the like.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
  • The server may also be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution networks, big data, and artificial intelligence platforms.
  • Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • Figure 1 is a flow chart of a video and text similarity determination method provided by an embodiment of the present application.
  • the video and text similarity determination method includes but is not limited to the following steps:
  • Step S110: obtain a video and corresponding text information, and encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • Step S120: input the coding feature information into the improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feedforward neural network FFN;
  • Step S130: input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • Step S140: use the global features and the local features as a common input to the Contextual Transformer model, and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • Step S150: determine the similarity between the video and the text information based on the video features and the text features.
  • The modeling idea of this method is to comprehensively consider the global information and the local information of the video and the text, to feature-encode the two types of information with a collaborative hierarchical transformer model, and to improve on the traditional transformer model, namely the Temporal Transformer (T-Transformer) and the Contextual Transformer.
  • The embodiment of the present application is applied to judging the similarity between a video and text information.
  • During training, the video used for training and the text information corresponding to the video are processed according to step S110; after passing through the T-Transformer model, the Attention-FA module, and the Contextual Transformer model in sequence, video features and text features are output, thereby providing a basis for judging the similarity between the video and the text information.
  • The trained model can then be used to retrieve associated videos based on text.
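  • To make the data flow of steps S110 to S150 concrete, a minimal end-to-end sketch follows. It is an illustration only: standard PyTorch components (linear encoders, a plain TransformerEncoder, mean pooling) stand in for the patent's encoders, improved T-Transformer, Attention-FA, and Contextual Transformer modules, which are described in more detail below, and all dimensions and feature extractors are assumptions.
```python
# Minimal sketch of steps S110-S150; all module choices and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256                                     # shared embedding dimension (assumed)
video_encoder = nn.Linear(2048, D)          # video encoding matrix, randomly initialised
text_encoder = nn.Linear(768, D)            # text encoding matrix, randomly initialised

def make_transformer(num_layers=2):
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

video_t = make_transformer()                # stand-in for the video T-Transformer
text_t = make_transformer()                 # stand-in for the text T-Transformer (independent parameters)
contextual = make_transformer(1)            # stand-in for the Contextual Transformer

def aggregate(t_transformer, x):
    """T-Transformer followed by a (simplified) Attention-FA-style aggregation."""
    return t_transformer(x.unsqueeze(0)).mean(dim=1)            # (1, D)

def forward_pair(frame_feats, clip_frame_feats, sent_feats, clip_sent_feats):
    """frame_feats: (T, 2048) frames of the whole video; clip_frame_feats: N tensors (t_i, 2048);
    sent_feats: (S, 768) sentences of the whole text; clip_sent_feats: N tensors (s_i, 768)."""
    # Step S110: global and local encoding for video and text.
    g_v = aggregate(video_t, video_encoder(frame_feats))         # from video global coding information
    g_p = aggregate(text_t, text_encoder(sent_feats))            # from text global coding information
    local_v = torch.cat([aggregate(video_t, video_encoder(c)) for c in clip_frame_feats], dim=0)
    local_p = torch.cat([aggregate(text_t, text_encoder(c)) for c in clip_sent_feats], dim=0)

    # Step S140: Contextual Transformer over the local context, max-pooled, then
    # spliced with the global features (F_local concatenated with F_cross).
    f_local_v = contextual(local_v.unsqueeze(0)).max(dim=1).values
    f_local_p = contextual(local_p.unsqueeze(0)).max(dim=1).values
    video_feature = torch.cat([f_local_v, g_v], dim=-1)
    text_feature = torch.cat([f_local_p, g_p], dim=-1)

    # Step S150: similarity between the video and the text information.
    return F.cosine_similarity(video_feature, text_feature, dim=-1)
```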
  • obtaining the encoding feature information in step S110 can be achieved through the following steps:
  • Step S111: segment the video and the text information to obtain N video segments and N text segments, where each video segment corresponds to one text segment and N is a positive integer;
  • Step S112: encode the video segments and the text segments respectively, to obtain the video local coding information and the text local coding information;
  • Step S113: encode the video and the text information respectively, to obtain the video global coding information and the text global coding information.
  • In order to obtain the video local coding information, the video global coding information, the text local coding information, and the text global coding information, two kinds of processing need to be performed on the video and the text information.
  • One is to encode the entire video to obtain the video global coding information, and to encode the text information as a whole to obtain the text global coding information.
  • The other is to divide the video into N video segments, which are used as local information and encoded separately, thereby obtaining N pieces of video local coding information. Since the text information corresponds to the video, after the video is segmented the corresponding text information is also divided according to the video segmentation, and N text segments are obtained. Each text segment is composed of several text sentences. The N text segments are encoded separately to obtain N pieces of text local coding information.
  • The video can be divided in different ways, for example by dividing it evenly into N parts according to duration or total number of frames, or by dividing it according to differences in content (such as the opening of a movie, the first section, the second section, the third section, the ending, and so on). Correspondingly, after segmenting the video, the text sentences corresponding to each video segment are determined and used as the text segments. Therefore, referring to Figure 3, the step of determining text segments after segmenting the video (step S111) can be implemented in the following manner:
  • Step S1111: divide the video into N video segments according to a preset segmentation method;
  • Step S1112: extract the text sentences in each video segment as the text segment corresponding to that video segment.
  • The preset segmentation method can be the above-mentioned even division, or a further division according to content differences. The segments can be annotated after manual editing, or the video can be divided automatically by corresponding software. After the video segments are edited manually, the text segments corresponding to the video segments can be further extracted; when the video segments are divided automatically, the text segments corresponding to the video segments can be extracted directly (for example, according to the start time and end time of a video segment, the text sentences between the start time and the end time are extracted from the timeline of the subtitle file).
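  • As one possible realisation of step S1112, the text sentences belonging to each clip can be pulled from a subtitle file by comparing timestamps; the subtitle representation and field names below are assumptions for illustration.
```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Subtitle:
    start: float   # seconds from the beginning of the video
    end: float
    text: str

def clip_text_segments(subtitles: List[Subtitle],
                       clip_bounds: List[Tuple[float, float]]) -> List[List[str]]:
    """clip_bounds is a list of (start, end) pairs, one per video clip, produced by
    the preset segmentation method (e.g. N equal-duration parts). Returns, for each
    clip, the subtitle sentences whose time span lies inside the clip, i.e. the text
    segment corresponding to that clip."""
    segments = []
    for start, end in clip_bounds:
        segments.append([s.text for s in subtitles
                         if s.start >= start and s.end <= end])
    return segments

# Example: split a 120-second video evenly into N = 4 clips of 30 s each.
N, duration = 4, 120.0
bounds = [(i * duration / N, (i + 1) * duration / N) for i in range(N)]
```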
  • The encoding process for obtaining the video local coding information and the text local coding information, and the encoding process for obtaining the video global coding information and the text global coding information, can be performed separately.
  • the encoding process of the segment is as follows:
  • Step S1121 extract image frames from the video clips, encode the image frames through the video encoder, and obtain video local coding information corresponding to the video clips;
  • Step S1122 Input the text segment corresponding to the video segment into the text encoder for encoding, and obtain the text local encoding information corresponding to the text segment.
  • Step S1131 input the video into the video encoder for encoding processing, and obtain the video global encoding information
  • Step S1132 input the text information into the text encoder for encoding processing, and obtain the text global encoding information.
  • For the video segments and text segments, in order to obtain the video representation of each video segment, frames are first extracted from the video segment and then encoded by the video encoder, converting the image representation into a video representation and obtaining the video local coding information of each video segment.
  • The text sentences corresponding to the text segments are input into the text encoder for encoding, and the text local coding information of each text sentence is obtained.
  • The video encoder corresponds to a video encoding matrix and the text encoder corresponds to a text encoding matrix; the initial values of both matrices are random.
  • the complete video is input into the video encoder for encoding processing to obtain the video global encoding information; the complete text information is input into the encoder for encoding to obtain the text global encoding information.
  • the video encoder/text encoder for global encoding here and the video encoder/text encoder for local encoding mentioned above may be multiplexed or may be independent of each other.
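  • A minimal sketch of the two-pass encoding of steps S1121 to S1132, assuming each encoder is a randomly initialised encoding matrix (a linear projection) as described above, applied once per clip/segment for the local pass and once over the whole video/text for the global pass; whether the local and global passes share encoders is left open, as in the text.
```python
import torch
import torch.nn as nn

D = 256                                          # common embedding dimension (assumed)
video_encoder = nn.Linear(2048, D, bias=False)   # video encoding matrix, random initial values
text_encoder = nn.Linear(768, D, bias=False)     # text encoding matrix, random initial values
# The same instances are reused below for the local and the global pass; per the
# description they could just as well be independent encoders.

def encode_local(clip_frames, clip_sentences):
    """clip_frames: (t, 2048) features of image frames extracted from one video clip;
    clip_sentences: (s, 768) features of the text sentences in the matching text segment."""
    video_local = video_encoder(clip_frames)      # video local coding information (step S1121)
    text_local = text_encoder(clip_sentences)     # text local coding information (step S1122)
    return video_local, text_local

def encode_global(all_frames, all_sentences):
    """all_frames / all_sentences cover the complete video and the complete text information."""
    video_global = video_encoder(all_frames)      # video global coding information (step S1131)
    text_global = text_encoder(all_sentences)     # text global coding information (step S1132)
    return video_global, text_global
```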
  • Each layer of the current Transformer is composed of two parts, namely the self-attention network (SAN) and the feedforward neural network (FFN). Most current research separates these two parts and enhances them individually.
  • Feedforward neural network is a type of artificial neural network.
  • Feedforward neural network adopts a unidirectional multi-layer structure. Each layer contains several neurons. In this kind of neural network, each neuron can receive the signal of the neuron of the previous layer and generate output to the next layer.
  • the 0th layer is called the input layer
  • the last layer is called the output layer
  • the other intermediate layers are called hidden layers.
  • the hidden layer can be one layer or multiple layers.
  • SAN and FFN essentially belong to a broader class of neural network structures, Mask Attention Networks (MANs), in which the mask matrices are static; however, the static mask approach limits the model.
  • FFN can only obtain a position's own information and cannot obtain the information of its neighbors.
  • In SAN, each token can obtain information about all other tokens in the sentence, so words that are not in the neighborhood may also receive a considerable attention score. SAN may therefore introduce noise into the semantic modeling process and ignore locally effective signals.
  • The dynamic mask matrix is computed per layer and per attention head (the defining equation is not reproduced here), where l is the index of the current layer, i is the index of the current attention head, t and s correspond to the positions of the query token and the key token respectively, a constant scaling factor is involved, the W^l are all learnable matrix variables, and H^l represents the set of attention heads under the multi-head attention mechanism.
  • MANs have various structures.
  • The embodiment of this application models by stacking DMAN, SAN, and FFN in sequence.
  • The Dynamic Mask Attention Network (DMAN) works in the SAN part: it uses a dynamic mask matrix to dynamically adjust the neighborhood size of each feature position, which better models local information. The FFN maps the attention result of each position to a larger-dimensional feature space, applies the GELU (Gaussian Error Linear Units) activation function for non-linear screening, and finally restores the original dimension, which improves the attention paid to each position's own feature information.
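  • As a small illustration of the FFN sub-layer just described, the sketch below expands each position to a larger feature space, applies GELU, and projects back to the original dimension; the dimensions are assumptions, not values from the patent.
```python
import torch.nn as nn

d_model, d_ff = 256, 1024      # illustrative dimensions
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # map each position's attention result to a larger feature space
    nn.GELU(),                 # non-linear screening with the GELU activation
    nn.Linear(d_ff, d_model),  # restore the original dimension
)
```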
  • The encoded local coding information is input into the SAN module to obtain a weighted feature vector Z. This Z is the MAN attention function A_M(Q, K, V), defined as

    A_M(Q, K, V)_i = \sum_j S_M(Q, K)_{ij} V_j, \quad S_M(Q, K)_{ij} = \frac{M_{ij}\,\exp\!\big(Q_i K_j^\top / \sqrt{d_k}\big)}{\sum_{j'} M_{ij'}\,\exp\!\big(Q_i K_{j'}^\top / \sqrt{d_k}\big)}

  • where Q represents the query, K represents the key, V represents the value, M represents the dynamic mask matrix, S represents the (mask-weighted) softmax function, i indexes the i-th query in Q, j indexes the keys in K, and d_k represents the vector dimension of the K vectors.
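  • The sketch below implements the masked attention function A_M written above and stacks DMAN, SAN, and FFN in the order described, using a learnable per-position sigmoid gate as one plausible (assumed) parameterisation of the dynamic mask; the patent's exact mask construction is not reproduced here.
```python
import math
import torch
import torch.nn as nn

def masked_attention(Q, K, V, M):
    """A_M(Q, K, V): attention whose exponentiated scores are gated by the mask matrix M.
    An all-ones M recovers plain SAN; a learned mask in (0, 1) plays the role of DMAN."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    scores = scores - scores.max(dim=-1, keepdim=True).values    # numerical stability
    weights = M * torch.exp(scores)
    weights = weights / weights.sum(dim=-1, keepdim=True)        # mask-weighted softmax
    return weights @ V

class ImprovedTTransformerLayer(nn.Module):
    """DMAN -> SAN -> FFN stacked in sequence, with a GELU-activated FFN."""
    def __init__(self, d_model=256, d_ff=1024, max_len=128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learnable logits for the dynamic mask; an assumed parameterisation.
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                                        # x: (seq_len, d_model)
        n = x.size(0)
        dyn_mask = torch.sigmoid(self.mask_logits[:n, :n])       # DMAN: adjustable neighbourhood
        x = x + masked_attention(self.q(x), self.k(x), self.v(x), dyn_mask)   # DMAN sub-layer
        x = x + masked_attention(self.q(x), self.k(x), self.v(x),
                                 torch.ones(n, n))               # SAN sub-layer (all-ones mask)
        return x + self.ffn(x)                                   # FFN sub-layer
```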
  • the constructed model structure is shown in Figure 6.
  • the model is mainly divided into two parts, one is the video input module (the upper part of Figure 6), and the other is the text input module (the lower part of Figure 6).
  • Video and text have a one-to-one correspondence in the training set; since the data is well labeled, they can be considered as information in the same space. Model training is now needed so that video and text can be expressed uniformly as information in the same space.
  • the improved T-Transformer model includes a video T-Transformer model and a text T-Transformer model.
  • the parameters of the video T-Transformer model and the text T-Transformer model are independent of each other.
  • The parameters of the multiple video T-Transformer models are shared, and the parameters of the multiple text T-Transformer models are likewise shared.
  • There are N+1 video T-Transformer models: one video T-Transformer model receives the video global coding information, and the remaining N video T-Transformer models receive the video local coding information; all N+1 models output their processed information respectively.
  • Similarly, there are N+1 text T-Transformer models: one text T-Transformer model receives the text global coding information, and the remaining N text T-Transformer models receive the text local coding information; these N+1 models output their processed information respectively.
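  • A sketch of the parameter sharing on the video side, with a standard TransformerEncoder standing in for the improved video T-Transformer: a single module instance is applied to the global coding information and to each of the N pieces of local coding information, which is what sharing parameters across the N+1 video T-Transformers amounts to; the text side mirrors this with its own, independently parameterised module.
```python
import torch
import torch.nn as nn

D = 256
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
video_t_transformer = nn.TransformerEncoder(layer, num_layers=2)   # one shared instance

def process_video_side(video_global_code, video_local_codes):
    """video_global_code: (1, T, D) video global coding information;
    video_local_codes: list of N tensors (1, t_i, D), one per video segment."""
    global_info = video_t_transformer(video_global_code)              # global information
    local_info = [video_t_transformer(c) for c in video_local_codes]  # local information
    return global_info, local_info
```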
  • Each T-Transformer model is connected to an Attention-FA module, and the Attention-FA modules include global processing modules and local processing modules;
  • the aforementioned step S130 includes but is not limited to the following steps:
  • Step S131 input the global information into the global processing module to obtain global features, which include video global features and text global features;
  • Step S132 Input the local information into the local processing module to obtain local features.
  • the local features include video local features and text local features.
  • The video T-Transformer model used to process the video global coding information and the text T-Transformer model used to process the text global coding information are connected to the global processing module, while the video T-Transformer models used to process the video local coding information and the text T-Transformer models used to process the text local coding information are connected to the local processing module.
  • The Attention-FA module is used to perform attention processing on the video and on the video segments respectively. The specific method is as follows (the equations themselves are not reproduced here): K^T is the transpose matrix of K; W_1 is a learnable matrix with offset b_1; the Q matrix is multiplied by a learnable matrix W_2, and after the corresponding bias b_2 is added, the attention weight A is obtained through the softmax function.
  • The video local features described above can be expressed as a set of vectors, one per segment, where n is the number of segments; the video global feature can be expressed as g_v; the text local features can likewise be expressed as a set of vectors, where m is the number of sentences; and the text global feature can be expressed as g_p.
  • the global processing module includes a video global Attention-FA module and a text global Attention-FA module
  • the local processing module includes N video local Attention-FA modules and N text local Attention-FA modules, where N is a positive integer.
  • The connection of these modules is shown in Figure 6.
  • The essence of the Attention mechanism is an addressing process: given a task-related query vector q, the attention value is computed by calculating an attention distribution over the keys and applying it to the values. This is how the Attention mechanism eases the complexity of the neural network model: it is not necessary to feed all N pieces of input information to the neural network; only task-related information needs to be selected from the input X and fed to the network.
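  • A minimal sketch of an Attention-FA-style aggregation consistent with the description above: learnable projections W_1 (with offset b_1) and W_2 (with bias b_2) produce a softmax attention weight A over the input features, which is then used to aggregate them into one vector. The tanh nonlinearity and the exact wiring are assumptions.
```python
import torch
import torch.nn as nn

class AttentionFA(nn.Module):
    """Attention-based feature aggregation over a sequence of features K."""
    def __init__(self, d_model=256, d_hidden=128):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # learnable matrix W_1 with offset b_1
        self.w2 = nn.Linear(d_hidden, 1)         # learnable matrix W_2 with bias b_2

    def forward(self, K):                        # K: (seq_len, d_model)
        Q = torch.tanh(self.w1(K))               # intermediate Q (tanh is an assumption)
        A = torch.softmax(self.w2(Q), dim=0)     # attention weight A via softmax
        return (A * K).sum(dim=0)                # weighted aggregation into one feature vector
```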
  • In step S140, the global features and the local features are used as a common input to the Contextual Transformer model, and video features and text features are obtained through feature splicing.
  • Referring to Figure 8, this may include the following steps:
  • Step S141: input the local features as the Local Context to the preset Transformer model, and perform a maximum pooling operation on the output to obtain the local feature vector F_local;
  • Step S142: input the global features as the Global Context to the preset Transformer model to obtain the global feature vector F_cross;
  • Step S143: perform feature splicing on F_local and F_cross to obtain the video features and the text features.
  • The Transformer model is composed of the multi-head self-attention network and the feedforward neural network, and adopts the shortcut structure of residual networks to alleviate the degradation problem in deep learning.
  • An additional vector, the Positional Encoding, is added to the model input. This vector can determine the position of the current information, or the distance between different words in a sentence, so that the model better interprets the order of the input sequence.
  • The video local features are input to the preset Transformer model respectively, and max pooling (taking the maximum value in each dimension) is performed on the output to obtain the video local feature vector; the text local features are input to the preset Transformer model respectively, and max pooling is performed on the output to obtain the text local feature vector. The video local feature vector and the text local feature vector are used as the local feature vector F_local.
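  • The sketch below follows steps S141 to S143 with a standard TransformerEncoder assumed as the "preset Transformer model": the local features are passed through it and max-pooled into F_local, the global features produce F_cross, and the two are spliced into the final feature. Dimensions and module choices are illustrative.
```python
import torch
import torch.nn as nn

D = 256
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
preset_transformer = nn.TransformerEncoder(layer, num_layers=1)

def contextual_fusion(local_feats, global_feat):
    """local_feats: (k, D) local features (video clips or text sentences);
    global_feat: (1, D) the corresponding global feature g_v or g_p."""
    # Local Context -> preset Transformer -> max pooling over positions -> F_local.
    out = preset_transformer(local_feats.unsqueeze(0))                   # (1, k, D)
    f_local = out.max(dim=1).values                                      # (1, D)
    # Global Context -> preset Transformer -> F_cross.
    f_cross = preset_transformer(global_feat.unsqueeze(0)).squeeze(0)    # (1, D)
    # Feature splicing of F_local and F_cross.
    return torch.cat([f_local, f_cross], dim=-1)                         # (1, 2D)

video_feature = contextual_fusion(torch.randn(4, D), torch.randn(1, D))
text_feature = contextual_fusion(torch.randn(6, D), torch.randn(1, D))
```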
  • the improved T-Transformer model can be optimized based on the output results. Specifically:
  • Step S161 construct a loss function based on the video features and text features output by the Contextual Transformer model
  • Step S162 use the loss function to optimize the improved T-Transformer model.
  • The loss function is denoted L(P, N, Δ) (its equation is not reproduced here), where:
  • x represents the video feature output by the Contextual Transformer model, and y represents the text feature output by the Contextual Transformer model;
  • x′ and y′ represent negative samples of x and y respectively; a negative sample pair, written (x′, y) or (x, y′), means that the video and the text in the current pair do not come from corresponding data, i.e. the video comes from another sample's video or the text comes from another sample's text;
  • N in L(P, N, Δ) denotes the set of negative sample pairs, and P denotes the set of positive sample pairs; a positive sample pair is written (x, y) and indicates that the video and the text in the current sample come from corresponding data;
  • Δ is a constant parameter acting as a conversion factor: if it is set too large, the model parameters will not converge; if it is set too small, the model will learn slowly. In this embodiment, Δ is 0.2.
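  • The loss equation itself is not reproduced in this text; given the definitions above (positive pairs P, negative pairs N, and the constant Δ = 0.2), a bidirectional max-margin ranking loss is one consistent reading, and the sketch below assumes that form rather than reproducing the patent's exact formula.
```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(x, y, x_neg, y_neg, delta=0.2):
    """x, y: matched video/text features output by the Contextual Transformer;
    x_neg, y_neg: features of mismatched (negative-sample) videos and texts.
    Encourages the positive pair (x, y) to score at least `delta` higher than
    the negative pairs (x_neg, y) and (x, y_neg)."""
    pos = F.cosine_similarity(x, y, dim=-1)
    neg_v = F.cosine_similarity(x_neg, y, dim=-1)   # video taken from another sample
    neg_t = F.cosine_similarity(x, y_neg, dim=-1)   # text taken from another sample
    return (torch.clamp(delta - pos + neg_v, min=0) +
            torch.clamp(delta - pos + neg_t, min=0)).mean()
```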
  • The video features and text features output by the model can be used directly for similarity calculation.
  • For example, cosine similarity can be used to compare the video and the text and to perform operations such as retrieval: if a video has no text or tags, only the video itself, the corresponding video can still be retrieved by entering text.
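  • As an illustration of such retrieval, the routine below ranks pre-computed video features against the feature of a query text by cosine similarity; all names and shapes are placeholders.
```python
import torch
import torch.nn.functional as F

def retrieve_top_k(text_feature, video_features, k=5):
    """text_feature: (D,) feature of the query text produced by the trained model;
    video_features: (num_videos, D) features of the candidate videos.
    Returns (scores, indices) of the k most similar videos."""
    sims = F.cosine_similarity(text_feature.unsqueeze(0), video_features, dim=-1)
    return torch.topk(sims, k=min(k, video_features.size(0)))
```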
  • Referring to Figure 10, the training samples can be labeled as follows:
  • Step S171: label the video, and use the labeled video as the training video;
  • Step S172: label the text information in the same way as the video, and use the labeled text information as the training text information.
  • the video and text can be converted into the same comparison space, and the similarity of two different things can be calculated. Even if the video does not have any tags or titles, you can still search the video by entering text to match the corresponding target video.
  • An embodiment of the present application provides a video and text similarity determination device, which includes:
  • an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feedforward neural network FFN;
  • a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  • an embodiment of the present application also provides an electronic device.
  • the electronic device 2000 includes: a memory 2002, a processor 2001, and a computer program stored on the memory 2002 and executable on the processor 2001.
  • the processor 2001 and the memory 2002 may be connected through a bus or other means.
  • the non-transitory software programs and instructions required to implement the video and text similarity determination method of the above embodiment are stored in the memory 2002.
  • When the processor 2001 executes the non-transitory software programs and instructions, the video and text similarity determination method applied to the device in the above embodiment is executed, for example, the above-described method steps S110 to S140 in Figure 1, method steps S111 to S112 in Figure 2, method steps S1111 to S1112 in Figure 3, method steps S1121 to S1122 in Figure 4, method steps S1131 to S1132 in Figure 5, method steps S131 to S132 in Figure 7, method steps S141 to S143 in Figure 8, method steps S161 to S162 in Figure 9, and method steps S171 to S172 in Figure 10.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present application also provides a computer-readable storage medium, which stores a computer program.
  • The computer program, when executed by a processor or controller, for example by one of the processors in the above electronic device embodiment, can cause that processor to execute the video and text similarity determination method in the above embodiment, for example, the above-described method steps S110 to S140 in Figure 1, method steps S111 to S112 in Figure 2, method steps S1111 to S1112 in Figure 3, method steps S1121 to S1122 in Figure 4, method steps S1131 to S1132 in Figure 5, method steps S131 to S132 in Figure 7, method steps S141 to S143 in Figure 8, method steps S161 to S162 in Figure 9, and method steps S171 to S172 in Figure 10.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • The present application may be used in a variety of general-purpose or special-purpose computer device environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, microprocessor-based devices, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above devices or equipment.
  • The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
  • each block in the flow chart or block diagram may represent a module, program segment, or part of the code.
  • the above module, program segment, or part of the code contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • The units involved in the embodiments of this application can be implemented in software or in hardware, and the described units can also be provided in a processor; in some cases, the names of these units do not constitute a limitation on the units themselves.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, server, touch terminal, or network device) to execute the method according to the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application relates to the field of artificial intelligence, and provides a method and apparatus for determining the similarity between a video and text, an electronic device, and a storage medium. The method comprises: obtaining a video and corresponding text information, and performing encoding on the video and the text information to obtain encoding feature information; inputting the encoding feature information into an improved T-Transformer model to obtain global information and local information; respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features; inputting the global features and the local features as common input into a Contextual Transformer model, and obtaining a video feature and a text feature by means of feature merging; and determining the similarity between the video and the text information according to the video feature and the text feature. By converting videos and text into a same comparison space, the similarity between two different things is calculated, so that a target video is obtained according to text matching.

Description

Video and text similarity determination method, apparatus, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 9, 2022, with application number 202210234257.8 and entitled "Video and text similarity determination method, device, electronic equipment, storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and in particular relates to a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
Background Art
Images/videos and text information do not belong to the same expression space, so it is difficult to compare their similarity; for example, it is difficult to retrieve a video by inputting text.
Technical Problem
The following is a technical problem of the prior art that the inventors are aware of: video models such as 3D ImageNet are used to extract video features, BERT (Bidirectional Encoder Representation from Transformers) is used to extract text features, and cosine similarity is then computed, evaluating the similarity of the video features and the text features by the cosine of the angle between them. However, this approach lacks interpretability and is incorrect from a rigorous scientific point of view; a new way of calculating the similarity between video and text is therefore needed.
Technical Solution
In a first aspect, embodiments of the present application provide a method for determining the similarity between a video and text, including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
In a second aspect, embodiments of the present application provide an apparatus for determining the similarity between a video and text, including:
an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
In a third aspect, embodiments of the present application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements a method for determining the similarity between a video and text, the method including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program being used to execute a method for determining the similarity between a video and text, the method including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
Beneficial Effects
The embodiments of the present application have at least the following beneficial effects: a video and corresponding text information are obtained, and the video and the text information are encoded to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; the coding feature information is input into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN; the global information and the local information are input into the corresponding Attention-FA modules respectively to obtain global features and local features; the global features and the local features are used as a common input into the Contextual Transformer model, and video features and text features are obtained through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and the similarity between the video and the text information is determined according to the video features and the text features. In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated; for example, even if a video has no tags or title, it can still be retrieved by entering text.
Additional features and advantages of the application will be set forth in the description that follows and will in part be apparent from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and obtained by the structures particularly pointed out in the specification, the claims, and the appended drawings.
Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solutions of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation thereof.
Figure 1 is a flow chart of a method for determining similarity between video and text provided by an embodiment of the present application;
Figure 2 is a flow chart of the encoding process provided by another embodiment of the present application;
Figure 3 is a flow chart of dividing the video and the text information provided by another embodiment of the present application;
Figure 4 is a flow chart of encoding the video segments and the text segments provided by another embodiment of the present application;
Figure 5 is a flow chart of encoding the video and the text information provided by another embodiment of the present application;
Figure 6 is a flow chart of the model connection relationship provided by another embodiment of the present application;
Figure 7 is a flow chart of obtaining the global features and the local features provided by another embodiment of the present application;
Figure 8 is a structural diagram of obtaining the video features and the text features through feature splicing provided by another embodiment of the present application;
Figure 9 is a flow chart of optimizing the improved T-Transformer model provided by another embodiment of the present application;
Figure 10 is a flow chart of annotating training samples provided by another embodiment of the present application;
Figure 11 is a structural diagram of a similarity determination apparatus provided by another embodiment of the present application;
Figure 12 is a device diagram of an electronic device provided by another embodiment of the present application.
Embodiments of the Invention
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the schematic diagrams of the apparatus and a logical order is shown in the flow charts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus or in an order different from that in the flow charts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a method, an apparatus, an electronic device, and a storage medium for determining similarity between video and text. The method includes: obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a Dynamic Mask Attention Network (DMAN), a Self-Attention Network (SAN), and a Feedforward Neural Network (FFN); inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features; taking the global features and the local features as a joint input to a Contextual Transformer model and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features. In this way, the video and the text are mapped into the same comparison space so that the similarity between two different kinds of data can be computed; for example, even if a video carries no tags or titles, it can still be retrieved by entering text.
The embodiments of the present application can acquire and process relevant data based on artificial intelligence technology. Artificial Intelligence (AI) refers to the theories, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The terminal mentioned in the embodiments of the present application may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer, a smart home device, a wearable electronic device, a VR (Virtual Reality)/AR (Augmented Reality) device, and the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
It should be noted that the data in the embodiments of the present application may be stored in a server. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies usually include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
As shown in Figure 1, Figure 1 is a flow chart of a method for determining similarity between video and text provided by an embodiment of the present application. The method includes, but is not limited to, the following steps:
Step S110: obtain a video and corresponding text information, and encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
Step S120: input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN, and a feed-forward neural network FFN;
Step S130: input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
Step S140: take the global features and the local features as a joint input to a Contextual Transformer model, and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
Step S150: determine the similarity between the video and the text information according to the video features and the text features.
The model design of this method takes the global information and the local information of the video and the text into account jointly, uses a cooperative hierarchical Transformer architecture to encode the two kinds of information, and improves on the conventional Transformer, yielding the Temporal Transformer (T-Transformer) and the Contextual Transformer. The embodiments of the present application are applied to judging the similarity between a video and text information. When training the model of this method, the video used for training and the text information corresponding to that video are determined according to step S110; after the video and the text information are encoded, they pass through the T-Transformer model, the Attention-FA module, and the Contextual Transformer model in sequence, and the video features and the text features are output, thereby providing a reference for the similarity between the video and the text information. In some cases, the trained model can be used to retrieve an associated video according to a text query.
Specifically, referring to Figure 2, obtaining the coding feature information in step S110 may be implemented through the following steps:
Step S111: segment the video and the text information to obtain N video segments and N text segments, where each video segment corresponds to one text segment and N is a positive integer;
Step S112: encode the video segments and the text segments respectively to obtain the video local coding information and the text local coding information;
Step S113: encode the video and the text information respectively to obtain the video global coding information and the text global coding information.
In order to obtain the video local coding information, the video global coding information, the text local coding information, and the text global coding information, two kinds of processing are performed on the video and the text information. In the first, the video as a whole is encoded to obtain the video global coding information, and the text information as a whole is encoded to obtain the text global coding information. In the second, the video is divided into N video segments that serve as local information and are encoded separately, yielding N pieces of video local coding information. Because the text information corresponds to the video, after the video is segmented the corresponding text information is also divided according to the video segmentation, yielding N text segments, each consisting of several text sentences; the N text segments are encoded separately, yielding N pieces of text local coding information.
It can be understood that the video may be divided in different ways, for example, divided evenly into N parts according to the duration or the total number of frames of the video, or divided according to differences in content (for example, N segments obtained on the basis of a division into the opening, the first act, the second act, the third act, the ending, and so on of a film). Correspondingly, after the video is segmented, the text sentences corresponding to each video segment are determined and taken as a text segment. Therefore, referring to Figure 3, the step of determining the text segments after segmenting the video (step S111) may be implemented as follows:
Step S1111: clip the video into N video segments according to a preset segmentation manner;
Step S1112: extract several text sentences from each video segment as the text segment corresponding to that video segment.
The preset segmentation manner may be the even division described above or a further division according to content differences. It may be obtained by manual clipping followed by annotation, or automatically based on a corresponding software function. After the video segments are obtained by manual clipping, the text segments corresponding to the video segments can be further extracted; when the video segments are obtained automatically, the corresponding text segments can be extracted directly (for example, according to the start time and end time of a video segment, the text sentences between the start time and the end time are extracted from the timeline of the subtitle file).
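As a concrete illustration of the automatic case, the sketch below splits a video evenly into N clips by duration and collects, for each clip, the subtitle sentences whose timestamps fall inside that clip. It is a minimal Python sketch; the subtitle representation (a list of (start, end, sentence) tuples) and the helper names are assumptions made here for illustration, not part of the embodiment.

```python
from typing import List, Tuple

# A subtitle entry: (start_seconds, end_seconds, sentence). This flat form is an
# assumption; a real subtitle file (e.g. SRT) would first be parsed into it.
Subtitle = Tuple[float, float, str]

def split_evenly(duration: float, n: int) -> List[Tuple[float, float]]:
    """Split [0, duration] into n equal (start, end) windows."""
    step = duration / n
    return [(i * step, (i + 1) * step) for i in range(n)]

def align_subtitles(windows: List[Tuple[float, float]],
                    subtitles: List[Subtitle]) -> List[List[str]]:
    """For each clip window, keep the sentences whose midpoint lies inside it."""
    segments = []
    for start, end in windows:
        sentences = [text for (s, e, text) in subtitles
                     if start <= (s + e) / 2 < end]
        segments.append(sentences)
    return segments

# Example: a 90-second video divided into N = 3 clips.
subs = [(2.0, 4.5, "A man opens the door."),
        (31.0, 34.0, "He walks into the kitchen."),
        (62.0, 65.0, "He pours a cup of coffee.")]
windows = split_evenly(90.0, 3)
print(align_subtitles(windows, subs))
# -> [['A man opens the door.'], ['He walks into the kitchen.'], ['He pours a cup of coffee.']]
```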
The process of encoding to obtain the video local coding information and the text local coding information and the process of encoding to obtain the video global coding information and the text global coding information may be performed separately. Referring to Figure 4, the encoding of the video segments and the text segments proceeds as follows:
Step S1121: extract image frames from a video segment and encode the image frames with a video encoder to obtain the video local coding information corresponding to that video segment;
Step S1122: input the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local coding information corresponding to that text segment.
Referring to Figure 5, the encoding of the video and the text information proceeds as follows:
Step S1131: input the video into a video encoder for encoding to obtain the video global coding information;
Step S1132: input the text information into a text encoder for encoding to obtain the text global coding information.
For the video segments and the text segments: to obtain the video representation of a video segment, frames are first extracted from the video segment and then encoded by the video encoder, converting the image representation into a video representation, so that the video local coding information of each video segment is obtained. Likewise, to obtain the text representation, the text sentences corresponding to a text segment are input into the text encoder for encoding, and the text local coding information of each text sentence is obtained. The video encoder corresponds to a video coding matrix and the text encoder corresponds to a text coding matrix, and the initial values of both matrices are random.
For the video and the text information: to obtain the video representation, the complete video is input into the video encoder for encoding, yielding the video global coding information; the complete text information is input into the text encoder for encoding, yielding the text global coding information. The video encoder/text encoder used for global encoding here may be shared with the video encoder/text encoder used for local encoding described above, or they may be independent of each other.
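The description above only states that the encoders correspond to randomly initialized coding matrices; their exact form is not given. The NumPy sketch below therefore shows one plausible minimal reading: a linear projection with a randomly initialized matrix that maps pre-extracted frame features and token features into a shared coding dimension, with mean pooling giving the clip-level or sentence-level code. The feature dimensions and the pooling choice are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearEncoder:
    """A coding matrix with random initial values, as described above."""
    def __init__(self, in_dim: int, out_dim: int):
        self.W = rng.normal(0, 0.02, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def encode(self, feats: np.ndarray) -> np.ndarray:
        """feats: (num_items, in_dim) -> (num_items, out_dim)."""
        return feats @ self.W + self.b

    def encode_pooled(self, feats: np.ndarray) -> np.ndarray:
        """Mean-pool the item codes into a single vector (assumed pooling)."""
        return self.encode(feats).mean(axis=0)

d_model = 256
video_encoder = LinearEncoder(in_dim=2048, out_dim=d_model)  # frame features, e.g. CNN outputs
text_encoder = LinearEncoder(in_dim=300, out_dim=d_model)    # token features, e.g. word vectors

frames = rng.normal(size=(16, 2048))   # frames sampled from one video clip
tokens = rng.normal(size=(12, 300))    # tokens of one text sentence

clip_code = video_encoder.encode_pooled(frames)      # video local coding information
sentence_code = text_encoder.encode_pooled(tokens)   # text local coding information
print(clip_code.shape, sentence_code.shape)          # (256,) (256,)
```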
It is worth noting that the improved T-Transformer model includes several attention networks, and each attention network is formed by stacking one DMAN, one SAN, and one FFN in sequence. Specifically, for the processing of the video and the text here, each layer of a conventional Transformer consists of two parts, a self-attention network (SAN) and a feed-forward neural network (FFN), and most current research takes these two parts apart and enhances them separately.
A feed-forward neural network is a type of artificial neural network. It adopts a unidirectional multi-layer structure in which each layer contains several neurons. In such a network, each neuron receives signals from the neurons of the previous layer and produces an output to the next layer. Layer 0 is called the input layer, the last layer is called the output layer, and the other intermediate layers are called hidden layers. There may be one hidden layer or multiple hidden layers.
SAN and FFN can both be regarded as belonging to a broader class of neural network structures, Mask Attention Networks (MANs), in which the mask matrices are static. A static mask, however, limits the model's ability to model local information. Intuitively, because the mask matrix of the FFN is an identity matrix, the FFN can only access its own information and not that of its neighbors. In the SAN, every token can access the information of all other tokens in the sentence, so words outside the neighborhood may also receive a considerable attention score. The SAN may therefore introduce noise into semantic modeling and overlook effective local signals.
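To make the static-mask view concrete, the short NumPy sketch below builds the fixed mask matrices discussed here for a toy sequence of length 6: the identity mask that characterizes the FFN (each position sees only itself), the all-ones mask that characterizes the SAN (each position sees every position), and a fixed local-window mask whose neighborhood size cannot adapt to the query token. The sequence length and window width are illustrative values only.

```python
import numpy as np

seq_len = 6

# FFN viewed as a mask attention network: the mask is the identity matrix,
# so position t can only attend to itself.
ffn_mask = np.eye(seq_len)

# SAN viewed as a mask attention network: the mask is all ones,
# so position t attends to every position, including distant ones.
san_mask = np.ones((seq_len, seq_len))

# A static "local window" mask (bandwidth 1) restricts attention to
# neighbours, but its window size is fixed and cannot follow the query token.
local_mask = np.tril(np.triu(np.ones((seq_len, seq_len)), k=-1), k=1)

print(ffn_mask)
print(san_mask)
print(local_mask)
```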
Obviously, a static mask matrix could be used to make the model consider only the words within a specific neighborhood, which would give a better local modeling effect, but this approach lacks flexibility. Considering that the size of the neighborhood should vary with the query token, the embodiments of the present application construct a strategy that adjusts the neighborhood size dynamically.
The formula for the dynamic mask is rendered as an image in the original document. In that expression, l is the index of the current layer, i is the index of the current attention head, t and s correspond to the positions of the query token and the key token respectively, σ is a constant, W^l and the other matrix variables appearing in the expression are learnable, and H^l denotes the set of attention heads under the multi-head attention mechanism.
In terms of stacking, MANs can take various structures. The embodiments of the present application model the data by stacking DMAN, SAN, and FFN in sequence. The Dynamic Mask Attention Network (DMAN) replaces the static mask of the SAN part with a dynamic mask matrix that adaptively adjusts the neighborhood size of each feature position, which better models local information. The FFN maps the attention result at each position into a feature space of larger dimension, applies the GELU (Gaussian Error Linear Units) activation function for non-linear filtering, and finally restores the original dimension, which strengthens the attention paid to a position's own feature information. Specifically, the encoded local coding information is input into the SAN module to obtain a weighted feature vector Z; this Z is the attention function A_M(Q, K, V) of the MAN, defined as follows:
A_M(Q, K, V) = S_M(Q, K) V
S_M(Q, K)_ij = M_ij · exp(Q_i K_j^T / √d_k) / Σ_j′ M_ij′ · exp(Q_i K_j′^T / √d_k)
where Q denotes the queries, K the keys, and V the values, the vector dimensions of Q, K, and V are the same, M denotes the dynamic mask matrix, S denotes the softmax function, i indexes the i-th query in Q, j indexes the j-th key in K, and d_k is the dimension of the key vectors.
Therefore, a set of attention weight values can be obtained from S_M(Q, K). At the same time, solving S_M(Q, K) with a fixed all-ones mask matrix recovers the special case of the mask attention network in which the mask is static and all ones.
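The following NumPy sketch implements the masked attention A_M(Q, K, V) = S_M(Q, K) V exactly as defined above, with the mask entries re-weighting the attention scores before normalization. The dynamic_mask helper alongside it is only one simple, assumed parameterization (a sigmoid of the relative distance between positions); the embodiment's exact dynamic mask formula is rendered as an image in the original and is not reproduced here.

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """A_M(Q, K, V) = S_M(Q, K) V, with S_M(Q, K)_ij proportional to
    M_ij * exp(Q_i K_j^T / sqrt(d_k))."""
    d_k = K.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(d_k)) * M          # numerator terms
    S = scores / scores.sum(axis=-1, keepdims=True)      # row-wise normalization
    return S @ V

def dynamic_mask(seq_len, scale=1.5):
    """Illustrative soft mask: closer key positions get weights nearer to 1.
    This parameterization is an assumption, not the patent's exact DMAN formula."""
    t = np.arange(seq_len)[:, None]
    s = np.arange(seq_len)[None, :]
    return 1.0 / (1.0 + np.exp(np.abs(t - s) / scale - 2.0))  # sigmoid(2 - |t - s| / scale)

rng = np.random.default_rng(0)
L, d = 6, 8
Q = rng.normal(size=(L, d)); K = rng.normal(size=(L, d)); V = rng.normal(size=(L, d))

Z_dman = masked_attention(Q, K, V, dynamic_mask(L))       # dynamic mask
Z_san = masked_attention(Q, K, V, np.ones((L, L)))        # all-ones mask recovers the SAN
print(Z_dman.shape, Z_san.shape)  # (6, 8) (6, 8)
```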
The structure of the constructed model is shown in Figure 6. The model is mainly divided into two parts: a video input module (the upper half of Figure 6) and a text input module (the lower half of Figure 6). Videos and texts correspond one-to-one in the training set; since the data are annotated, they can be regarded as information in the same space. Model training is then needed so that the video and the text can be expressed uniformly in that shared space.
The improved T-Transformer model includes a video T-Transformer model and a text T-Transformer model, whose parameters are independent of each other; parameters are shared among the multiple video T-Transformer models, and parameters are shared among the multiple text T-Transformer models. As can be seen from Figure 6, there are N+1 video T-Transformer models, of which one receives the global video coding information and the remaining N receive the local video coding information, and each of the N+1 models outputs its processed information. Likewise, there are N+1 text T-Transformer models, of which one receives the global text coding information and the remaining N receive the local text coding information, and each of the N+1 models outputs its processed information.
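A minimal sketch of the branch layout described here: one shared video module is applied to the global video code and to each of the N local video codes, and one independently parameterized text module is applied to the global text code and to each of the N local text codes. The toy ToyTransformer (a single linear map plus nonlinearity) is an assumed stand-in for the T-Transformer; only the parameter-sharing pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyTransformer:
    """Stand-in for a T-Transformer: one shared weight matrix per modality."""
    def __init__(self, dim):
        self.W = rng.normal(0, 0.02, size=(dim, dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)

d, N = 256, 4
video_branch = ToyTransformer(d)   # shared by all N+1 video inputs
text_branch = ToyTransformer(d)    # shared by all N+1 text inputs, independent of video_branch

video_global = rng.normal(size=d)
video_locals = [rng.normal(size=d) for _ in range(N)]
text_global = rng.normal(size=d)
text_locals = [rng.normal(size=d) for _ in range(N)]

# The same module (same parameters) processes the global code and every local code.
video_outputs = [video_branch(video_global)] + [video_branch(v) for v in video_locals]
text_outputs = [text_branch(text_global)] + [text_branch(t) for t in text_locals]
print(len(video_outputs), len(text_outputs))  # 5 5  (N + 1 outputs per modality)
```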
The output of each T-Transformer model is the input of an Attention-FA (attention-aware feature aggregation) module. As shown in Figure 6, each T-Transformer model is connected to one Attention-FA module, and the Attention-FA modules include a global processing module and a local processing module. Referring to Figure 7, the aforementioned step S130 includes, but is not limited to, the following steps:
Step S131: input the global information into the global processing module to obtain the global features, where the global features include video global features and text global features;
Step S132: input the local information into the local processing module to obtain the local features, where the local features include video local features and text local features.
The video T-Transformer model that processes the global video coding information and the text T-Transformer model that processes the global text coding information are each connected to a global processing module; the video T-Transformer models that process the local video coding information and the text T-Transformer models that process the local text coding information are each connected to a local processing module. Through the processing of the Attention-FA modules, the corresponding video global features, text global features, video local features, and text local features are obtained.
On the video side, the Attention-FA module performs attention processing on the video and on the video segments separately. The specific method is as follows:
Generate two random learnable matrices W_1 and W_2, together with corresponding bias terms b_1 and b_2;
Denote the output of the T-Transformer model as K, where one part of the output corresponds to the video and the other part corresponds to the text (the symbols for these parts are rendered as images in the original document). The matrices corresponding to the video and to the text are then computed with the GELU activation as follows:
Q = GELU(W_1 K^T + b_1), with K = x
A = softmax(W_2 Q + b_2)^T
where GELU is the Gaussian Error Linear Units activation function and K^T is the transpose of K. K^T is multiplied by the learnable matrix W_1 and the bias b_1 is added, and the result is passed through the GELU activation to obtain the matrix Q; the matrix Q is then multiplied by the learnable matrix W_2, the corresponding bias b_2 is added, and the attention weights A are obtained through the softmax function.
After processing by the Attention-FA modules, the following are obtained: the video local features (one feature vector per video segment, with n the number of segments), the video global feature g_v, the text local features (one feature vector per text segment, with m the number of sentences), and the text global feature g_p. The video global feature, the text global feature, the video local features, and the text local features are then input into the Contextual Transformer model for feature splicing.
It can be understood that the global processing module includes one video global Attention-FA module and one text global Attention-FA module, and the local processing module includes N video local Attention-FA modules and N text local Attention-FA modules, where N is a positive integer. The connection pattern of the modules can be seen in Figure 6. The essence of the Attention mechanism is an addressing process: given a task-related query vector q, the attention distribution over the keys is computed and applied to the values to obtain the Attention Value. This reflects how the Attention mechanism alleviates the complexity of a neural network model: instead of feeding all N pieces of input information into the network for computation, only the task-related information selected from X needs to be input into the network.
In the above step S140, the global features and the local features are taken as a joint input to the Contextual Transformer model, and the video features and the text features are obtained through feature splicing. Referring to Figure 8, this may specifically include the following steps:
Step S141: input the local features, as the Local Context, into a preset Transformer model, and perform a max pooling operation on the output to obtain the Local feature vector F_local;
Step S142: input the global features, as the Global Context, into the preset Transformer model to obtain the Global feature vector F_cross;
Step S143: perform feature splicing on F_local and F_cross to obtain the video features and the text features.
To further strengthen the ability of the Transformer to capture different text features and video features, the global features of the video and the text information are input into a Transformer model of conventional structure (obtained by presetting). This Transformer model consists of a self-attention network with a multi-head attention mechanism and a feed-forward neural network, and adopts the short-cut structure of residual networks to address the degradation problem in deep learning. Specifically:
The video global feature g_v is input into the preset Transformer model to obtain a video global feature vector; the text global feature g_p is input into the preset Transformer model to obtain a text global feature vector; the video global feature vector and the text global feature vector serve as the Global feature vector F_cross.
Before the local information is input into the preset Transformer model, an additional Positional Encoding vector is added at the model input. This vector determines the position of the current piece of information, or the distance between different words in a sentence, and thereby allows the constructed model to better interpret the order of the input sequence.
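The embodiment does not specify the form of the Positional Encoding vector, so the sketch below uses the common sinusoidal scheme as one assumed possibility: each position is mapped to a fixed vector of sines and cosines at different frequencies, which is added to the input features so that the model can recover order and relative distance.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sine/cosine positional encodings, shape (seq_len, d_model).
    This particular scheme is an assumption; the embodiment only states that a
    positional vector is added at the input."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions
    return pe

# Add the encoding to a sequence of local features before the preset Transformer.
seq_len, d_model = 10, 256
local_features = np.random.default_rng(0).normal(size=(seq_len, d_model))
local_with_position = local_features + sinusoidal_positional_encoding(seq_len, d_model)
print(local_with_position.shape)  # (10, 256)
```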
Afterwards, the video local features are each input into the preset Transformer model, and max pooling (taking the maximum value along each dimension) is applied to the outputs to obtain the video local feature vector; the text local features are likewise each input into the preset Transformer model, and max pooling is applied to the outputs to obtain the text local feature vector. The video local feature vector and the text local feature vector serve as the Local feature vector F_local.
F_local and F_cross are spliced to obtain the video features and the text features, denoted by the video feature υ and the text feature δ respectively.
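Putting steps S141 to S143 together, the sketch below runs the local features of one branch through a stand-in for the preset Transformer, max-pools the result into F_local, takes the transformed global feature as F_cross, and concatenates the two into the final feature (υ for the video branch, δ for the text branch). The toy stand-in transformer and the concatenation order are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def preset_transformer(x, W):
    """Toy stand-in for the preset Transformer: one linear map plus nonlinearity."""
    return np.tanh(x @ W)

def branch_feature(local_feats, global_feat, W):
    """local_feats: (num_items, d); global_feat: (d,). Returns the spliced feature."""
    F_local = preset_transformer(local_feats, W).max(axis=0)   # max pooling over items
    F_cross = preset_transformer(global_feat[None, :], W)[0]   # global context
    return np.concatenate([F_local, F_cross])                  # feature splicing

d, n_segments, m_sentences = 256, 4, 7
W = rng.normal(0, 0.02, size=(d, d))

video_locals, g_v = rng.normal(size=(n_segments, d)), rng.normal(size=d)
text_locals, g_p = rng.normal(size=(m_sentences, d)), rng.normal(size=d)

upsilon = branch_feature(video_locals, g_v, W)   # video feature
delta = branch_feature(text_locals, g_p, W)      # text feature
print(upsilon.shape, delta.shape)  # (512,) (512,)
```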
Referring to Figure 9, after the above video feature υ and text feature δ are obtained, the improved T-Transformer model can be optimized according to the output. Specifically:
Step S161: construct a loss function according to the video features and the text features output by the Contextual Transformer model;
Step S162: optimize the improved T-Transformer model with the loss function, which is expressed as follows:
The overall training objective, which aggregates the pairwise loss L(P, N, α) over the training samples, is rendered as an image in the original document; the pairwise loss and the distance function are defined as:
L(P, N, α) = max(0, α + D(x, y) − D(x′, y)) + max(0, α + D(x, y) − D(x, y′))
D(x, y) = 1 − x^T y / (‖x‖‖y‖)
where x denotes a video feature output by the Contextual Transformer model, y denotes a text feature output by the Contextual Transformer model, and x′ and y′ denote negative samples of x and y respectively. A negative sample pair, written (x′, y) or (x, y′), means that the video and the text in the current pair do not come from the same data item: (x′, y) means the video comes from another sample, and (x, y′) means the text comes from another sample; such pairs are denoted by N in L(P, N, α). A positive sample pair, written (x, y), means that the video and the text in the current pair come from the same data item and is denoted by P in L(P, N, α). α is a constant parameter serving as a conversion (margin) factor: if it is set too large, the model parameters will not converge, and if it is set too small, the model will learn slowly; in this embodiment α is set to 0.2.
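The sketch below computes the pairwise loss exactly as defined above: D is the cosine distance, the positive pair is (x, y), the negatives are (x′, y) and (x, y′), and the margin α is 0.2 as in this embodiment. Averaging this quantity over a batch of annotated pairs (the aggregation whose equation is shown as an image in the original) is an assumed but natural choice.

```python
import numpy as np

def cosine_distance(x, y):
    """D(x, y) = 1 - x^T y / (||x|| ||y||)."""
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pair_loss(x, y, x_neg, y_neg, alpha=0.2):
    """L(P, N, alpha) for one positive pair (x, y) and its negatives x', y'."""
    d_pos = cosine_distance(x, y)
    return (max(0.0, alpha + d_pos - cosine_distance(x_neg, y)) +
            max(0.0, alpha + d_pos - cosine_distance(x, y_neg)))

rng = np.random.default_rng(0)
d = 512
x, y = rng.normal(size=d), rng.normal(size=d)          # matched video / text features
x_neg, y_neg = rng.normal(size=d), rng.normal(size=d)  # features taken from other samples

print(pair_loss(x, y, x_neg, y_neg))  # single-pair loss

# Assumed batch aggregation: average the pair losses over a training batch.
batch = [(rng.normal(size=d), rng.normal(size=d),
          rng.normal(size=d), rng.normal(size=d)) for _ in range(8)]
print(np.mean([pair_loss(*p) for p in batch]))
```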
After the model has been trained on manually annotated data, the video feature υ and the text feature δ output by the model can be used directly for similarity calculation, for example cosine similarity, so that the degree of similarity between a video and a text can be compared and operations such as retrieval can be performed. For example, even when a video has no text or tags at all, only the video itself, the corresponding video can be retrieved by entering text. That is:
Obtain the text to be retrieved;
Input the text to be retrieved into the optimized improved T-Transformer model to obtain a target video that matches the text to be retrieved.
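Once the features live in the same space, retrieval reduces to ranking videos by cosine similarity against the text query, as sketched below. The pre-computed video feature matrix and the placeholder encode_text function are assumptions standing in for the trained model's text branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(query: str, d: int = 512) -> np.ndarray:
    """Placeholder for the trained model's text branch (assumption for illustration)."""
    rng_q = np.random.default_rng(abs(hash(query)) % (2 ** 32))
    return rng_q.normal(size=d)

def retrieve(query: str, video_features: np.ndarray, top_k: int = 3):
    """Rank videos by cosine similarity between the text feature and each video feature."""
    q = encode_text(query)
    q = q / np.linalg.norm(q)
    v = video_features / np.linalg.norm(video_features, axis=1, keepdims=True)
    scores = v @ q                           # cosine similarity per video
    order = np.argsort(-scores)[:top_k]      # highest similarity first
    return list(zip(order.tolist(), scores[order].tolist()))

video_bank = rng.normal(size=(100, 512))     # features of an untagged video library
print(retrieve("a man pours a cup of coffee", video_bank))
```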
Referring to Figure 10, before the video and the corresponding text information are obtained, the following steps may also be included:
Step S171: annotate the video, and take the annotated video as the video used for training;
Step S172: annotate the text information in the same manner as the video is annotated, and take the annotated text information as the text information used for training.
Through the above steps, the video and the text can be converted into the same comparison space, and the similarity of two different kinds of data can be computed. Even if a video has no tags, titles, or other metadata, video retrieval can still be performed by entering text, so that the corresponding target video is matched.
In addition, referring to Figure 11, an embodiment of the present application provides an apparatus for determining similarity between video and text, the apparatus including:
an acquisition unit, configured to obtain a video and corresponding text information and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
a first processing unit, configured to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN, and a feed-forward neural network FFN;
a second processing unit, configured to input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
a context processing unit, configured to take the global features and the local features as a joint input to a Contextual Transformer model and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
In addition, referring to Figure 12, an embodiment of the present application further provides an electronic device. The electronic device 2000 includes a memory 2002, a processor 2001, and a computer program stored in the memory 2002 and executable on the processor 2001.
The processor 2001 and the memory 2002 may be connected through a bus or in other ways.
The non-transitory software program and instructions required to implement the video and text similarity determination method of the above embodiments are stored in the memory 2002. When they are executed by the processor 2001, the video and text similarity determination method applied to the device in the above embodiments is performed, for example, the method steps S110 to S140 in Figure 1, the method steps S111 to S112 in Figure 2, the method steps S1111 to S1112 in Figure 3, the method steps S1121 to S1122 in Figure 4, the method steps S1131 to S1132 in Figure 5, the method steps S131 to S132 in Figure 7, the method steps S141 to S143 in Figure 8, the method steps S161 to S162 in Figure 9, and the method steps S171 to S172 in Figure 10 described above.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor or a controller, for example by a processor in the above electronic device embodiment, the processor is caused to perform the video and text similarity determination method of the above embodiments, for example, the method steps S110 to S140 in Figure 1, the method steps S111 to S112 in Figure 2, the method steps S1111 to S1112 in Figure 3, the method steps S1121 to S1122 in Figure 4, the method steps S1131 to S1132 in Figure 5, the method steps S131 to S132 in Figure 7, the method steps S141 to S143 in Figure 8, the method steps S161 to S162 in Figure 9, and the method steps S171 to S172 in Figure 10 described above. The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or some of the steps and apparatuses in the methods disclosed above may be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory storage media) and communication storage media (or transitory storage media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable storage media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other storage medium that can be used to store the desired information and that can be accessed by a computer. In addition, it is well known to those of ordinary skill in the art that communication storage media usually carry computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery storage media.
The present application can be used in numerous general-purpose or special-purpose computer apparatus environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor apparatuses, microprocessor-based apparatuses, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above apparatuses or devices. The present application may be described in the general context of computer programs executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to various embodiments of the present application. Each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more programs for implementing the specified logical functions. It should also be noted that in some alternative implementations the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based apparatus that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware, and the described units may also be provided in a processor. The names of these units do not, in certain cases, constitute a limitation on the units themselves.
It should be noted that although several modules or units of the device for executing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here may be implemented by software, or by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, or the like) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the embodiments disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in the present application.
It should be understood that the present application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
The preferred implementations of the present application have been specifically described above, but the present application is not limited to the above embodiments. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A method for determining similarity between a video and text, comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
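Purely as an illustration of the pipeline recited in claim 1, the following minimal sketch (assuming PyTorch) wires generic encoder, attention and Transformer modules in the claimed order; every class name, dimension and pooling choice below is a hypothetical stand-in, not the claimed implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoTextSimilarity(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)   # stand-in for a video encoder head
        self.text_proj = nn.Linear(text_dim, dim)     # stand-in for a text encoder head
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Generic encoder standing in for the improved T-Transformer (DMAN + SAN + FFN stack)
        self.t_transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Generic attention standing in for the Attention-FA global/local processing
        self.attention_fa = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Generic encoder standing in for the Contextual Transformer
        self.contextual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1
        )

    def encode_stream(self, x):
        h = self.t_transformer(x)           # global/local information
        h, _ = self.attention_fa(h, h, h)   # global/local features
        h = self.contextual(h)              # contextual fusion
        return h.max(dim=1).values          # pooled feature vector (pooling choice assumed)

    def forward(self, frame_feats, token_feats):
        video_feature = self.encode_stream(self.video_proj(frame_feats))
        text_feature = self.encode_stream(self.text_proj(token_feats))
        # Final step of the claim: similarity determined from video and text features
        return F.cosine_similarity(video_feature, text_feature, dim=-1)


if __name__ == "__main__":
    model = VideoTextSimilarity()
    frames = torch.randn(2, 16, 2048)   # 2 videos, 16 frame-level features each
    tokens = torch.randn(2, 32, 768)    # 2 texts, 32 token-level features each
    print(model(frames, tokens))        # one similarity score per video-text pair

The generic nn.TransformerEncoder above merely marks where the DMAN/SAN/FFN stack and the Contextual Transformer would sit; the claim does not constrain them to these modules.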
  2. The method for determining similarity between a video and text according to claim 1, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  3. The method for determining similarity between a video and text according to claim 2, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  4. The method for determining similarity between a video and text according to claim 2, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  5. The method for determining similarity between a video and text according to claim 2, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
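As a reading aid for the segmentation and encoding steps of claims 2–5, a small sketch follows; the uniform split, the video_encoder/text_encoder callables and the toy usage are assumptions introduced only for illustration.

def encode_pair(video_frames, sentences, n_segments, video_encoder, text_encoder):
    """Split a video and its sentences into N aligned segments and encode them
    locally and globally, in the spirit of claims 2-5 (uniform split assumed)."""
    seg_len = max(1, len(video_frames) // n_segments)
    sent_len = max(1, len(sentences) // n_segments)
    video_local, text_local = [], []
    for i in range(n_segments):
        clip = video_frames[i * seg_len:(i + 1) * seg_len]      # i-th video segment
        seg_sents = sentences[i * sent_len:(i + 1) * sent_len]  # matching text segment
        video_local.append(video_encoder(clip))                 # video local encoding information
        text_local.append(text_encoder(" ".join(seg_sents)))    # text local encoding information
    video_global = video_encoder(video_frames)                  # video global encoding information
    text_global = text_encoder(" ".join(sentences))             # text global encoding information
    return video_local, text_local, video_global, text_global


# Toy usage with trivial stand-in encoders (a real system would use pretrained models)
print(encode_pair(
    video_frames=list(range(12)),
    sentences=["a cat", "sits down", "and sleeps"],
    n_segments=3,
    video_encoder=lambda frames: float(len(frames)),
    text_encoder=lambda text: float(len(text.split())),
))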
  6. The method for determining similarity between a video and text according to claim 1, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
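The claims leave the internal form of the Attention-FA global and local processing modules open; one plausible reading is an attention-weighted pooling over the sequence, sketched below under that assumption (PyTorch, hypothetical names).

import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Scores every position and returns an attention-weighted sum over the sequence."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                              # x: (batch, seq_len, dim)
        weights = torch.softmax(self.score(x), dim=1)  # attention weights over positions
        return (weights * x).sum(dim=1)                # pooled features: (batch, dim)


# One instance per branch, mirroring the claim: a global module and a local module
global_module, local_module = AttentionPool(), AttentionPool()
global_features = global_module(torch.randn(2, 20, 512))  # video/text global features
local_features = local_module(torch.randn(2, 20, 512))    # video/text local features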
  7. The method for determining similarity between a video and text according to claim 1, wherein the inputting the global features and the local features as a joint input into the Contextual Transformer model and obtaining video features and text features through feature splicing comprises:
    inputting the local features into a preset Transformer model as a Local Context, and performing a max-pooling operation on the output to obtain a Local feature vector F_local;
    inputting the global features into the preset Transformer model as a Global Context to obtain a Global feature vector F_cross; and
    performing feature splicing on F_local and F_cross to obtain the video features and the text features.
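A sketch of the claim-7 fusion, assuming PyTorch: a generic nn.TransformerEncoder stands in for the "preset Transformer model", F_local comes from max-pooling the Local Context output, F_cross from the Global Context output (the pooling used for F_cross is an assumption), and the two are spliced by concatenation.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
# Generic encoder standing in for the "preset Transformer model" of the claim
preset_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1
)

def fuse(local_context, global_context):
    f_local = preset_transformer(local_context).max(dim=1).values  # max-pooled F_local
    f_cross = preset_transformer(global_context).mean(dim=1)       # F_cross (pooling assumed)
    return torch.cat([f_local, f_cross], dim=-1)                   # feature splicing

video_features = fuse(torch.randn(2, 16, dim), torch.randn(2, 1, dim))   # per-video result
text_features = fuse(torch.randn(2, 16, dim), torch.randn(2, 1, dim))    # per-text result
similarity = F.cosine_similarity(video_features, text_features, dim=-1)  # final step of claim 1
print(similarity)

Splicing by concatenation keeps both the segment-level detail carried by F_local and the whole-sequence context carried by F_cross in the resulting video and text features.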
  8. An apparatus for determining similarity between a video and text, comprising:
    an acquisition unit, configured to obtain a video and corresponding text information, and encode the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    a first processing unit, configured to input the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    a second processing unit, configured to input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    a context processing unit, configured to input the global features and the local features as a joint input into a Contextual Transformer model, and obtain video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method for determining similarity between a video and text, the method comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
  10. The electronic device according to claim 9, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  11. The electronic device according to claim 10, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  12. The electronic device according to claim 10, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  13. The electronic device according to claim 10, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
  14. The electronic device according to claim 9, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
  15. A computer-readable storage medium storing a computer program, wherein the computer program is used to execute a method for determining similarity between a video and text, the method comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
  16. The computer-readable storage medium according to claim 15, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  17. The computer-readable storage medium according to claim 16, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  18. The computer-readable storage medium according to claim 16, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  19. The computer-readable storage medium according to claim 16, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
  20. The computer-readable storage medium according to claim 15, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
PCT/CN2022/090656 2022-03-09 2022-04-29 Method and apparatus for determining similarity between video and text, electronic device, and storage medium WO2023168818A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210234257.8A CN114612826A (en) 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium
CN202210234257.8 2022-03-09

Publications (1)

Publication Number Publication Date
WO2023168818A1 true WO2023168818A1 (en) 2023-09-14

Family

ID=81862202

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090656 WO2023168818A1 (en) 2022-03-09 2022-04-29 Method and apparatus for determining similarity between video and text, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114612826A (en)
WO (1) WO2023168818A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334468A1 (en) * 2018-01-08 2020-10-22 Samsung Electronics Co., Ltd. Display apparatus, server, system and information-providing methods thereof
CN113204674A (en) * 2021-07-05 2021-08-03 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN114048351A (en) * 2021-11-08 2022-02-15 湖南大学 Cross-modal text-video retrieval method based on space-time relationship enhancement

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108419094B (en) * 2018-03-05 2021-01-29 腾讯科技(深圳)有限公司 Video processing method, video retrieval method, device, medium and server
US11238093B2 (en) * 2019-10-15 2022-02-01 Adobe Inc. Video retrieval based on encoding temporal relationships among video frames
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113779361A (en) * 2021-08-27 2021-12-10 华中科技大学 Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN114037945A (en) * 2021-12-10 2022-02-11 浙江工商大学 Cross-modal retrieval method based on multi-granularity feature interaction


Also Published As

Publication number Publication date
CN114612826A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
Surís et al. Cross-modal embeddings for video and audio retrieval
Garcia et al. A dataset and baselines for visual question answering on art
CN111079532A (en) Video content description method based on text self-encoder
CN110569359B (en) Training and application method and device of recognition model, computing equipment and storage medium
CN114565104A (en) Language model pre-training method, result recommendation method and related device
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN110795944A (en) Recommended content processing method and device, and emotion attribute determining method and device
CN113204633B (en) Semantic matching distillation method and device
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN114048351A (en) Cross-modal text-video retrieval method based on space-time relationship enhancement
CN113705315A (en) Video processing method, device, equipment and storage medium
CN114155477B (en) Semi-supervised video paragraph positioning method based on average teacher model
Cornia et al. A unified cycle-consistent neural model for text and image retrieval
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN114330514A (en) Data reconstruction method and system based on depth features and gradient information
Jia et al. Semantic association enhancement transformer with relative position for image captioning
Cornia et al. Towards cycle-consistent models for text and image retrieval
CN112132075A (en) Method and medium for processing image-text content
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
Chen et al. Video captioning via sentence augmentation and spatio-temporal attention
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Gutiérrez et al. Bimodal neural style transfer for image generation based on text prompts
Li et al. Text-guided dual-branch attention network for visual question answering

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22930447

Country of ref document: EP

Kind code of ref document: A1