WO2023168818A1 - Method and apparatus for determining similarity between video and text, electronic device, and storage medium - Google Patents

Method and apparatus for determining similarity between video and text, electronic device, and storage medium

Info

Publication number
WO2023168818A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
information
features
global
Prior art date
Application number
PCT/CN2022/090656
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023168818A1 publication Critical patent/WO2023168818A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
  • Images/videos and text information do not belong to the same expression space, so it is difficult to compare their similarity; for example, it is difficult to retrieve a video by inputting text.
  • Embodiments of the present application provide a method for determining the similarity between a video and text, including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide an apparatus for determining the similarity between a video and text, including:
  • an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements a method for determining the similarity between a video and text, the method including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • Embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program being used to execute a method for determining the similarity between a video and text, the method including:
  • obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
  • inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • determining the similarity between the video and the text information according to the video features and the text features.
  • The embodiments of the present application have at least the following beneficial effects: a video and corresponding text information are obtained, and the video and the text information are encoded to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; the coding feature information is input into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN; the global information and the local information are input into the corresponding Attention-FA modules respectively to obtain global features and local features; the global features and the local features are used as a common input into the Contextual Transformer model, and video features and text features are obtained through feature splicing, where the video features correspond to the video and the text features correspond to the text information; the similarity between the video and the text information is determined according to the video features and the text features.
  • In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated. For example, even if a video has no tags or title, it can still be retrieved by entering text.
  • Figure 1 is a flow chart of a method for determining video and text similarity provided by an embodiment of the present application
  • Figure 2 is a flow chart of encoding processing provided by another embodiment of the present application.
  • Figure 3 is a flow chart for dividing video and text information provided by another embodiment of the present application.
  • Figure 4 is a flow chart of encoding video clips and text segments provided by another embodiment of the present application.
  • Figure 5 is a flow chart of encoding video and text information provided by another embodiment of the present application.
  • Figure 6 is a flow chart of model connection relationships provided by another embodiment of the present application.
  • Figure 7 is a flow chart for obtaining global features and local features provided by another embodiment of the present application.
  • Figure 8 is a structural diagram of video features and text features obtained through feature splicing provided by another embodiment of the present application.
  • Figure 9 is a flow chart of optimizing the improved T-Transformer model provided by another embodiment of the present application.
  • Figure 10 is a flow chart for labeling training samples provided by another embodiment of the present application.
  • Figure 11 is a structural diagram of a similarity determination device provided by another embodiment of the present application.
  • Figure 12 is a device diagram of an electronic device provided by another embodiment of the present application.
  • This application provides a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
  • The method includes: obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the Dynamic Mask Attention Network (DMAN), the Self-Attention Network (SAN), and the Feedforward Neural Network (FFN); inputting the global information and the local information into the corresponding Attention-FA modules respectively to obtain global features and local features; inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features.
  • In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated. For example, even if a video has no tags or title, it can still be retrieved by entering text.
  • The embodiments of this application can obtain and process relevant data based on artificial intelligence (AI) technology.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • The terminal mentioned in the embodiments of this application may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted computer, a smart home device, a wearable electronic device, a VR (Virtual Reality) or AR (Augmented Reality) device, and the like.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
  • The server may also be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution networks, big data, and artificial intelligence platforms.
  • Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • Figure 1 is a flow chart of a video and text similarity determination method provided by an embodiment of the present application.
  • the video and text similarity determination method includes but is not limited to the following steps:
  • Step S110: obtain a video and corresponding text information, and encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • Step S120: input the coding feature information into the improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feedforward neural network FFN;
  • Step S130: input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • Step S140: use the global features and the local features as a common input to the Contextual Transformer model, and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • Step S150: determine the similarity between the video and the text information based on the video features and the text features.
  • The modeling idea of this method is to comprehensively consider the global information and the local information of the video and the text, to feature-encode the two types of information with a collaborative hierarchical transformer model, and to improve on the traditional transformer model, namely the Temporal Transformer (T-Transformer) and the Contextual Transformer.
  • The embodiment of the present application is applied to judging the similarity between a video and text information.
  • During training, the video used for training and the text information corresponding to the video are processed according to step S110; after passing through the T-Transformer model, the Attention-FA module, and the Contextual Transformer model in sequence, video features and text features are output, thereby providing a basis for judging the similarity between the video and the text information.
  • The trained model can then be used to retrieve associated videos based on text.
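  • To make the data flow of steps S110 to S150 concrete, a minimal end-to-end sketch follows. It is an illustration only: standard PyTorch components (linear encoders, a plain TransformerEncoder, mean pooling) stand in for the patent's encoders, improved T-Transformer, Attention-FA, and Contextual Transformer modules, which are described in more detail below, and all dimensions and feature extractors are assumptions.
```python
# Minimal sketch of steps S110-S150; all module choices and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256                                     # shared embedding dimension (assumed)
video_encoder = nn.Linear(2048, D)          # video encoding matrix, randomly initialised
text_encoder = nn.Linear(768, D)            # text encoding matrix, randomly initialised

def make_transformer(num_layers=2):
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

video_t = make_transformer()                # stand-in for the video T-Transformer
text_t = make_transformer()                 # stand-in for the text T-Transformer (independent parameters)
contextual = make_transformer(1)            # stand-in for the Contextual Transformer

def aggregate(t_transformer, x):
    """T-Transformer followed by a (simplified) Attention-FA-style aggregation."""
    return t_transformer(x.unsqueeze(0)).mean(dim=1)            # (1, D)

def forward_pair(frame_feats, clip_frame_feats, sent_feats, clip_sent_feats):
    """frame_feats: (T, 2048) frames of the whole video; clip_frame_feats: N tensors (t_i, 2048);
    sent_feats: (S, 768) sentences of the whole text; clip_sent_feats: N tensors (s_i, 768)."""
    # Step S110: global and local encoding for video and text.
    g_v = aggregate(video_t, video_encoder(frame_feats))         # from video global coding information
    g_p = aggregate(text_t, text_encoder(sent_feats))            # from text global coding information
    local_v = torch.cat([aggregate(video_t, video_encoder(c)) for c in clip_frame_feats], dim=0)
    local_p = torch.cat([aggregate(text_t, text_encoder(c)) for c in clip_sent_feats], dim=0)

    # Step S140: Contextual Transformer over the local context, max-pooled, then
    # spliced with the global features (F_local concatenated with F_cross).
    f_local_v = contextual(local_v.unsqueeze(0)).max(dim=1).values
    f_local_p = contextual(local_p.unsqueeze(0)).max(dim=1).values
    video_feature = torch.cat([f_local_v, g_v], dim=-1)
    text_feature = torch.cat([f_local_p, g_p], dim=-1)

    # Step S150: similarity between the video and the text information.
    return F.cosine_similarity(video_feature, text_feature, dim=-1)
```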
  • obtaining the encoding feature information in step S110 can be achieved through the following steps:
  • Step S111: segment the video and the text information to obtain N video segments and N text segments, where each video segment corresponds to one text segment and N is a positive integer;
  • Step S112: encode the video segments and the text segments respectively, to obtain the video local coding information and the text local coding information;
  • Step S113: encode the video and the text information respectively, to obtain the video global coding information and the text global coding information.
  • In order to obtain the video local coding information, the video global coding information, the text local coding information, and the text global coding information, two kinds of processing need to be performed on the video and the text information.
  • One is to encode the entire video to obtain the video global coding information, and to encode the text information as a whole to obtain the text global coding information.
  • The other is to divide the video into N video segments, which are used as local information and encoded separately, thereby obtaining N pieces of video local coding information. Since the text information corresponds to the video, after the video is segmented the corresponding text information is also divided according to the video segmentation, and N text segments are obtained. Each text segment is composed of several text sentences. The N text segments are encoded separately to obtain N pieces of text local coding information.
  • The video can be divided in different ways, for example by dividing it evenly into N parts according to duration or total number of frames, or by dividing it according to differences in content (such as the opening of a movie, the first section, the second section, the third section, the ending, and so on). Correspondingly, after segmenting the video, the text sentences corresponding to each video segment are determined and used as the text segments. Therefore, referring to Figure 3, the step of determining text segments after segmenting the video (step S111) can be implemented in the following manner:
  • Step S1111: divide the video into N video segments according to a preset segmentation method;
  • Step S1112: extract the text sentences in each video segment as the text segment corresponding to that video segment.
  • The preset segmentation method can be the above-mentioned even division, or a further division according to content differences. The segments can be annotated after manual editing, or the video can be divided automatically by corresponding software. After the video segments are edited manually, the text segments corresponding to the video segments can be further extracted; when the video segments are divided automatically, the text segments corresponding to the video segments can be extracted directly (for example, according to the start time and end time of a video segment, the text sentences between the start time and the end time are extracted from the timeline of the subtitle file).
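  • As one possible realisation of step S1112, the text sentences belonging to each clip can be pulled from a subtitle file by comparing timestamps; the subtitle representation and field names below are assumptions for illustration.
```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Subtitle:
    start: float   # seconds from the beginning of the video
    end: float
    text: str

def clip_text_segments(subtitles: List[Subtitle],
                       clip_bounds: List[Tuple[float, float]]) -> List[List[str]]:
    """clip_bounds is a list of (start, end) pairs, one per video clip, produced by
    the preset segmentation method (e.g. N equal-duration parts). Returns, for each
    clip, the subtitle sentences whose time span lies inside the clip, i.e. the text
    segment corresponding to that clip."""
    segments = []
    for start, end in clip_bounds:
        segments.append([s.text for s in subtitles
                         if s.start >= start and s.end <= end])
    return segments

# Example: split a 120-second video evenly into N = 4 clips of 30 s each.
N, duration = 4, 120.0
bounds = [(i * duration / N, (i + 1) * duration / N) for i in range(N)]
```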
  • The encoding process for obtaining the video local coding information and the text local coding information, and the encoding process for obtaining the video global coding information and the text global coding information, can be performed separately.
  • the encoding process of the segment is as follows:
  • Step S1121 extract image frames from the video clips, encode the image frames through the video encoder, and obtain video local coding information corresponding to the video clips;
  • Step S1122 Input the text segment corresponding to the video segment into the text encoder for encoding, and obtain the text local encoding information corresponding to the text segment.
  • Step S1131 input the video into the video encoder for encoding processing, and obtain the video global encoding information
  • Step S1132 input the text information into the text encoder for encoding processing, and obtain the text global encoding information.
  • For the video segments and text segments, in order to obtain the video representation of each video segment, frames are first extracted from the video segment and then encoded by the video encoder, converting the image representation into a video representation and obtaining the video local coding information of each video segment.
  • The text sentences corresponding to the text segments are input into the text encoder for encoding, and the text local coding information of each text sentence is obtained.
  • The video encoder corresponds to a video encoding matrix and the text encoder corresponds to a text encoding matrix; the initial values of both matrices are random.
  • the complete video is input into the video encoder for encoding processing to obtain the video global encoding information; the complete text information is input into the encoder for encoding to obtain the text global encoding information.
  • the video encoder/text encoder for global encoding here and the video encoder/text encoder for local encoding mentioned above may be multiplexed or may be independent of each other.
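  • A minimal sketch of the two-pass encoding of steps S1121 to S1132, assuming each encoder is a randomly initialised encoding matrix (a linear projection) as described above, applied once per clip/segment for the local pass and once over the whole video/text for the global pass; whether the local and global passes share encoders is left open, as in the text.
```python
import torch
import torch.nn as nn

D = 256                                          # common embedding dimension (assumed)
video_encoder = nn.Linear(2048, D, bias=False)   # video encoding matrix, random initial values
text_encoder = nn.Linear(768, D, bias=False)     # text encoding matrix, random initial values
# The same instances are reused below for the local and the global pass; per the
# description they could just as well be independent encoders.

def encode_local(clip_frames, clip_sentences):
    """clip_frames: (t, 2048) features of image frames extracted from one video clip;
    clip_sentences: (s, 768) features of the text sentences in the matching text segment."""
    video_local = video_encoder(clip_frames)      # video local coding information (step S1121)
    text_local = text_encoder(clip_sentences)     # text local coding information (step S1122)
    return video_local, text_local

def encode_global(all_frames, all_sentences):
    """all_frames / all_sentences cover the complete video and the complete text information."""
    video_global = video_encoder(all_frames)      # video global coding information (step S1131)
    text_global = text_encoder(all_sentences)     # text global coding information (step S1132)
    return video_global, text_global
```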
  • Each layer of the current Transformer is composed of two parts, namely the self-attention network (SAN) and the feedforward neural network (FFN). Most current research separates these two parts and enhances them individually.
  • Feedforward neural network is a type of artificial neural network.
  • Feedforward neural network adopts a unidirectional multi-layer structure. Each layer contains several neurons. In this kind of neural network, each neuron can receive the signal of the neuron of the previous layer and generate output to the next layer.
  • the 0th layer is called the input layer
  • the last layer is called the output layer
  • the other intermediate layers are called hidden layers.
  • the hidden layer can be one layer or multiple layers.
  • SAN and FFN essentially belong to a broader class of neural network structures, Mask Attention Networks (MANs), in which the mask matrices are static; however, the static mask approach limits the model.
  • FFN can only obtain a position's own information and cannot obtain the information of its neighbors.
  • In SAN, each token can obtain information about all other tokens in the sentence, so words that are not in the neighborhood may also receive a considerable attention score. SAN may therefore introduce noise into the semantic modeling process and ignore locally effective signals.
  • The dynamic mask matrix is computed per layer and per attention head (the defining equation is not reproduced here), where l is the index of the current layer, i is the index of the current attention head, t and s correspond to the positions of the query token and the key token respectively, a constant scaling factor is involved, the W^l are all learnable matrix variables, and H^l represents the set of attention heads under the multi-head attention mechanism.
  • MANs have various structures.
  • The embodiment of this application models by stacking DMAN, SAN, and FFN in sequence.
  • The Dynamic Mask Attention Network (DMAN) works in the SAN part: it uses a dynamic mask matrix to dynamically adjust the neighborhood size of each feature position, which better models local information. The FFN maps the attention result of each position to a larger-dimensional feature space, applies the GELU (Gaussian Error Linear Units) activation function for non-linear screening, and finally restores the original dimension, which improves the attention paid to each position's own feature information.
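  • As a small illustration of the FFN sub-layer just described, the sketch below expands each position to a larger feature space, applies GELU, and projects back to the original dimension; the dimensions are assumptions, not values from the patent.
```python
import torch.nn as nn

d_model, d_ff = 256, 1024      # illustrative dimensions
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # map each position's attention result to a larger feature space
    nn.GELU(),                 # non-linear screening with the GELU activation
    nn.Linear(d_ff, d_model),  # restore the original dimension
)
```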
  • The encoded local coding information is input into the SAN module to obtain a weighted feature vector Z. This Z is the MAN attention function A_M(Q, K, V), defined as

    A_M(Q, K, V)_i = \sum_j S_M(Q, K)_{ij} V_j, \quad S_M(Q, K)_{ij} = \frac{M_{ij}\,\exp\!\big(Q_i K_j^\top / \sqrt{d_k}\big)}{\sum_{j'} M_{ij'}\,\exp\!\big(Q_i K_{j'}^\top / \sqrt{d_k}\big)}

  • where Q represents the query, K represents the key, V represents the value, M represents the dynamic mask matrix, S represents the (mask-weighted) softmax function, i indexes the i-th query in Q, j indexes the keys in K, and d_k represents the vector dimension of the K vectors.
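  • The sketch below implements the masked attention function A_M written above and stacks DMAN, SAN, and FFN in the order described, using a learnable per-position sigmoid gate as one plausible (assumed) parameterisation of the dynamic mask; the patent's exact mask construction is not reproduced here.
```python
import math
import torch
import torch.nn as nn

def masked_attention(Q, K, V, M):
    """A_M(Q, K, V): attention whose exponentiated scores are gated by the mask matrix M.
    An all-ones M recovers plain SAN; a learned mask in (0, 1) plays the role of DMAN."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    scores = scores - scores.max(dim=-1, keepdim=True).values    # numerical stability
    weights = M * torch.exp(scores)
    weights = weights / weights.sum(dim=-1, keepdim=True)        # mask-weighted softmax
    return weights @ V

class ImprovedTTransformerLayer(nn.Module):
    """DMAN -> SAN -> FFN stacked in sequence, with a GELU-activated FFN."""
    def __init__(self, d_model=256, d_ff=1024, max_len=128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learnable logits for the dynamic mask; an assumed parameterisation.
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                                        # x: (seq_len, d_model)
        n = x.size(0)
        dyn_mask = torch.sigmoid(self.mask_logits[:n, :n])       # DMAN: adjustable neighbourhood
        x = x + masked_attention(self.q(x), self.k(x), self.v(x), dyn_mask)   # DMAN sub-layer
        x = x + masked_attention(self.q(x), self.k(x), self.v(x),
                                 torch.ones(n, n))               # SAN sub-layer (all-ones mask)
        return x + self.ffn(x)                                   # FFN sub-layer
```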
  • the constructed model structure is shown in Figure 6.
  • the model is mainly divided into two parts, one is the video input module (the upper part of Figure 6), and the other is the text input module (the lower part of Figure 6).
  • Video and text have a one-to-one correspondence in the training set; since the data is well labeled, they can be considered as information in the same space. Model training is now needed so that video and text can be expressed uniformly as information in the same space.
  • the improved T-Transformer model includes a video T-Transformer model and a text T-Transformer model.
  • the parameters of the video T-Transformer model and the text T-Transformer model are independent of each other.
  • The parameters of the multiple video T-Transformer models are shared, and the parameters of the multiple text T-Transformer models are likewise shared.
  • There are N+1 video T-Transformer models: one video T-Transformer model receives the video global coding information, and the remaining N video T-Transformer models receive the video local coding information; all N+1 models output their processed information respectively.
  • Similarly, there are N+1 text T-Transformer models: one text T-Transformer model receives the text global coding information, and the remaining N text T-Transformer models receive the text local coding information; these N+1 models output their processed information respectively.
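  • A sketch of the parameter sharing on the video side, with a standard TransformerEncoder standing in for the improved video T-Transformer: a single module instance is applied to the global coding information and to each of the N pieces of local coding information, which is what sharing parameters across the N+1 video T-Transformers amounts to; the text side mirrors this with its own, independently parameterised module.
```python
import torch
import torch.nn as nn

D = 256
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
video_t_transformer = nn.TransformerEncoder(layer, num_layers=2)   # one shared instance

def process_video_side(video_global_code, video_local_codes):
    """video_global_code: (1, T, D) video global coding information;
    video_local_codes: list of N tensors (1, t_i, D), one per video segment."""
    global_info = video_t_transformer(video_global_code)              # global information
    local_info = [video_t_transformer(c) for c in video_local_codes]  # local information
    return global_info, local_info
```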
  • Each T-Transformer model is connected to an Attention-FA module, and the Attention-FA modules include global processing modules and local processing modules;
  • the aforementioned step S130 includes but is not limited to the following steps:
  • Step S131 input the global information into the global processing module to obtain global features, which include video global features and text global features;
  • Step S132 Input the local information into the local processing module to obtain local features.
  • the local features include video local features and text local features.
  • The video T-Transformer model used to process the video global coding information and the text T-Transformer model used to process the text global coding information are connected to the global processing module, while the video T-Transformer models used to process the video local coding information and the text T-Transformer models used to process the text local coding information are connected to the local processing module.
  • The Attention-FA module is used to perform attention processing on the video and on the video segments respectively. The specific method is as follows (the equations themselves are not reproduced here): K^T is the transpose matrix of K; W_1 is a learnable matrix with offset b_1; the Q matrix is multiplied by a learnable matrix W_2, and after the corresponding bias b_2 is added, the attention weight A is obtained through the softmax function.
  • The video local features described above can be expressed as a set of vectors, one per segment, where n is the number of segments; the video global feature can be expressed as g_v; the text local features can likewise be expressed as a set of vectors, where m is the number of sentences; and the text global feature can be expressed as g_p.
  • the global processing module includes a video global Attention-FA module and a text global Attention-FA module
  • the local processing module includes N video local Attention-FA modules and N text local Attention-FA modules, where N is a positive integer.
  • The connection of these modules is shown in Figure 6.
  • The essence of the Attention mechanism is an addressing process: given a task-related query vector q, the attention value is computed by calculating an attention distribution over the keys and applying it to the values. This is how the Attention mechanism eases the complexity of the neural network model: it is not necessary to feed all N pieces of input information to the neural network; only task-related information needs to be selected from the input X and fed to the network.
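  • A minimal sketch of an Attention-FA-style aggregation consistent with the description above: learnable projections W_1 (with offset b_1) and W_2 (with bias b_2) produce a softmax attention weight A over the input features, which is then used to aggregate them into one vector. The tanh nonlinearity and the exact wiring are assumptions.
```python
import torch
import torch.nn as nn

class AttentionFA(nn.Module):
    """Attention-based feature aggregation over a sequence of features K."""
    def __init__(self, d_model=256, d_hidden=128):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # learnable matrix W_1 with offset b_1
        self.w2 = nn.Linear(d_hidden, 1)         # learnable matrix W_2 with bias b_2

    def forward(self, K):                        # K: (seq_len, d_model)
        Q = torch.tanh(self.w1(K))               # intermediate Q (tanh is an assumption)
        A = torch.softmax(self.w2(Q), dim=0)     # attention weight A via softmax
        return (A * K).sum(dim=0)                # weighted aggregation into one feature vector
```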
  • In step S140, the global features and the local features are used as a common input to the Contextual Transformer model, and video features and text features are obtained through feature splicing.
  • Referring to Figure 8, this may include the following steps:
  • Step S141: input the local features as the Local Context to the preset Transformer model, and perform a maximum pooling operation on the output to obtain the local feature vector F_local;
  • Step S142: input the global features as the Global Context to the preset Transformer model to obtain the global feature vector F_cross;
  • Step S143: perform feature splicing on F_local and F_cross to obtain the video features and the text features.
  • The Transformer model is composed of the multi-head self-attention network and the feedforward neural network, and adopts the shortcut structure of residual networks to alleviate the degradation problem in deep learning.
  • An additional vector, the Positional Encoding, is added to the model input. This vector can determine the position of the current information, or the distance between different words in a sentence, so that the model better interprets the order of the input sequence.
  • The video local features are input to the preset Transformer model respectively, and max pooling (taking the maximum value in each dimension) is performed on the output to obtain the video local feature vector; the text local features are input to the preset Transformer model respectively, and max pooling is performed on the output to obtain the text local feature vector. The video local feature vector and the text local feature vector are used as the local feature vector F_local.
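  • The sketch below follows steps S141 to S143 with a standard TransformerEncoder assumed as the "preset Transformer model": the local features are passed through it and max-pooled into F_local, the global features produce F_cross, and the two are spliced into the final feature. Dimensions and module choices are illustrative.
```python
import torch
import torch.nn as nn

D = 256
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
preset_transformer = nn.TransformerEncoder(layer, num_layers=1)

def contextual_fusion(local_feats, global_feat):
    """local_feats: (k, D) local features (video clips or text sentences);
    global_feat: (1, D) the corresponding global feature g_v or g_p."""
    # Local Context -> preset Transformer -> max pooling over positions -> F_local.
    out = preset_transformer(local_feats.unsqueeze(0))                   # (1, k, D)
    f_local = out.max(dim=1).values                                      # (1, D)
    # Global Context -> preset Transformer -> F_cross.
    f_cross = preset_transformer(global_feat.unsqueeze(0)).squeeze(0)    # (1, D)
    # Feature splicing of F_local and F_cross.
    return torch.cat([f_local, f_cross], dim=-1)                         # (1, 2D)

video_feature = contextual_fusion(torch.randn(4, D), torch.randn(1, D))
text_feature = contextual_fusion(torch.randn(6, D), torch.randn(1, D))
```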
  • the improved T-Transformer model can be optimized based on the output results. Specifically:
  • Step S161 construct a loss function based on the video features and text features output by the Contextual Transformer model
  • Step S162 use the loss function to optimize the improved T-Transformer model.
  • The loss function is denoted L(P, N, Δ) (its equation is not reproduced here), where:
  • x represents the video feature output by the Contextual Transformer model, and y represents the text feature output by the Contextual Transformer model;
  • x′ and y′ represent negative samples of x and y respectively; a negative sample pair, written (x′, y) or (x, y′), means that the video and the text in the current pair do not come from corresponding data, i.e. the video comes from another sample's video or the text comes from another sample's text;
  • N in L(P, N, Δ) denotes the set of negative sample pairs, and P denotes the set of positive sample pairs; a positive sample pair is written (x, y) and indicates that the video and the text in the current sample come from corresponding data;
  • Δ is a constant parameter acting as a conversion factor: if it is set too large, the model parameters will not converge; if it is set too small, the model will learn slowly. In this embodiment, Δ is 0.2.
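  • The loss equation itself is not reproduced in this text; given the definitions above (positive pairs P, negative pairs N, and the constant Δ = 0.2), a bidirectional max-margin ranking loss is one consistent reading, and the sketch below assumes that form rather than reproducing the patent's exact formula.
```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(x, y, x_neg, y_neg, delta=0.2):
    """x, y: matched video/text features output by the Contextual Transformer;
    x_neg, y_neg: features of mismatched (negative-sample) videos and texts.
    Encourages the positive pair (x, y) to score at least `delta` higher than
    the negative pairs (x_neg, y) and (x, y_neg)."""
    pos = F.cosine_similarity(x, y, dim=-1)
    neg_v = F.cosine_similarity(x_neg, y, dim=-1)   # video taken from another sample
    neg_t = F.cosine_similarity(x, y_neg, dim=-1)   # text taken from another sample
    return (torch.clamp(delta - pos + neg_v, min=0) +
            torch.clamp(delta - pos + neg_t, min=0)).mean()
```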
  • The video features and text features output by the model can be used directly for similarity calculation.
  • For example, cosine similarity can be used to compare the video and the text and to perform operations such as retrieval: if a video has no text or tags, only the video itself, the corresponding video can still be retrieved by entering text.
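  • As an illustration of such retrieval, the routine below ranks pre-computed video features against the feature of a query text by cosine similarity; all names and shapes are placeholders.
```python
import torch
import torch.nn.functional as F

def retrieve_top_k(text_feature, video_features, k=5):
    """text_feature: (D,) feature of the query text produced by the trained model;
    video_features: (num_videos, D) features of the candidate videos.
    Returns (scores, indices) of the k most similar videos."""
    sims = F.cosine_similarity(text_feature.unsqueeze(0), video_features, dim=-1)
    return torch.topk(sims, k=min(k, video_features.size(0)))
```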
  • Referring to Figure 10, the training samples can be labeled as follows:
  • Step S171: label the video, and use the labeled video as the training video;
  • Step S172: label the text information in the same way as the video, and use the labeled text information as the training text information.
  • the video and text can be converted into the same comparison space, and the similarity of two different things can be calculated. Even if the video does not have any tags or titles, you can still search the video by entering text to match the corresponding target video.
  • An embodiment of the present application provides a video and text similarity determination device, which includes:
  • an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
  • a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feedforward neural network FFN;
  • a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
  • a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
  • a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  • an embodiment of the present application also provides an electronic device.
  • the electronic device 2000 includes: a memory 2002, a processor 2001, and a computer program stored on the memory 2002 and executable on the processor 2001.
  • the processor 2001 and the memory 2002 may be connected through a bus or other means.
  • the non-transitory software programs and instructions required to implement the video and text similarity determination method of the above embodiment are stored in the memory 2002.
  • When the processor 2001 executes the non-transitory software programs and instructions, the video and text similarity determination method applied to the device in the above embodiment is executed, for example, the above-described method steps S110 to S140 in Figure 1, method steps S111 to S112 in Figure 2, method steps S1111 to S1112 in Figure 3, method steps S1121 to S1122 in Figure 4, method steps S1131 to S1132 in Figure 5, method steps S131 to S132 in Figure 7, method steps S141 to S143 in Figure 8, method steps S161 to S162 in Figure 9, and method steps S171 to S172 in Figure 10.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present application also provides a computer-readable storage medium, which stores a computer program.
  • The computer program, when executed by a processor or controller, for example by one of the processors in the above electronic device embodiment, can cause that processor to execute the video and text similarity determination method in the above embodiment, for example, the above-described method steps S110 to S140 in Figure 1, method steps S111 to S112 in Figure 2, method steps S1111 to S1112 in Figure 3, method steps S1121 to S1122 in Figure 4, method steps S1131 to S1132 in Figure 5, method steps S131 to S132 in Figure 7, method steps S141 to S143 in Figure 8, method steps S161 to S162 in Figure 9, and method steps S171 to S172 in Figure 10.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • The present application may be used in a variety of general-purpose or special-purpose computer device environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, microprocessor-based devices, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above devices or equipment.
  • The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
  • each block in the flow chart or block diagram may represent a module, program segment, or part of the code.
  • the above module, program segment, or part of the code contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • The units involved in the embodiments of this application can be implemented in software or in hardware, and the described units can also be provided in a processor; in some cases, the names of these units do not constitute a limitation on the units themselves.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, server, touch terminal, or network device) to execute the method according to the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application relates to the field of artificial intelligence, and provides a method and apparatus for determining the similarity between a video and text, an electronic device, and a storage medium. The method comprises: obtaining a video and corresponding text information, and performing encoding on the video and the text information to obtain encoding feature information; inputting the encoding feature information into an improved T-Transformer model to obtain global information and local information; respectively inputting the global information and the local information into corresponding Attention-FA modules to obtain global features and local features; inputting the global features and the local features as common input into a Contextual Transformer model, and obtaining a video feature and a text feature by means of feature merging; and determining the similarity between the video and the text information according to the video feature and the text feature. By converting videos and text into a same comparison space, the similarity between two different things is calculated, so that a target video is obtained according to text matching.

Description

Video and text similarity determination method, apparatus, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 9, 2022, with application number 202210234257.8 and entitled "Video and text similarity determination method, device, electronic equipment, storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and in particular relates to a method, apparatus, electronic device, and storage medium for determining the similarity between video and text.
Background Art
Images/videos and text information do not belong to the same expression space, so it is difficult to compare their similarity; for example, it is difficult to retrieve a video by inputting text.
Technical Problem
The following is a technical problem of the prior art that the inventors are aware of: video models such as 3D ImageNet are used to extract video features, BERT (Bidirectional Encoder Representation from Transformers) is used to extract text features, and cosine similarity is then computed, evaluating the similarity of the video features and the text features by the cosine of the angle between them. However, this approach lacks interpretability and is incorrect from a rigorous scientific point of view; a new way of calculating the similarity between video and text is therefore needed.
Technical Solution
In a first aspect, embodiments of the present application provide a method for determining the similarity between a video and text, including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
In a second aspect, embodiments of the present application provide an apparatus for determining the similarity between a video and text, including:
an acquisition unit, used to acquire a video and corresponding text information, and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
a first processing unit, used to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
a second processing unit, used to input the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
a context processing unit, used to input the global features and the local features as a common input into the Contextual Transformer model, and to obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
In a third aspect, embodiments of the present application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements a method for determining the similarity between a video and text, the method including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program being used to execute a method for determining the similarity between a video and text, the method including:
obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN;
inputting the global information and the local information into the corresponding Attention-FA modules respectively, to obtain global features and local features;
inputting the global features and the local features as a common input into the Contextual Transformer model, and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
determining the similarity between the video and the text information according to the video features and the text features.
Beneficial Effects
The embodiments of the present application have at least the following beneficial effects: a video and corresponding text information are obtained, and the video and the text information are encoded to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; the coding feature information is input into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking the dynamic mask attention network DMAN, the self-attention network SAN, and the feed-forward neural network FFN; the global information and the local information are input into the corresponding Attention-FA modules respectively to obtain global features and local features; the global features and the local features are used as a common input into the Contextual Transformer model, and video features and text features are obtained through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and the similarity between the video and the text information is determined according to the video features and the text features. In this way, video and text can be converted into the same comparison space, and the similarity of two different kinds of things can be calculated; for example, even if a video has no tags or title, it can still be retrieved by entering text.
Additional features and advantages of the application will be set forth in the description that follows and will in part be apparent from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and obtained by the structures particularly pointed out in the specification, the claims, and the appended drawings.
Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solutions of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation thereof.
Figure 1 is a flow chart of a method for determining similarity between video and text provided by an embodiment of the present application;
Figure 2 is a flow chart of the encoding process provided by another embodiment of the present application;
Figure 3 is a flow chart of dividing the video and the text information provided by another embodiment of the present application;
Figure 4 is a flow chart of encoding the video segments and the text segments provided by another embodiment of the present application;
Figure 5 is a flow chart of encoding the video and the text information provided by another embodiment of the present application;
Figure 6 is a flow chart of the model connection relationship provided by another embodiment of the present application;
Figure 7 is a flow chart of obtaining the global features and the local features provided by another embodiment of the present application;
Figure 8 is a structural diagram of obtaining the video features and the text features through feature splicing provided by another embodiment of the present application;
Figure 9 is a flow chart of optimizing the improved T-Transformer model provided by another embodiment of the present application;
Figure 10 is a flow chart of annotating training samples provided by another embodiment of the present application;
Figure 11 is a structural diagram of a similarity determination apparatus provided by another embodiment of the present application;
Figure 12 is a device diagram of an electronic device provided by another embodiment of the present application.
Embodiments of the Invention
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the schematic diagrams of the apparatus and a logical order is shown in the flow charts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus or in an order different from that in the flow charts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a method, an apparatus, an electronic device, and a storage medium for determining similarity between video and text. The method includes: obtaining a video and corresponding text information, and encoding the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information; inputting the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a Dynamic Mask Attention Network (DMAN), a Self-Attention Network (SAN), and a Feedforward Neural Network (FFN); inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features; taking the global features and the local features as a joint input to a Contextual Transformer model and obtaining video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information; and determining the similarity between the video and the text information according to the video features and the text features. In this way, the video and the text are mapped into the same comparison space so that the similarity between two different kinds of data can be computed; for example, even if a video carries no tags or titles, it can still be retrieved by entering text.
The embodiments of the present application can acquire and process relevant data based on artificial intelligence technology. Artificial Intelligence (AI) refers to the theories, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The terminal mentioned in the embodiments of the present application may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer, a smart home device, a wearable electronic device, a VR (Virtual Reality)/AR (Augmented Reality) device, and the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
It should be noted that the data in the embodiments of the present application may be stored in a server. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies usually include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
As shown in Figure 1, Figure 1 is a flow chart of a method for determining similarity between video and text provided by an embodiment of the present application. The method includes, but is not limited to, the following steps:
Step S110: obtain a video and corresponding text information, and encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
Step S120: input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN, and a feed-forward neural network FFN;
Step S130: input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
Step S140: take the global features and the local features as a joint input to a Contextual Transformer model, and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
Step S150: determine the similarity between the video and the text information according to the video features and the text features.
The model design of this method takes the global information and the local information of the video and the text into account jointly, uses a cooperative hierarchical Transformer architecture to encode the two kinds of information, and improves on the conventional Transformer, yielding the Temporal Transformer (T-Transformer) and the Contextual Transformer. The embodiments of the present application are applied to judging the similarity between a video and text information. When training the model of this method, the video used for training and the text information corresponding to that video are determined according to step S110; after the video and the text information are encoded, they pass through the T-Transformer model, the Attention-FA module, and the Contextual Transformer model in sequence, and the video features and the text features are output, thereby providing a reference for the similarity between the video and the text information. In some cases, the trained model can be used to retrieve an associated video according to a text query.
Specifically, referring to Figure 2, obtaining the coding feature information in step S110 may be implemented through the following steps:
Step S111: segment the video and the text information to obtain N video segments and N text segments, where each video segment corresponds to one text segment and N is a positive integer;
Step S112: encode the video segments and the text segments respectively to obtain the video local coding information and the text local coding information;
Step S113: encode the video and the text information respectively to obtain the video global coding information and the text global coding information.
In order to obtain the video local coding information, the video global coding information, the text local coding information, and the text global coding information, two kinds of processing are performed on the video and the text information. In the first, the video as a whole is encoded to obtain the video global coding information, and the text information as a whole is encoded to obtain the text global coding information. In the second, the video is divided into N video segments that serve as local information and are encoded separately, yielding N pieces of video local coding information. Because the text information corresponds to the video, after the video is segmented the corresponding text information is also divided according to the video segmentation, yielding N text segments, each consisting of several text sentences; the N text segments are encoded separately, yielding N pieces of text local coding information.
It can be understood that the video may be divided in different ways, for example, divided evenly into N parts according to the duration or the total number of frames of the video, or divided according to differences in content (for example, N segments obtained on the basis of a division into the opening, the first act, the second act, the third act, the ending, and so on of a film). Correspondingly, after the video is segmented, the text sentences corresponding to each video segment are determined and taken as a text segment. Therefore, referring to Figure 3, the step of determining the text segments after segmenting the video (step S111) may be implemented as follows:
Step S1111: clip the video into N video segments according to a preset segmentation manner;
Step S1112: extract several text sentences from each video segment as the text segment corresponding to that video segment.
The preset segmentation manner may be the even division described above or a further division according to content differences. It may be obtained by manual clipping followed by annotation, or automatically based on a corresponding software function. After the video segments are obtained by manual clipping, the text segments corresponding to the video segments can be further extracted; when the video segments are obtained automatically, the corresponding text segments can be extracted directly (for example, according to the start time and end time of a video segment, the text sentences between the start time and the end time are extracted from the timeline of the subtitle file).
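As a concrete illustration of the automatic case, the sketch below splits a video evenly into N clips by duration and collects, for each clip, the subtitle sentences whose timestamps fall inside that clip. It is a minimal Python sketch; the subtitle representation (a list of (start, end, sentence) tuples) and the helper names are assumptions made here for illustration, not part of the embodiment.

```python
from typing import List, Tuple

# A subtitle entry: (start_seconds, end_seconds, sentence). This flat form is an
# assumption; a real subtitle file (e.g. SRT) would first be parsed into it.
Subtitle = Tuple[float, float, str]

def split_evenly(duration: float, n: int) -> List[Tuple[float, float]]:
    """Split [0, duration] into n equal (start, end) windows."""
    step = duration / n
    return [(i * step, (i + 1) * step) for i in range(n)]

def align_subtitles(windows: List[Tuple[float, float]],
                    subtitles: List[Subtitle]) -> List[List[str]]:
    """For each clip window, keep the sentences whose midpoint lies inside it."""
    segments = []
    for start, end in windows:
        sentences = [text for (s, e, text) in subtitles
                     if start <= (s + e) / 2 < end]
        segments.append(sentences)
    return segments

# Example: a 90-second video divided into N = 3 clips.
subs = [(2.0, 4.5, "A man opens the door."),
        (31.0, 34.0, "He walks into the kitchen."),
        (62.0, 65.0, "He pours a cup of coffee.")]
windows = split_evenly(90.0, 3)
print(align_subtitles(windows, subs))
# -> [['A man opens the door.'], ['He walks into the kitchen.'], ['He pours a cup of coffee.']]
```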
The process of encoding to obtain the video local coding information and the text local coding information and the process of encoding to obtain the video global coding information and the text global coding information may be performed separately. Referring to Figure 4, the encoding of the video segments and the text segments proceeds as follows:
Step S1121: extract image frames from a video segment and encode the image frames with a video encoder to obtain the video local coding information corresponding to that video segment;
Step S1122: input the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local coding information corresponding to that text segment.
Referring to Figure 5, the encoding of the video and the text information proceeds as follows:
Step S1131: input the video into a video encoder for encoding to obtain the video global coding information;
Step S1132: input the text information into a text encoder for encoding to obtain the text global coding information.
For the video segments and the text segments: to obtain the video representation of a video segment, frames are first extracted from the video segment and then encoded by the video encoder, converting the image representation into a video representation, so that the video local coding information of each video segment is obtained. Likewise, to obtain the text representation, the text sentences corresponding to a text segment are input into the text encoder for encoding, and the text local coding information of each text sentence is obtained. The video encoder corresponds to a video coding matrix and the text encoder corresponds to a text coding matrix, and the initial values of both matrices are random.
For the video and the text information: to obtain the video representation, the complete video is input into the video encoder for encoding, yielding the video global coding information; the complete text information is input into the text encoder for encoding, yielding the text global coding information. The video encoder/text encoder used for global encoding here may be shared with the video encoder/text encoder used for local encoding described above, or they may be independent of each other.
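The description above only states that the encoders correspond to randomly initialized coding matrices; their exact form is not given. The NumPy sketch below therefore shows one plausible minimal reading: a linear projection with a randomly initialized matrix that maps pre-extracted frame features and token features into a shared coding dimension, with mean pooling giving the clip-level or sentence-level code. The feature dimensions and the pooling choice are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearEncoder:
    """A coding matrix with random initial values, as described above."""
    def __init__(self, in_dim: int, out_dim: int):
        self.W = rng.normal(0, 0.02, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def encode(self, feats: np.ndarray) -> np.ndarray:
        """feats: (num_items, in_dim) -> (num_items, out_dim)."""
        return feats @ self.W + self.b

    def encode_pooled(self, feats: np.ndarray) -> np.ndarray:
        """Mean-pool the item codes into a single vector (assumed pooling)."""
        return self.encode(feats).mean(axis=0)

d_model = 256
video_encoder = LinearEncoder(in_dim=2048, out_dim=d_model)  # frame features, e.g. CNN outputs
text_encoder = LinearEncoder(in_dim=300, out_dim=d_model)    # token features, e.g. word vectors

frames = rng.normal(size=(16, 2048))   # frames sampled from one video clip
tokens = rng.normal(size=(12, 300))    # tokens of one text sentence

clip_code = video_encoder.encode_pooled(frames)      # video local coding information
sentence_code = text_encoder.encode_pooled(tokens)   # text local coding information
print(clip_code.shape, sentence_code.shape)          # (256,) (256,)
```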
It is worth noting that the improved T-Transformer model includes several attention networks, and each attention network is formed by stacking one DMAN, one SAN, and one FFN in sequence. Specifically, for the processing of the video and the text here, each layer of a conventional Transformer consists of two parts, a self-attention network (SAN) and a feed-forward neural network (FFN), and most current research takes these two parts apart and enhances them separately.
A feed-forward neural network is a type of artificial neural network. It adopts a unidirectional multi-layer structure in which each layer contains several neurons. In such a network, each neuron receives signals from the neurons of the previous layer and produces an output to the next layer. Layer 0 is called the input layer, the last layer is called the output layer, and the other intermediate layers are called hidden layers. There may be one hidden layer or multiple hidden layers.
SAN and FFN can both be regarded as belonging to a broader class of neural network structures, Mask Attention Networks (MANs), in which the mask matrices are static. A static mask, however, limits the model's ability to model local information. Intuitively, because the mask matrix of the FFN is an identity matrix, the FFN can only access its own information and not that of its neighbors. In the SAN, every token can access the information of all other tokens in the sentence, so words outside the neighborhood may also receive a considerable attention score. The SAN may therefore introduce noise into semantic modeling and overlook effective local signals.
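To make the static-mask view concrete, the short NumPy sketch below builds the fixed mask matrices discussed here for a toy sequence of length 6: the identity mask that characterizes the FFN (each position sees only itself), the all-ones mask that characterizes the SAN (each position sees every position), and a fixed local-window mask whose neighborhood size cannot adapt to the query token. The sequence length and window width are illustrative values only.

```python
import numpy as np

seq_len = 6

# FFN viewed as a mask attention network: the mask is the identity matrix,
# so position t can only attend to itself.
ffn_mask = np.eye(seq_len)

# SAN viewed as a mask attention network: the mask is all ones,
# so position t attends to every position, including distant ones.
san_mask = np.ones((seq_len, seq_len))

# A static "local window" mask (bandwidth 1) restricts attention to
# neighbours, but its window size is fixed and cannot follow the query token.
local_mask = np.tril(np.triu(np.ones((seq_len, seq_len)), k=-1), k=1)

print(ffn_mask)
print(san_mask)
print(local_mask)
```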
Obviously, a static mask matrix could be used to make the model consider only the words within a specific neighborhood, which would give a better local modeling effect, but this approach lacks flexibility. Considering that the size of the neighborhood should vary with the query token, the embodiments of the present application construct a strategy that adjusts the neighborhood size dynamically.
The formula for the dynamic mask is rendered as an image in the original document. In that expression, l is the index of the current layer, i is the index of the current attention head, t and s correspond to the positions of the query token and the key token respectively, σ is a constant, W^l and the other matrix variables appearing in the expression are learnable, and H^l denotes the set of attention heads under the multi-head attention mechanism.
In terms of stacking, MANs can take various structures. The embodiments of the present application model the data by stacking DMAN, SAN, and FFN in sequence. The Dynamic Mask Attention Network (DMAN) replaces the static mask of the SAN part with a dynamic mask matrix that adaptively adjusts the neighborhood size of each feature position, which better models local information. The FFN maps the attention result at each position into a feature space of larger dimension, applies the GELU (Gaussian Error Linear Units) activation function for non-linear filtering, and finally restores the original dimension, which strengthens the attention paid to a position's own feature information. Specifically, the encoded local coding information is input into the SAN module to obtain a weighted feature vector Z; this Z is the attention function A_M(Q, K, V) of the MAN, defined as follows:
A_M(Q, K, V) = S_M(Q, K) V
S_M(Q, K)_ij = M_ij · exp(Q_i K_j^T / √d_k) / Σ_j′ M_ij′ · exp(Q_i K_j′^T / √d_k)
where Q denotes the queries, K the keys, and V the values, the vector dimensions of Q, K, and V are the same, M denotes the dynamic mask matrix, S denotes the softmax function, i indexes the i-th query in Q, j indexes the j-th key in K, and d_k is the dimension of the key vectors.
Therefore, a set of attention weight values can be obtained from S_M(Q, K). At the same time, solving S_M(Q, K) with a fixed all-ones mask matrix recovers the special case of the mask attention network in which the mask is static and all ones.
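The following NumPy sketch implements the masked attention A_M(Q, K, V) = S_M(Q, K) V exactly as defined above, with the mask entries re-weighting the attention scores before normalization. The dynamic_mask helper alongside it is only one simple, assumed parameterization (a sigmoid of the relative distance between positions); the embodiment's exact dynamic mask formula is rendered as an image in the original and is not reproduced here.

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """A_M(Q, K, V) = S_M(Q, K) V, with S_M(Q, K)_ij proportional to
    M_ij * exp(Q_i K_j^T / sqrt(d_k))."""
    d_k = K.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(d_k)) * M          # numerator terms
    S = scores / scores.sum(axis=-1, keepdims=True)      # row-wise normalization
    return S @ V

def dynamic_mask(seq_len, scale=1.5):
    """Illustrative soft mask: closer key positions get weights nearer to 1.
    This parameterization is an assumption, not the patent's exact DMAN formula."""
    t = np.arange(seq_len)[:, None]
    s = np.arange(seq_len)[None, :]
    return 1.0 / (1.0 + np.exp(np.abs(t - s) / scale - 2.0))  # sigmoid(2 - |t - s| / scale)

rng = np.random.default_rng(0)
L, d = 6, 8
Q = rng.normal(size=(L, d)); K = rng.normal(size=(L, d)); V = rng.normal(size=(L, d))

Z_dman = masked_attention(Q, K, V, dynamic_mask(L))       # dynamic mask
Z_san = masked_attention(Q, K, V, np.ones((L, L)))        # all-ones mask recovers the SAN
print(Z_dman.shape, Z_san.shape)  # (6, 8) (6, 8)
```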
The structure of the constructed model is shown in Figure 6. The model is mainly divided into two parts: a video input module (the upper half of Figure 6) and a text input module (the lower half of Figure 6). Videos and texts correspond one-to-one in the training set; since the data are annotated, they can be regarded as information in the same space. Model training is then needed so that the video and the text can be expressed uniformly in that shared space.
The improved T-Transformer model includes a video T-Transformer model and a text T-Transformer model, whose parameters are independent of each other; parameters are shared among the multiple video T-Transformer models, and parameters are shared among the multiple text T-Transformer models. As can be seen from Figure 6, there are N+1 video T-Transformer models, of which one receives the global video coding information and the remaining N receive the local video coding information, and each of the N+1 models outputs its processed information. Likewise, there are N+1 text T-Transformer models, of which one receives the global text coding information and the remaining N receive the local text coding information, and each of the N+1 models outputs its processed information.
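A minimal sketch of the branch layout described here: one shared video module is applied to the global video code and to each of the N local video codes, and one independently parameterized text module is applied to the global text code and to each of the N local text codes. The toy ToyTransformer (a single linear map plus nonlinearity) is an assumed stand-in for the T-Transformer; only the parameter-sharing pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyTransformer:
    """Stand-in for a T-Transformer: one shared weight matrix per modality."""
    def __init__(self, dim):
        self.W = rng.normal(0, 0.02, size=(dim, dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)

d, N = 256, 4
video_branch = ToyTransformer(d)   # shared by all N+1 video inputs
text_branch = ToyTransformer(d)    # shared by all N+1 text inputs, independent of video_branch

video_global = rng.normal(size=d)
video_locals = [rng.normal(size=d) for _ in range(N)]
text_global = rng.normal(size=d)
text_locals = [rng.normal(size=d) for _ in range(N)]

# The same module (same parameters) processes the global code and every local code.
video_outputs = [video_branch(video_global)] + [video_branch(v) for v in video_locals]
text_outputs = [text_branch(text_global)] + [text_branch(t) for t in text_locals]
print(len(video_outputs), len(text_outputs))  # 5 5  (N + 1 outputs per modality)
```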
The output of each T-Transformer model is the input of an Attention-FA (attention-aware feature aggregation) module. As shown in Figure 6, each T-Transformer model is connected to one Attention-FA module, and the Attention-FA modules include a global processing module and a local processing module. Referring to Figure 7, the aforementioned step S130 includes, but is not limited to, the following steps:
Step S131: input the global information into the global processing module to obtain the global features, where the global features include video global features and text global features;
Step S132: input the local information into the local processing module to obtain the local features, where the local features include video local features and text local features.
The video T-Transformer model that processes the global video coding information and the text T-Transformer model that processes the global text coding information are each connected to a global processing module; the video T-Transformer models that process the local video coding information and the text T-Transformer models that process the local text coding information are each connected to a local processing module. Through the processing of the Attention-FA modules, the corresponding video global features, text global features, video local features, and text local features are obtained.
On the video side, the Attention-FA module performs attention processing on the video and on the video segments separately. The specific method is as follows:
Generate two random learnable matrices W_1 and W_2, together with corresponding bias terms b_1 and b_2;
Denote the output of the T-Transformer model as K, where one part of the output corresponds to the video and the other part corresponds to the text (the symbols for these parts are rendered as images in the original document). The matrices corresponding to the video and to the text are then computed with the GELU activation as follows:
Q = GELU(W_1 K^T + b_1), with K = x
A = softmax(W_2 Q + b_2)^T
where GELU is the Gaussian Error Linear Units activation function and K^T is the transpose of K. K^T is multiplied by the learnable matrix W_1 and the bias b_1 is added, and the result is passed through the GELU activation to obtain the matrix Q; the matrix Q is then multiplied by the learnable matrix W_2, the corresponding bias b_2 is added, and the attention weights A are obtained through the softmax function.
After processing by the Attention-FA modules, the following are obtained: the video local features (one feature vector per video segment, with n the number of segments), the video global feature g_v, the text local features (one feature vector per text segment, with m the number of sentences), and the text global feature g_p. The video global feature, the text global feature, the video local features, and the text local features are then input into the Contextual Transformer model for feature splicing.
It can be understood that the global processing module includes one video global Attention-FA module and one text global Attention-FA module, and the local processing module includes N video local Attention-FA modules and N text local Attention-FA modules, where N is a positive integer. The connection pattern of the modules can be seen in Figure 6. The essence of the Attention mechanism is an addressing process: given a task-related query vector q, the attention distribution over the keys is computed and applied to the values to obtain the Attention Value. This reflects how the Attention mechanism alleviates the complexity of a neural network model: instead of feeding all N pieces of input information into the network for computation, only the task-related information selected from X needs to be input into the network.
In the above step S140, the global features and the local features are taken as a joint input to the Contextual Transformer model, and the video features and the text features are obtained through feature splicing. Referring to Figure 8, this may specifically include the following steps:
Step S141: input the local features, as the Local Context, into a preset Transformer model, and perform a max pooling operation on the output to obtain the Local feature vector F_local;
Step S142: input the global features, as the Global Context, into the preset Transformer model to obtain the Global feature vector F_cross;
Step S143: perform feature splicing on F_local and F_cross to obtain the video features and the text features.
To further strengthen the ability of the Transformer to capture different text features and video features, the global features of the video and the text information are input into a Transformer model of conventional structure (obtained by presetting). This Transformer model consists of a self-attention network with a multi-head attention mechanism and a feed-forward neural network, and adopts the short-cut structure of residual networks to address the degradation problem in deep learning. Specifically:
The video global feature g_v is input into the preset Transformer model to obtain a video global feature vector; the text global feature g_p is input into the preset Transformer model to obtain a text global feature vector; the video global feature vector and the text global feature vector serve as the Global feature vector F_cross.
Before the local information is input into the preset Transformer model, an additional Positional Encoding vector is added at the model input. This vector determines the position of the current piece of information, or the distance between different words in a sentence, and thereby allows the constructed model to better interpret the order of the input sequence.
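The embodiment does not specify the form of the Positional Encoding vector, so the sketch below uses the common sinusoidal scheme as one assumed possibility: each position is mapped to a fixed vector of sines and cosines at different frequencies, which is added to the input features so that the model can recover order and relative distance.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sine/cosine positional encodings, shape (seq_len, d_model).
    This particular scheme is an assumption; the embodiment only states that a
    positional vector is added at the input."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions
    return pe

# Add the encoding to a sequence of local features before the preset Transformer.
seq_len, d_model = 10, 256
local_features = np.random.default_rng(0).normal(size=(seq_len, d_model))
local_with_position = local_features + sinusoidal_positional_encoding(seq_len, d_model)
print(local_with_position.shape)  # (10, 256)
```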
Afterwards, the video local features are each input into the preset Transformer model, and max pooling (taking the maximum value along each dimension) is applied to the outputs to obtain the video local feature vector; the text local features are likewise each input into the preset Transformer model, and max pooling is applied to the outputs to obtain the text local feature vector. The video local feature vector and the text local feature vector serve as the Local feature vector F_local.
F_local and F_cross are spliced to obtain the video features and the text features, denoted by the video feature υ and the text feature δ respectively.
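Putting steps S141 to S143 together, the sketch below runs the local features of one branch through a stand-in for the preset Transformer, max-pools the result into F_local, takes the transformed global feature as F_cross, and concatenates the two into the final feature (υ for the video branch, δ for the text branch). The toy stand-in transformer and the concatenation order are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def preset_transformer(x, W):
    """Toy stand-in for the preset Transformer: one linear map plus nonlinearity."""
    return np.tanh(x @ W)

def branch_feature(local_feats, global_feat, W):
    """local_feats: (num_items, d); global_feat: (d,). Returns the spliced feature."""
    F_local = preset_transformer(local_feats, W).max(axis=0)   # max pooling over items
    F_cross = preset_transformer(global_feat[None, :], W)[0]   # global context
    return np.concatenate([F_local, F_cross])                  # feature splicing

d, n_segments, m_sentences = 256, 4, 7
W = rng.normal(0, 0.02, size=(d, d))

video_locals, g_v = rng.normal(size=(n_segments, d)), rng.normal(size=d)
text_locals, g_p = rng.normal(size=(m_sentences, d)), rng.normal(size=d)

upsilon = branch_feature(video_locals, g_v, W)   # video feature
delta = branch_feature(text_locals, g_p, W)      # text feature
print(upsilon.shape, delta.shape)  # (512,) (512,)
```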
Referring to Figure 9, after the above video feature υ and text feature δ are obtained, the improved T-Transformer model can be optimized according to the output. Specifically:
Step S161: construct a loss function according to the video features and the text features output by the Contextual Transformer model;
Step S162: optimize the improved T-Transformer model with the loss function, which is expressed as follows:
The overall training objective, which aggregates the pairwise loss L(P, N, α) over the training samples, is rendered as an image in the original document; the pairwise loss and the distance function are defined as:
L(P, N, α) = max(0, α + D(x, y) − D(x′, y)) + max(0, α + D(x, y) − D(x, y′))
D(x, y) = 1 − x^T y / (‖x‖‖y‖)
where x denotes a video feature output by the Contextual Transformer model, y denotes a text feature output by the Contextual Transformer model, and x′ and y′ denote negative samples of x and y respectively. A negative sample pair, written (x′, y) or (x, y′), means that the video and the text in the current pair do not come from the same data item: (x′, y) means the video comes from another sample, and (x, y′) means the text comes from another sample; such pairs are denoted by N in L(P, N, α). A positive sample pair, written (x, y), means that the video and the text in the current pair come from the same data item and is denoted by P in L(P, N, α). α is a constant parameter serving as a conversion (margin) factor: if it is set too large, the model parameters will not converge, and if it is set too small, the model will learn slowly; in this embodiment α is set to 0.2.
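The sketch below computes the pairwise loss exactly as defined above: D is the cosine distance, the positive pair is (x, y), the negatives are (x′, y) and (x, y′), and the margin α is 0.2 as in this embodiment. Averaging this quantity over a batch of annotated pairs (the aggregation whose equation is shown as an image in the original) is an assumed but natural choice.

```python
import numpy as np

def cosine_distance(x, y):
    """D(x, y) = 1 - x^T y / (||x|| ||y||)."""
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pair_loss(x, y, x_neg, y_neg, alpha=0.2):
    """L(P, N, alpha) for one positive pair (x, y) and its negatives x', y'."""
    d_pos = cosine_distance(x, y)
    return (max(0.0, alpha + d_pos - cosine_distance(x_neg, y)) +
            max(0.0, alpha + d_pos - cosine_distance(x, y_neg)))

rng = np.random.default_rng(0)
d = 512
x, y = rng.normal(size=d), rng.normal(size=d)          # matched video / text features
x_neg, y_neg = rng.normal(size=d), rng.normal(size=d)  # features taken from other samples

print(pair_loss(x, y, x_neg, y_neg))  # single-pair loss

# Assumed batch aggregation: average the pair losses over a training batch.
batch = [(rng.normal(size=d), rng.normal(size=d),
          rng.normal(size=d), rng.normal(size=d)) for _ in range(8)]
print(np.mean([pair_loss(*p) for p in batch]))
```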
After the model has been trained on manually annotated data, the video feature υ and the text feature δ output by the model can be used directly for similarity calculation, for example cosine similarity, so that the degree of similarity between a video and a text can be compared and operations such as retrieval can be performed. For example, even when a video has no text or tags at all, only the video itself, the corresponding video can be retrieved by entering text. That is:
Obtain the text to be retrieved;
Input the text to be retrieved into the optimized improved T-Transformer model to obtain a target video that matches the text to be retrieved.
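Once the features live in the same space, retrieval reduces to ranking videos by cosine similarity against the text query, as sketched below. The pre-computed video feature matrix and the placeholder encode_text function are assumptions standing in for the trained model's text branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(query: str, d: int = 512) -> np.ndarray:
    """Placeholder for the trained model's text branch (assumption for illustration)."""
    rng_q = np.random.default_rng(abs(hash(query)) % (2 ** 32))
    return rng_q.normal(size=d)

def retrieve(query: str, video_features: np.ndarray, top_k: int = 3):
    """Rank videos by cosine similarity between the text feature and each video feature."""
    q = encode_text(query)
    q = q / np.linalg.norm(q)
    v = video_features / np.linalg.norm(video_features, axis=1, keepdims=True)
    scores = v @ q                           # cosine similarity per video
    order = np.argsort(-scores)[:top_k]      # highest similarity first
    return list(zip(order.tolist(), scores[order].tolist()))

video_bank = rng.normal(size=(100, 512))     # features of an untagged video library
print(retrieve("a man pours a cup of coffee", video_bank))
```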
Referring to Figure 10, before the video and the corresponding text information are obtained, the following steps may also be included:
Step S171: annotate the video, and take the annotated video as the video used for training;
Step S172: annotate the text information in the same manner as the video is annotated, and take the annotated text information as the text information used for training.
Through the above steps, the video and the text can be converted into the same comparison space, and the similarity of two different kinds of data can be computed. Even if a video has no tags, titles, or other metadata, video retrieval can still be performed by entering text, so that the corresponding target video is matched.
In addition, referring to Figure 11, an embodiment of the present application provides an apparatus for determining similarity between video and text, the apparatus including:
an acquisition unit, configured to obtain a video and corresponding text information and to encode the video and the text information to obtain coding feature information, where the coding feature information includes video local coding information, video global coding information, text local coding information, and text global coding information;
a first processing unit, configured to input the coding feature information into an improved T-Transformer model to obtain global information and local information, where the improved T-Transformer model is formed by stacking a dynamic mask attention network DMAN, a self-attention network SAN, and a feed-forward neural network FFN;
a second processing unit, configured to input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
a context processing unit, configured to take the global features and the local features as a joint input to a Contextual Transformer model and obtain video features and text features through feature splicing, where the video features correspond to the video and the text features correspond to the text information;
a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
In addition, referring to Figure 12, an embodiment of the present application further provides an electronic device. The electronic device 2000 includes a memory 2002, a processor 2001, and a computer program stored in the memory 2002 and executable on the processor 2001.
The processor 2001 and the memory 2002 may be connected through a bus or in other ways.
The non-transitory software program and instructions required to implement the video and text similarity determination method of the above embodiments are stored in the memory 2002. When they are executed by the processor 2001, the video and text similarity determination method applied to the device in the above embodiments is performed, for example, the method steps S110 to S140 in Figure 1, the method steps S111 to S112 in Figure 2, the method steps S1111 to S1112 in Figure 3, the method steps S1121 to S1122 in Figure 4, the method steps S1131 to S1132 in Figure 5, the method steps S131 to S132 in Figure 7, the method steps S141 to S143 in Figure 8, the method steps S161 to S162 in Figure 9, and the method steps S171 to S172 in Figure 10 described above.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor or a controller, for example by a processor in the above electronic device embodiment, the processor is caused to perform the video and text similarity determination method of the above embodiments, for example, the method steps S110 to S140 in Figure 1, the method steps S111 to S112 in Figure 2, the method steps S1111 to S1112 in Figure 3, the method steps S1121 to S1122 in Figure 4, the method steps S1131 to S1132 in Figure 5, the method steps S131 to S132 in Figure 7, the method steps S141 to S143 in Figure 8, the method steps S161 to S162 in Figure 9, and the method steps S171 to S172 in Figure 10 described above. The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or some of the steps and apparatuses in the methods disclosed above may be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory storage media) and communication storage media (or transitory storage media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable storage media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other storage medium that can be used to store the desired information and that can be accessed by a computer. In addition, it is well known to those of ordinary skill in the art that communication storage media usually carry computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery storage media.
The present application can be used in numerous general-purpose or special-purpose computer apparatus environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor apparatuses, microprocessor-based apparatuses, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above apparatuses or devices. The present application may be described in the general context of computer programs executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to various embodiments of the present application. Each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more programs for implementing the specified logical functions. It should also be noted that in some alternative implementations the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based apparatus that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware, and the described units may also be provided in a processor. The names of these units do not, in certain cases, constitute a limitation on the units themselves.
It should be noted that although several modules or units of the device for executing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here may be implemented by software, or by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, or the like) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the embodiments disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in the present application.
It should be understood that the present application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
The preferred implementations of the present application have been specifically described above, but the present application is not limited to the above embodiments. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A method for determining similarity between a video and text, comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
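Purely as an illustration of the pipeline recited in claim 1, the following minimal sketch (assuming PyTorch) wires generic encoder, attention and Transformer modules in the claimed order; every class name, dimension and pooling choice below is a hypothetical stand-in, not the claimed implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoTextSimilarity(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)   # stand-in for a video encoder head
        self.text_proj = nn.Linear(text_dim, dim)     # stand-in for a text encoder head
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Generic encoder standing in for the improved T-Transformer (DMAN + SAN + FFN stack)
        self.t_transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Generic attention standing in for the Attention-FA global/local processing
        self.attention_fa = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Generic encoder standing in for the Contextual Transformer
        self.contextual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1
        )

    def encode_stream(self, x):
        h = self.t_transformer(x)           # global/local information
        h, _ = self.attention_fa(h, h, h)   # global/local features
        h = self.contextual(h)              # contextual fusion
        return h.max(dim=1).values          # pooled feature vector (pooling choice assumed)

    def forward(self, frame_feats, token_feats):
        video_feature = self.encode_stream(self.video_proj(frame_feats))
        text_feature = self.encode_stream(self.text_proj(token_feats))
        # Final step of the claim: similarity determined from video and text features
        return F.cosine_similarity(video_feature, text_feature, dim=-1)


if __name__ == "__main__":
    model = VideoTextSimilarity()
    frames = torch.randn(2, 16, 2048)   # 2 videos, 16 frame-level features each
    tokens = torch.randn(2, 32, 768)    # 2 texts, 32 token-level features each
    print(model(frames, tokens))        # one similarity score per video-text pair

The generic nn.TransformerEncoder above merely marks where the DMAN/SAN/FFN stack and the Contextual Transformer would sit; the claim does not constrain them to these modules.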
  2. The method for determining similarity between a video and text according to claim 1, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  3. The method for determining similarity between a video and text according to claim 2, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  4. The method for determining similarity between a video and text according to claim 2, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  5. The method for determining similarity between a video and text according to claim 2, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
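As a reading aid for the segmentation and encoding steps of claims 2–5, a small sketch follows; the uniform split, the video_encoder/text_encoder callables and the toy usage are assumptions introduced only for illustration.

def encode_pair(video_frames, sentences, n_segments, video_encoder, text_encoder):
    """Split a video and its sentences into N aligned segments and encode them
    locally and globally, in the spirit of claims 2-5 (uniform split assumed)."""
    seg_len = max(1, len(video_frames) // n_segments)
    sent_len = max(1, len(sentences) // n_segments)
    video_local, text_local = [], []
    for i in range(n_segments):
        clip = video_frames[i * seg_len:(i + 1) * seg_len]      # i-th video segment
        seg_sents = sentences[i * sent_len:(i + 1) * sent_len]  # matching text segment
        video_local.append(video_encoder(clip))                 # video local encoding information
        text_local.append(text_encoder(" ".join(seg_sents)))    # text local encoding information
    video_global = video_encoder(video_frames)                  # video global encoding information
    text_global = text_encoder(" ".join(sentences))             # text global encoding information
    return video_local, text_local, video_global, text_global


# Toy usage with trivial stand-in encoders (a real system would use pretrained models)
print(encode_pair(
    video_frames=list(range(12)),
    sentences=["a cat", "sits down", "and sleeps"],
    n_segments=3,
    video_encoder=lambda frames: float(len(frames)),
    text_encoder=lambda text: float(len(text.split())),
))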
  6. The method for determining similarity between a video and text according to claim 1, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
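The claims leave the internal form of the Attention-FA global and local processing modules open; one plausible reading is an attention-weighted pooling over the sequence, sketched below under that assumption (PyTorch, hypothetical names).

import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Scores every position and returns an attention-weighted sum over the sequence."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                              # x: (batch, seq_len, dim)
        weights = torch.softmax(self.score(x), dim=1)  # attention weights over positions
        return (weights * x).sum(dim=1)                # pooled features: (batch, dim)


# One instance per branch, mirroring the claim: a global module and a local module
global_module, local_module = AttentionPool(), AttentionPool()
global_features = global_module(torch.randn(2, 20, 512))  # video/text global features
local_features = local_module(torch.randn(2, 20, 512))    # video/text local features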
  7. The method for determining similarity between a video and text according to claim 1, wherein the inputting the global features and the local features as a joint input into the Contextual Transformer model and obtaining video features and text features through feature splicing comprises:
    inputting the local features into a preset Transformer model as a Local Context, and performing a max-pooling operation on the output to obtain a Local feature vector F_local;
    inputting the global features into the preset Transformer model as a Global Context to obtain a Global feature vector F_cross; and
    performing feature splicing on F_local and F_cross to obtain the video features and the text features.
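A sketch of the claim-7 fusion, assuming PyTorch: a generic nn.TransformerEncoder stands in for the "preset Transformer model", F_local comes from max-pooling the Local Context output, F_cross from the Global Context output (the pooling used for F_cross is an assumption), and the two are spliced by concatenation.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
# Generic encoder standing in for the "preset Transformer model" of the claim
preset_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1
)

def fuse(local_context, global_context):
    f_local = preset_transformer(local_context).max(dim=1).values  # max-pooled F_local
    f_cross = preset_transformer(global_context).mean(dim=1)       # F_cross (pooling assumed)
    return torch.cat([f_local, f_cross], dim=-1)                   # feature splicing

video_features = fuse(torch.randn(2, 16, dim), torch.randn(2, 1, dim))   # per-video result
text_features = fuse(torch.randn(2, 16, dim), torch.randn(2, 1, dim))    # per-text result
similarity = F.cosine_similarity(video_features, text_features, dim=-1)  # final step of claim 1
print(similarity)

Splicing by concatenation keeps both the segment-level detail carried by F_local and the whole-sequence context carried by F_cross in the resulting video and text features.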
  8. An apparatus for determining similarity between a video and text, comprising:
    an acquisition unit, configured to obtain a video and corresponding text information, and encode the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    a first processing unit, configured to input the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    a second processing unit, configured to input the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    a context processing unit, configured to input the global features and the local features as a joint input into a Contextual Transformer model, and obtain video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    a similarity calculation unit, configured to determine the similarity between the video and the text information according to the video features and the text features.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a method for determining similarity between a video and text, the method comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
  10. The electronic device according to claim 9, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  11. The electronic device according to claim 10, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  12. The electronic device according to claim 10, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  13. The electronic device according to claim 10, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
  14. The electronic device according to claim 9, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
  15. A computer-readable storage medium storing a computer program, wherein the computer program is used to execute a method for determining similarity between a video and text, the method comprising:
    obtaining a video and corresponding text information, and encoding the video and the text information to obtain encoded feature information, wherein the encoded feature information comprises video local encoding information, video global encoding information, text local encoding information and text global encoding information;
    inputting the encoded feature information into an improved T-Transformer model to obtain global information and local information, wherein the improved T-Transformer model is formed by stacking a dynamic mask attention network (DMAN), a self-attention network (SAN) and a feed-forward neural network (FFN);
    inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features;
    inputting the global features and the local features as a joint input into a Contextual Transformer model, and obtaining video features and text features through feature splicing, wherein the video features correspond to the video and the text features correspond to the text information; and
    determining the similarity between the video and the text information according to the video features and the text features.
  16. The computer-readable storage medium according to claim 15, wherein the encoding the video and the text information to obtain encoded feature information comprises:
    segmenting the video and the text information to obtain N video segments and N text segments, wherein each video segment corresponds to one text segment, and N is a positive integer;
    encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information; and
    encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information.
  17. The computer-readable storage medium according to claim 16, wherein the segmenting the video and the text information comprises:
    clipping the video into N video segments according to a preset segmentation manner; and
    extracting several text sentences from each video segment as the text segment corresponding to the video segment.
  18. The computer-readable storage medium according to claim 16, wherein the encoding the video segments and the text segments respectively to obtain the video local encoding information and the text local encoding information comprises:
    extracting image frames from each video segment, and encoding the image frames with a video encoder to obtain the video local encoding information corresponding to the video segment; and
    inputting the text segment corresponding to the video segment into a text encoder for encoding to obtain the text local encoding information corresponding to the text segment.
  19. The computer-readable storage medium according to claim 16, wherein the encoding the video and the text information respectively to obtain the video global encoding information and the text global encoding information comprises:
    inputting the video into a video encoder for encoding to obtain the video global encoding information; and
    inputting the text information into a text encoder for encoding to obtain the text global encoding information.
  20. The computer-readable storage medium according to claim 15, wherein the Attention-FA module comprises a global processing module and a local processing module, and the inputting the global information and the local information into corresponding Attention-FA modules respectively to obtain global features and local features comprises:
    inputting the global information into the global processing module to obtain the global features, wherein the global features comprise video global features and text global features; and
    inputting the local information into the local processing module to obtain the local features, wherein the local features comprise video local features and text local features.
PCT/CN2022/090656 2022-03-09 2022-04-29 Method and apparatus for determining similarity between video and text, electronic device, and storage medium WO2023168818A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210234257.8A CN114612826A (en) 2022-03-09 2022-03-09 Video and text similarity determination method and device, electronic equipment and storage medium
CN202210234257.8 2022-03-09

Publications (1)

Publication Number Publication Date
WO2023168818A1 true WO2023168818A1 (en) 2023-09-14

Family

ID=81862202

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090656 WO2023168818A1 (en) 2022-03-09 2022-04-29 Method and apparatus for determining similarity between video and text, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114612826A (en)
WO (1) WO2023168818A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334468A1 (en) * 2018-01-08 2020-10-22 Samsung Electronics Co., Ltd. Display apparatus, server, system and information-providing methods thereof
CN113204674A (en) * 2021-07-05 2021-08-03 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN114048351A (en) * 2021-11-08 2022-02-15 湖南大学 Cross-modal text-video retrieval method based on space-time relationship enhancement

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108419094B (en) * 2018-03-05 2021-01-29 腾讯科技(深圳)有限公司 Video processing method, video retrieval method, device, medium and server
US11238093B2 (en) * 2019-10-15 2022-02-01 Adobe Inc. Video retrieval based on encoding temporal relationships among video frames
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113779361A (en) * 2021-08-27 2021-12-10 华中科技大学 Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN114037945A (en) * 2021-12-10 2022-02-11 浙江工商大学 Cross-modal retrieval method based on multi-granularity feature interaction


Also Published As

Publication number Publication date
CN114612826A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
Surís et al. Cross-modal embeddings for video and audio retrieval
Garcia et al. A dataset and baselines for visual question answering on art
CN111079532A (en) Video content description method based on text self-encoder
CN110569359B (en) Training and application method and device of recognition model, computing equipment and storage medium
CN114565104A (en) Language model pre-training method, result recommendation method and related device
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN110795944A (en) Recommended content processing method and device, and emotion attribute determining method and device
CN113204633B (en) Semantic matching distillation method and device
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN114048351A (en) Cross-modal text-video retrieval method based on space-time relationship enhancement
CN113705315A (en) Video processing method, device, equipment and storage medium
CN114155477B (en) Semi-supervised video paragraph positioning method based on average teacher model
Cornia et al. A unified cycle-consistent neural model for text and image retrieval
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN114330514A (en) Data reconstruction method and system based on depth features and gradient information
Jia et al. Semantic association enhancement transformer with relative position for image captioning
Cornia et al. Towards cycle-consistent models for text and image retrieval
CN112132075A (en) Method and medium for processing image-text content
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
Chen et al. Video captioning via sentence augmentation and spatio-temporal attention
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Gutiérrez et al. Bimodal neural style transfer for image generation based on text prompts
Li et al. Text-guided dual-branch attention network for visual question answering

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22930447

Country of ref document: EP

Kind code of ref document: A1