CN114579803B - Video retrieval method, device and storage medium based on dynamic convolution and shortcuts - Google Patents

Video retrieval method, device and storage medium based on dynamic convolution and shortcuts

Info

Publication number
CN114579803B
CN114579803B (application CN202210223064.2A)
Authority
CN
China
Prior art keywords
output
module
video
layer
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210223064.2A
Other languages
Chinese (zh)
Other versions
CN114579803A (en)
Inventor
刘志
张萌萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology
Priority to CN202210223064.2A
Publication of CN114579803A
Application granted
Publication of CN114579803B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video retrieval framework is presented that includes a video encoder which processes a plurality of modalities extracted from an input video with a multi-modal Transformer (MMT). The MMT comprises: a plurality of inputs, corresponding to the plurality of modalities, for receiving a plurality of video embeddings corresponding to those modalities; a plurality of outputs for outputting a video feature representation of the input video; and a fully connected network between the plurality of inputs and the plurality of outputs, with transformer encoders as its nodes. Each transformer encoder comprises a multi-head attention module that receives as inputs a query (Q), a key (K), and a value (V), and further comprises: a self-attention sub-module that receives the query (Q), the key (K), and the value (V) and maps the query and a set of key-value pairs to an output; a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and a coupler that joins the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module.

Description

Video retrieval method, device and storage medium based on dynamic convolution and shortcuts
Technical Field
The present invention relates to video processing technology and to the application of neural networks in the field of video processing, and more particularly to a method, apparatus and storage medium for neural-network-based video retrieval. The invention is particularly suitable for retrieving online videos.
Background
Video is one of the most commonly used media formats because it can capture dynamic events and provide direct visual and audio perception. Currently, online video accounts for an increasing share of video applications. Each online video platform hosts hundreds of millions of hours of video (including short video) that cannot be used effectively if it cannot be accessed efficiently. How to find relevant videos through retrieval therefore becomes critical.
For millions of videos, it is obviously impossible to attach reasonable titles and content descriptions entirely by hand. Even if every video were given a title and content description at the time of production, such a title and description may not characterize the video content well enough for subsequent retrieval. Accordingly, much current research focuses on how to use neural networks for efficient video retrieval.
For video retrieval, there are currently two tasks: "title to video" and "video to title". "Title to video" refers to a search whose query is given in the form of a title (e.g., "how to build a house"); the search target is then the video that the title best describes (e.g., a video explaining how to build a house). The term "title" herein means the various texts associated with the video content, such as video titles, video descriptions, and the like. The term "video" as used herein includes, in its narrow sense, a collection of pictures captured over time (i.e., the visual video), and broadly includes visual video, audio, speech, subtitles (embedded or separate subtitle files), various audio tracks (embedded or separate audio track files), related covers (e.g., movie covers used on DVD discs), time tags, location tags, video clips (e.g., video clips used on DVD and Blu-ray discs), various information related to video clips (e.g., covers for video clips, time tags, subtitles, content descriptions, etc.), and the like, all of which can form components of existing video content. Examples of online videos are the various short videos on YouTube, Douyin, TikTok and bilibili.
For the "title to video" task, each particular search is carried out by taking a set of "title-video" pairs and ranking all video candidates so that the video most relevant to the query title is ranked highest. Conversely, the purpose of the "video to title" task is to find, among a set of title candidates, the title (the search target) that best describes the searched video.
A common approach to both retrieval tasks is similarity learning, i.e., learning a function that best describes the similarity between two elements (the query and a candidate). The candidates (videos or titles) can then be ranked according to their similarity (similarity estimate) to the query.
Thus, the mainstream video retrieval framework currently comprises three parts: a video encoder, a text encoder, and similarity estimation. The video encoder obtains a video feature representation of the input video; the text encoder obtains a text feature representation of the input text (i.e., the title, video description, or other text associated with the video content); and the similarity calculation finds matching video and text by computing the similarity between the video feature representation and the text feature representation. This splits similarity learning into learning the video encoder, learning the text encoder, and learning the similarity estimation function.
For example, in similarity learning (i.e., the training phase), assume that X represents a set of training videos and Y represents the associated titles (also referred to herein as "text") of all the videos. Given a training set of B pairs {(v_1, c_1), …, (v_i, c_i), …, (v_B, c_B)}, where v_i ∈ X and c_i ∈ Y, similarity learning seeks a video feature representation F_v and a text feature representation F_c such that matching videos and texts can be found by comparing similarity scores. The formula is as follows:
s = d(F_v(v_i), F_c(c_j))    (1)
where d represents a learned similarity function (or distance function) and s is the similarity estimate used for ranking matches. In one particular embodiment, d may be the cosine similarity, i.e., given two attribute vectors A and B, the cosine similarity cos θ is obtained from the dot product and the vector lengths as follows:
cos θ = (A · B) / (‖A‖ ‖B‖) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i^2) · √(Σ_{i=1}^{n} B_i^2) )    (2)
where A_i and B_i represent the components of vectors A and B, respectively.
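The following is a minimal sketch, not taken from the patent, of how equations (1) and (2) can be applied to rank candidates against a query with cosine similarity; tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine_similarity(query_emb: torch.Tensor,
                              candidate_embs: torch.Tensor) -> torch.Tensor:
    """query_emb: (D,) feature of the query (text or video);
    candidate_embs: (B, D) features of the B candidates.
    Returns candidate indices sorted from most to least similar."""
    q = F.normalize(query_emb.unsqueeze(0), dim=-1)   # (1, D), unit length
    c = F.normalize(candidate_embs, dim=-1)           # (B, D), unit length
    s = (q * c).sum(dim=-1)                           # cosine similarity s = d(F_v, F_c), eqs. (1)-(2)
    return torch.argsort(s, descending=True)          # ranking used for retrieval
```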
Since text analysis is relatively simple and text retrieval has been studied for decades, text encoders for generating text feature representations are already relatively mature and efficient. For example, Radford et al., "Learning transferable visual models from natural language supervision," arXiv preprint arXiv:2103.00020 (hereinafter "document 1"), use a CLIP text encoder and a Transformer-based image encoder. The authors collected 400 million semantically matched text-image pairs, computed the similarity between the features produced by the text encoder and the image encoder, and trained with a contrastive loss function, ultimately achieving high similarity between matched text and images and low similarity between non-matched pairs. The trained CLIP text encoder described above is publicly available online.
Current research focuses mainly on the design of video encoders. Because of the diversity of video content (as described above), one of the core questions is how to enable a video encoder to make full use of the multiple modalities in the video content (visual video, audio, speech, subtitles, time tags, location tags, text, etc.) and to correlate these modalities so as to output a video feature representation that adequately captures the video content. One effective prior-art approach to exploiting multiple modalities is the Multi-modal Transformer (MMT) proposed by Gabeur, V. et al. of Google on the basis of the Transformer technique ("Multi-modal Transformer for video retrieval," In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16 (pp. 214-229), Springer International Publishing) (hereinafter "document 2"). MMT uses N pre-trained experts F_n, n = 1…N, to extract a sequence of K features from the video and obtains an aggregated embedding for each expert's feature sequence, so that an effective representation can be learned from the different modalities in the video. To process cross-modal information, N embeddings E_n, n = 1…N, are learned to distinguish the embeddings of the different experts. Finally, a temporal embedding T is provided. From the features F(v), the expert embeddings E(v) and the temporal embeddings T(v), a video embedding Ω(v) = F(v) + E(v) + T(v) is obtained. The video embedding is input into the MMT to obtain the video representation. The Transformer framework used by MMT is the attention-based Transformer architecture proposed by Vaswani, A. et al. of Google (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I., "Attention is all you need," In Advances in Neural Information Processing Systems (pp. 5998-6008)) (hereinafter "document 3").
The video retrieval frameworks of the prior art, and in particular the design of the video encoder, still leave much room for improvement. For example, although the Transformer framework of document 3 is widely used, it is inefficient at learning local dependencies because it lacks inductive bias. On the other hand, as the depth of the model increases, the learned features are prone to learning bias, mainly because deeper models suffer more easily from gradient divergence and larger errors, which makes training difficult and degrades model performance.
Disclosure of Invention
In order to solve the above technical problems, a novel video encoder is proposed for video retrieval.
In one aspect, in view of the above-described drawbacks of the Transformer framework, a combination of a self-attention sub-module and a span-based dynamic convolution sub-module is employed herein in the multi-head attention (Multi-Head Attention) module of the transformer encoder. The self-attention sub-module attends, at each layer of the fully connected network, to all of the input video embeddings and extracts the semantics of events occurring across the multiple modalities of the input video. The span-based dynamic convolution sub-module uses depth-wise separable convolution to collect information over a span of tokens (a token represents an individual element of the matrix) and then dynamically generates a convolution kernel, so that the local relationships of the input tokens are collected on the basis of their local context. The span-based dynamic convolution sub-module addresses the lack of inductive bias in self-attention, and such span-based convolution collects context information more efficiently. The final output is sent to the feed-forward layer for further processing.
In another aspect, also in view of the above-described drawbacks of the Transformer framework, the multi-head attention (Multi-Head Attention) module of the transformer encoder additionally uses shortcuts from the previous layer as enhanced shortcuts, in addition to the shortcut of the current layer. Learning bias is mitigated by placing more identity projections in parallel with the multi-head attention module that combines convolution and self-attention.
According to one aspect of the present invention, a video retrieval device is presented, which may include:
a video encoder that obtains a video feature representation of an input video, the video encoder processing a plurality of modalities extracted from the input video with a multi-modal Transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities, for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with transformer encoders (Trm) as its nodes;
a text encoder that obtains a text feature representation of the input text;
a similarity calculation unit that calculates a similarity between the video feature representation and the text feature representation for determining a match of video and text,
wherein the transformer encoder (Trm) comprises a multi-head attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-head attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for joining the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module.
According to a further aspect, the self-attention sub-module comprises:
a first, a second and a third linear layer for performing a linear combination of the query (Q), the key (K) and the value (V), respectively;
a first multiplier for tensor-matrix multiplying the outputs of the first and second linear layers;
a scaler for scaling the output of the first multiplier;
a SoftMax unit for applying a SoftMax function to the output of the scaler;
a second multiplier for tensor-matrix multiplying the output of the SoftMax unit and the output of the third linear layer; and
a fourth linear layer that receives the output of the second multiplier, performs a linear combination, and provides the output of the self-attention sub-module.
According to a further aspect, the span-based dynamic convolution sub-module further comprises:
a fifth and a sixth linear layer for performing a linear combination of the query (Q) and the value (V), respectively;
a convolution layer for performing a vector convolution on the key (K) and outputting the convolved key (Ks);
a third multiplier for performing a point-wise tensor-matrix multiplication of the outputs of the fifth linear layer and the convolution layer;
a seventh linear layer for receiving the output of the third multiplier and performing a linear transformation;
a SoftMax unit for applying a SoftMax function to the output of the seventh linear layer;
a lightweight convolution unit for performing a lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer; and
an eighth linear layer for receiving the output of the lightweight convolution unit, performing a linear combination, and providing the output.
According to a further aspect, the transformer encoder (Trm) comprises:
the multi-head attention module, receiving the input of the transformer encoder (Trm);
a first adder that adds the output of the multi-head attention module to the input of the transformer encoder (Trm);
a first layer normalization module for layer-normalizing the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing feed-forward processing;
a second adder that adds the output of the feed-forward module to the input of the first layer normalization module; and
a second layer normalization module for layer-normalizing the output of the second adder.
According to a further aspect, the transformer encoder (Trm) comprises:
the multi-head attention module, receiving the input of the transformer encoder (Trm);
a first adder that adds the output of the multi-head attention module, the input of the transformer encoder (Trm), and N enhanced shortcuts;
a first layer normalization module for layer-normalizing the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing feed-forward processing;
a second adder that adds the output of the feed-forward module, the input of the first layer normalization module, and N enhanced shortcuts; and
a second layer normalization module for layer-normalizing the output of the second adder.
According to a further aspect, the enhanced shortcut is obtained by parameterized projection of the shortcut connection from the previous layer.
According to a further aspect, the parameterized projection is a sequence of linear projections and activation functions.
According to a further aspect, the video encoder further comprises: a convolutional neural network for receiving the input video and extracting the plurality of modalities from the input video.
According to a further aspect, the text encoder employs a CLIP text encoder.
According to a further aspect, cosine similarity is used to calculate the similarity.
According to another aspect of the present invention, there is provided a method of retrieving video using the video retrieval device, comprising:
obtaining a video feature representation of an input video using a video encoder that processes a plurality of modalities extracted from the input video with a multi-modal Transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities, for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with transformer encoders (Trm) as its nodes;
obtaining a text feature representation of the input text using a text encoder;
calculating a similarity between the video feature representation and the text feature representation,
wherein the transformer encoder (Trm) comprises a multi-head attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-head attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for joining the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module.
According to another aspect of the invention, the video retrieval device is a software module implemented by code which, when executed, implements the video retrieval device or performs the method for retrieving video.
According to another aspect of the invention, the video retrieval device is a hardware module implemented by a processor dedicated to the neural network.
According to another aspect of the invention, the video retrieval device is a software and hardware architecture implemented by executable code in combination with a hardware processor dedicated to a neural network, wherein the hardware processor dedicated to a neural network may implement a portion of the architecture of the video retrieval device in the form of hardware functional modules, such as a fully connected network, a Convolutional Neural Network (CNN), and so on.
Drawings
Fig. 1 shows a schematic diagram of a video retrieval framework for video retrieval according to one embodiment of the invention.
Fig. 2 shows a schematic diagram of a video encoder in a video retrieval framework for video retrieval according to one embodiment of the invention.
Fig. 3 shows a schematic diagram of an MMT in a video encoder for video retrieval according to an embodiment of the invention.
Fig. 4 shows a schematic structural diagram of one embodiment of a transformer encoder (Trm) according to one embodiment of the present invention.
Fig. 5 shows a schematic structural diagram of one embodiment of a self-attention (SA) sub-module in a transformer encoder (Trm) according to one embodiment of the invention.
Fig. 6 shows a schematic structural diagram of one embodiment of a span-based dynamic convolution (SDC) sub-module in a transformer encoder (Trm) according to one embodiment of the present invention.
Fig. 7 shows a schematic structural diagram of another embodiment of a transformer encoder (Trm) according to an embodiment of the present invention.
Fig. 8 shows a schematic diagram of a video retrieval device for implementing an embodiment of the invention.
Detailed Description
Various aspects are now described with reference to the drawings. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects. It may be evident, however, that such aspect(s) may be practiced without these specific details.
As used in this application, references herein to "devices," "frameworks," "encoders," "modules," "units," "sub-modules," and the like are intended to refer to computer-related entities, such as, but not limited to, hardware, software executed by a general-purpose or special-purpose processor, firmware, software, or any combination thereof. For example, these "devices", "frames", "encoders", "modules", "units", "sub-modules" may be, but are not limited to: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be such "devices", "frameworks", "encoders", "modules", "units". One or more of these "devices," "frameworks," "encoders," "modules," "units," and "sub-modules" may be located in an executing process and/or thread of execution, and these "devices," "frameworks," "encoders," "modules," "units" may be located on one computer and/or distributed across two or more computers. In addition, these "devices," "frameworks," "encoders," "modules," "units," "sub-modules" may be executed from various computer-readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets, e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal.
When implemented in hardware, these "devices," "frameworks," "encoders," "modules," "units," "sub-modules" may be implemented or performed with general-purpose processors, neural network processors, Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Additionally, at least one processor may include one or more modules operable to perform one or more of the steps and/or operations described above. The neural network processor may be used to implement various basic modules of the neural network, such as a fully connected network, a Convolutional Neural Network (CNN), and so on.
When implemented in hardware, these "devices", "frameworks", "encoders", "modules", "units", "sub-modules" may also be implemented on a system on a chip (SoC).
When implemented in hardware circuitry such as an ASIC, FPGA, or the like, these "devices", "frames", "encoders", "modules", "units", "sub-modules" may include various circuit blocks configured to perform various functions. Those skilled in the art can design and implement the circuits in various ways to implement the various functions disclosed herein based on various constraints imposed on the overall system.
When implemented in software, these "devices," frames, "" encoders, "" modules, "" units, "" sub-modules "may be processor or computer executable code that may be stored on a computer or processor readable storage medium or in the cloud (server farm) of a network, and which when executed may implement these" devices, "" frames, "" encoders, "" modules, "" units, "" sub-modules, "or methods of using these" devices, "" frames, "" encoders, "" modules, "" units, "" sub-modules.
The invention relates to video retrieval based on neural networks. The video retrieval framework herein can be applied to two tasks: "title to video" and "video to title". In both tasks, the general difference is how the (video, title) pair is composed. In the "title-to-video" task, a specific search is an input title, and the search target is related video, so that the input of the video search framework is a plurality of (video candidate, title) pairs; in the "video to title" task, the specific search is the input video, and the search target is the related title or related text description, so the input of the video search framework is a plurality of (video, title candidate) pairs.
Reference herein to a "video retrieval framework," "video retrieval device," "video retrieval module," or "video retrieval unit" means any video retrieval function that may be implemented as software, hardware, or a combination of software and hardware, and these terms may be used interchangeably herein. In addition, "similarity estimation," "similarity calculation," and "similarity determination" are used interchangeably to refer to computing the similarity between two inputs (e.g., a video feature representation and a text feature representation), implemented by hardware, software, or a combination of hardware and software. The terms "training" and "learning" are likewise generally used interchangeably herein.
As described above, the present invention has been made mainly to address the technical problems of document 2. The MMT framework is an encoder for video retrieval that was proposed on the basis of the attention-based Transformer architecture of document 3. Accordingly, both of these papers and the various documents cited therein are incorporated herein by reference as part of this disclosure.
As described above, the Transformer framework used in MMT is inefficient at learning local dependencies because it lacks inductive bias; on the other hand, as the depth of the model increases, the learned features are prone to learning bias, mainly because deeper models suffer more easily from gradient divergence and larger errors, which makes training difficult and degrades model performance.
In order to solve the above technical problems, a novel video encoder is proposed for video retrieval.
In one aspect, in view of the above-described drawbacks of the Transformer framework, a combination of a self-attention sub-module and a span-based dynamic convolution sub-module is employed herein in the multi-head attention module of the transformer encoder.
The self-attention sub-module attends, at each layer of the fully connected network, to all of the input video embeddings and extracts the semantics of events occurring across the multiple modalities of the input video.
The span-based dynamic convolution sub-module shares the query with the self-attention sub-module but uses a different key, so that its convolution kernels are generated separately from the attention map of the self-attention sub-module. The span-based dynamic convolution sub-module fuses global, local and contextual visual/text information and can produce more discriminative features. More specifically, the span-based dynamic convolution sub-module uses depth-wise separable convolution to collect information over a span of tokens (a token represents an individual element of the matrix) and then dynamically generates a convolution kernel, so that the local relationships of the input tokens are collected on the basis of their local context. The span-based dynamic convolution sub-module addresses the lack of inductive bias in self-attention, and such span-based convolution collects context information more efficiently. The final output is sent to the feed-forward layer for further processing.
In another aspect, also in view of the above-described drawbacks of the Transformer framework, the multi-head attention module of the transformer encoder additionally uses shortcuts from the previous layer as enhanced shortcuts, in addition to the shortcut of the current layer. By placing more identity projections in parallel with the multi-head attention module that combines convolution and self-attention, the model capacity can be effectively stabilized and the learning bias that may arise in a deep network can be prevented.
Fig. 1 shows a schematic diagram of a video retrieval framework for video retrieval according to one embodiment of the invention. As shown, the video retrieval framework herein includes a video encoder, a text encoder, and a similarity estimation unit.
The video encoder receives a video input, passes the video input through a neural network, and obtains a video characteristic representation of the video input at an output.
The text encoder receives a text input and obtains a text feature representation of the text input. In the "title to video" task the text input may be the query, while in the "video to title" task the text input may come from a candidate title in a preset candidate-title library.
As described above, because text analysis is relatively simple and text retrieval has been studied for decades, text encoders for generating text feature representations are already relatively mature and efficient. For example, a CLIP text encoder is used in document 1. One specific embodiment of the text encoder may employ the CLIP text encoder of Radford et al. Accordingly, that paper and the various documents cited therein are incorporated herein by reference as part of this disclosure. Of course, the present invention is not limited thereto, and any text encoder may be used. The text encoder may be one that is initialized when the video retrieval framework is trained, or one that has been trained in advance on a large amount of training data. When a text encoder pre-trained on a large amount of training data is employed, a two-step training method for training the video retrieval framework according to one aspect of the present invention may be used to obtain an optimal text encoder and video retrieval framework.
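As an illustration only, the following minimal sketch shows how a pre-trained CLIP text encoder can produce a text feature representation; it assumes the publicly released OpenAI clip package, which the patent does not prescribe.

```python
import torch
import clip  # assumes the OpenAI CLIP package (github.com/openai/CLIP) is installed

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)              # pre-trained CLIP model
tokens = clip.tokenize(["how to build a house"]).to(device)  # example title/query text
with torch.no_grad():
    text_features = model.encode_text(tokens)                # (1, 512) text feature representation
```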
The similarity estimation unit receives the video feature representation from the video encoder and the text feature representation from the text encoder and estimates their similarity, for example in the manner of formulas (1) and (2) above.
Fig. 2 shows a schematic diagram of a video encoder in a video retrieval framework for video retrieval according to one embodiment of the invention. The video encoder includes a Convolutional Neural Network (CNN) and a multi-modal Transformer (hereinafter referred to as MMT). The CNN extracts, from the video input, a variety of modalities such as motion, RGB, scene, face, OCR, speech and audio, each preprocessed by a network pre-trained in its own domain. Details of the multi-modality extraction are discussed, for example, in document 2 and its references.
The multiple modalities extracted by the CNN are input into the MMT as multiple video embeddings, see E1-EN in Fig. 2, where E1, E2, …, EN denote the video embeddings corresponding to the respective modalities. In one particular embodiment, N = 7, and E1, E2, …, EN represent the video embeddings corresponding to the 7 modalities of motion, RGB, scene, face, OCR, speech and audio. Of course, the present invention is not limited thereto; for example, the modalities may also include subtitles, graphics, and so on.
MMT is a fully connected network whose nodes are multiple transformer encoders (Trm), as shown on the left side of Fig. 2. The inputs of the MMT are the video embeddings E1, E2, …, EN corresponding to the respective modalities. In a preferred embodiment, each video embedding is composed of an extracted feature together with feature-type, timing and location encodings.
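As a minimal sketch (not the patent's reference implementation; class and parameter names are illustrative), one expert's video embedding can be composed as in document 2, i.e. Ω(v) = F(v) + E(v) + T(v): the extracted features plus a learned expert-type embedding plus a temporal embedding.

```python
import torch
import torch.nn as nn

class VideoEmbedding(nn.Module):
    """Compose one expert's embedding: features + expert-type embedding + temporal embedding."""
    def __init__(self, num_experts: int, max_time_steps: int, d_model: int):
        super().__init__()
        self.expert_embed = nn.Embedding(num_experts, d_model)   # E(v): distinguishes modalities
        self.time_embed = nn.Embedding(max_time_steps, d_model)  # T(v): temporal position

    def forward(self, features: torch.Tensor, expert_id: int) -> torch.Tensor:
        # features: (K, d_model) sequence of K features from one pre-trained expert, F(v)
        k = features.size(0)
        e = self.expert_embed(torch.full((k,), expert_id, dtype=torch.long))
        t = self.time_embed(torch.arange(k))
        return features + e + t                                   # Omega(v) = F(v) + E(v) + T(v)
```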
In a preferred embodiment, the MMT may comprise a 6-layer network of Trm, i.e., there is a 6-layer Trm network between the inputs and the outputs. The input of each Trm takes the form of a query (Q), a key (K) and a value (V). This form is well known in the art (for example, from document 3) and is not described further here.
The right side of Fig. 2 shows an exemplary structure of each Trm in the MMT network. Each Trm includes a multi-head attention (MSA) module that receives the query (Q), key (K), and value (V) forming the input of the Trm and maps the query and a set of key-value pairs to an output.
Each Trm also includes an adder (e.g., a first adder) at the output of the multi-head attention module that adds the output of the MSA module to the input of the Trm (i.e., the input of the MSA module), thereby implementing an MSA module with a shortcut, which can be expressed as
ShortcutMSA(Z_l) = MSA(Z_l) + Z_l    (3)
where the identity projection (i.e., Z_l) is parallel to the MSA module and "l" denotes the l-th layer, e.g., one of the 6 Trm layers. Intuitively, the shortcut connection bypasses the MSA module and provides an alternative path, so that features can be passed directly to the next layer without interference from other tokens. The success of shortcuts suggests that bypassing the MSA layer through additional paths is an effective way to enhance the feature representation and improve the performance of such structures.
Each Trm also includes a layer normalization module (e.g., a first layer normalization module) that layer-normalizes the output of the first adder.
Each Trm also includes a feed-forward module that receives the output of the first layer normalization module and performs feed-forward. In a preferred embodiment, the feed-forward takes the form of a fully connected feed-forward network.
Each Trm also includes an adder (e.g., a second adder) at the output of the feed-forward module that adds the output of the feed-forward module to the input of the feed-forward module, thereby implementing the feed-forward module with a shortcut, which can be represented in a form similar to equation (3) above.
Each Trm also includes a layer normalization module (e.g., a second layer normalization module) following the second adder that layer normalizes the output of the second adder.
In one embodiment, the output of the second layer normalization module is taken as the output of the corresponding Trm. In other embodiments, other optional modules, such as a linear module, a Softmax module, etc., may be added between the second layer normalization module and the output of the corresponding Trm.
As described above, the MMT may comprise a 6-layer fully connected network of multiple Trm, so that these Trm produce the effect of multiple iterations.
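The following is a minimal sketch, under stated assumptions, of one Trm block as described above (multi-head attention with a shortcut, layer normalization, a feed-forward network with a shortcut, and a second layer normalization); a standard nn.MultiheadAttention stands in here for the patent's combined self-attention/span-based-dynamic-convolution module, which is sketched separately below.

```python
import torch
import torch.nn as nn

class TrmBlock(nn.Module):
    """One transformer encoder (Trm) block: MSA + shortcut + LN, FFN + shortcut + LN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # stand-in MSA
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.msa(z, z, z)      # Q = K = V = z
        z = self.norm1(attn_out + z)         # first adder (shortcut) + first layer norm, eq. (3)
        z = self.norm2(self.ffn(z) + z)      # second adder (shortcut) + second layer norm
        return z
```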
Fig. 4 shows a schematic structural diagram of one embodiment of a transformer encoder (Trm) according to one embodiment of the present invention. Fig. 4 presents a high-level block diagram of an exemplary multi-head attention (MSA) module in the Trm according to the present invention. As shown in Fig. 4, the MSA module includes a self-attention (SA) sub-module and a span-based dynamic convolution (SDC) sub-module. The SA sub-module and the SDC sub-module each receive the query (Q), key (K) and value (V) input to the Trm and process them respectively; their outputs are joined by a coupler and output. In one embodiment, the output of the coupler is taken as the output of the corresponding MSA module. In other embodiments, further optional modules, such as a linear module, a Softmax module, etc., may be added between the output of the coupler and the output of the corresponding MSA module.
In one embodiment, multiple parallel MSA layers may be processed in parallel, and the outputs of the couplers of the parallel MSA layers are linearly combined by a linear layer.
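A minimal sketch of this combination follows, assuming the coupler is a concatenation followed by a linear projection (the patent only states that the two outputs are joined); the sa and sdc arguments stand for the self-attention and span-based dynamic convolution sub-modules sketched after the descriptions of Fig. 5 and Fig. 6 below.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Combined MSA module of Fig. 4: SA branch and SDC branch joined by a coupler."""
    def __init__(self, sa: nn.Module, sdc: nn.Module, d_head: int, d_model: int):
        super().__init__()
        self.sa = sa                               # self-attention sub-module (Fig. 5)
        self.sdc = sdc                             # span-based dynamic convolution sub-module (Fig. 6)
        self.out = nn.Linear(2 * d_head, d_model)  # optional linear module after the coupler

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        joined = torch.cat([self.sa(q, k, v), self.sdc(q, k, v)], dim=-1)  # coupler
        return self.out(joined)
```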
Fig. 5 shows a schematic structural diagram of one embodiment of a self-attention (SA) sub-module in a transformer encoder (Trm) according to one embodiment of the invention. The self-attention sub-module receives the input query (Q), key (K), and value (V), and maps the query and a set of key-value pairs to an output.
As shown in Fig. 5, the SA sub-module includes three linear layers, e.g., a first, a second and a third linear layer, which perform linear combinations of the query (Q), the key (K), and the value (V), respectively. Those skilled in the art will readily appreciate that these linear layers merely recombine (i.e., linearly combine) the elements of the query (Q), key (K), and value (V) without further transforming them, so that the output of each linear layer in Fig. 5 still corresponds to the query (Q), key (K), and value (V).
The SA sub-module also includes a "MatMul" (e.g., a first multiplier) at the outputs of the first and second linear layers (corresponding to Q and K). "MatMul" is typically implemented with the torch.matmul function, which multiplies two tensor matrices; this is well known in the art and is not described further here. The first multiplier thus tensor-matrix multiplies the outputs of the first and second linear layers.
The SA sub-module also includes a scaling module (e.g., a scaler) at the output of the MatMul, whose purpose is to scale the matrix values so as to mitigate vanishing gradients.
The SA sub-module also includes a SoftMax unit at the output of the scaler, which applies a SoftMax function to the output of the scaler.
The SA sub-module also includes another MatMul (e.g., a second multiplier) that performs a further tensor-matrix multiplication of the output of the SoftMax unit with the output of the third linear layer (corresponding to V).
The SA sub-module further includes a fourth linear layer that receives the output of the second multiplier, performs a linear combination, and provides the output of the self-attention sub-module.
In one embodiment, the output of the fourth linear layer is taken as the output of the corresponding SA sub-module. In other embodiments, further optional modules, such as another scaling module, a Softmax module, etc., may be added between the output of the fourth linear layer and the output of the corresponding SA sub-module.
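A minimal single-head sketch of this SA sub-module follows; the dimensions are illustrative assumptions, not values taken from the patent.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Self-attention sub-module of Fig. 5 (single head)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head)   # first linear layer (Q)
        self.w_k = nn.Linear(d_model, d_head)   # second linear layer (K)
        self.w_v = nn.Linear(d_model, d_head)   # third linear layer (V)
        self.w_o = nn.Linear(d_head, d_head)    # fourth linear layer

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.w_q(q), self.w_k(k), self.w_v(v)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))  # MatMul + scaler
        attn = torch.softmax(scores, dim=-1)                                   # SoftMax unit
        return self.w_o(torch.matmul(attn, V))                                 # MatMul + linear
```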
Fig. 6 shows a schematic structural diagram of one embodiment of a span-based dynamic convolution (SDC) sub-module in a transformer encoder (Trm) according to one embodiment of the present invention. The SDC sub-module receives the input query (Q), key (K), and value (V) and maps the query and a set of convolved key-value pairs to an output.
The main difference from the SA sub-module of Fig. 5 is that the linear layer corresponding to the key (K) is replaced with a convolution layer (Conv). By using this convolution layer, the SDC not only has the flexibility of dynamic convolution but also generates the kernel within a local range around the current token. Such convolution can effectively exploit local dependencies and can distinguish between different meanings of the same token. Span-based dynamic convolution first uses a depth-wise separable convolution to collect span-based token information, as shown in Fig. 6, and then dynamically generates a convolution kernel. Because the kernel is generated from the local context of the input tokens rather than from a single token, it can efficiently capture local dependencies. Specifically, with the query Q and the convolved key Ks as inputs, the kernel is generated by (4):
f(Q, Ks) = softmax(W_f(Q ⊙ Ks))    (4)
where ⊙ denotes point-wise multiplication, W_f is a linear transformation, and f is the resulting convolution kernel. As shown in Fig. 6, this operator is referred to as span-based dynamic convolution. The output may be written as:
SDC(Q, Ks, V; W_f, i) = LConv(V, softmax(W_f(Q ⊙ Ks)), i)    (5)
where LConv denotes a depth-wise lightweight convolution that ties the weights along the channel dimension but uses a different convolution kernel at each position i; its computation can be written as in equation (6):
LConv(V, W, i) = Σ_{j=1}^{k} W_j · V_{i+j-⌈(k+1)/2⌉}    (6)
where W_j represents the j-th element of the convolution kernel (of width k).
A more detailed account of LConv is given in Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli, "Pay less attention with lightweight and dynamic convolutions," arXiv preprint arXiv:1901.10430, 2019 (hereinafter "document 4"), which paper and the documents it cites are incorporated herein by reference as part of this disclosure.
As shown in Fig. 6, the SDC sub-module includes linear layers, e.g., a fifth and a sixth linear layer, for performing linear combinations of the query (Q) and the value (V), respectively.
The SDC sub-module also includes a convolution layer for performing a vector convolution on the key (K) and outputting the convolved key (Ks). As shown in Fig. 6, the inputs of the SDC sub-module are thus converted into Q, Ks and V.
The SDC sub-module also includes a multiplier ("multiply", e.g., a third multiplier) for point-wise multiplying the outputs of the fifth linear layer and the convolution layer. Those skilled in the art will readily appreciate that this follows from the use of a convolution layer for the key (K).
The SDC sub-module also includes a linear layer (e.g., a seventh linear layer) at the output of the third multiplier for linearly processing the output of the third multiplier.
The SDC sub-module further includes a SoftMax unit at the output of the seventh linear layer that applies a SoftMax function to the output of the seventh linear layer.
The SDC sub-module further includes an LConv unit (lightweight convolution unit) that performs a lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer corresponding to V. The lightweight vector convolution may, for example, take the form described above in connection with equations (5) and (6). The lightweight vector convolution can effectively collect local dependencies; for example, it allows segments with high similarity to obtain a high score during video retrieval, while the MatMul in the SA sub-module allows frames that are strongly correlated in the content vector to obtain higher scores during video retrieval.
The SDC sub-module also includes a linear layer (e.g., an eighth linear layer) at the output of the LConv unit that receives the output of the lightweight convolution, performs a linear combination, and provides the output. In one embodiment, the output of the eighth linear layer may be taken as the output of the SDC sub-module. In other embodiments, further optional modules, such as a scaling module, a Softmax module, etc., may be added between the output of the eighth linear layer and the output of the corresponding SDC sub-module.
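A minimal single-head sketch of this SDC sub-module follows; the dimensions and kernel size are illustrative assumptions, a plain Conv1d stands in for the depth-wise separable convolution, and the lightweight convolution of eqs. (5)-(6) is implemented by gathering a window of V around each position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    """Single-head sketch of the span-based dynamic convolution sub-module (Fig. 6)."""
    def __init__(self, d_model: int, d_head: int, kernel_size: int = 3):
        super().__init__()
        assert kernel_size % 2 == 1, "an odd span is assumed so the padding stays symmetric"
        self.k = kernel_size
        self.w_q = nn.Linear(d_model, d_head)                     # fifth linear layer (Q)
        self.w_v = nn.Linear(d_model, d_head)                     # sixth linear layer (V)
        self.span_conv = nn.Conv1d(d_model, d_head, kernel_size,  # convolution layer producing Ks
                                   padding=kernel_size // 2)
        self.w_f = nn.Linear(d_head, kernel_size)                 # seventh linear layer (W_f)
        self.w_o = nn.Linear(d_head, d_head)                      # eighth linear layer

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (B, L, d_model)
        Q, V = self.w_q(q), self.w_v(v)                            # (B, L, d_head)
        Ks = self.span_conv(k.transpose(1, 2)).transpose(1, 2)     # convolved key Ks
        kernels = torch.softmax(self.w_f(Q * Ks), dim=-1)          # eq. (4): (B, L, k)
        # eqs. (5)-(6): lightweight convolution of V with a per-position kernel
        V_pad = F.pad(V, (0, 0, self.k // 2, self.k // 2))         # pad the sequence dimension
        windows = V_pad.unfold(1, self.k, 1)                       # (B, L, d_head, k)
        out = torch.einsum('blk,bldk->bld', kernels, windows)
        return self.w_o(out)
```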
Fig. 7 shows a schematic structural diagram of another embodiment of a transformer encoder (Trm) according to an embodiment of the present invention.
Fig. 7 adds more shortcuts than Fig. 4. Specifically, as described above, in Fig. 4 the MSA module and the feed-forward module are each equipped with a shortcut, which can be expressed as equation (3), where the identity projection (i.e., Z_l) is parallel to the MSA module and "l" denotes the l-th layer, e.g., one of the 6 Trm layers. Intuitively, the shortcut connection bypasses the MSA module and the feed-forward module, provides an alternative path, and allows features to be passed directly to the next layer without interference from other tokens. The success of shortcuts suggests that bypassing the MSA layer and the feed-forward module through additional paths is an effective way to enhance the feature representation and improve the performance of such structures.
In Fig. 7, N enhanced shortcuts are added to the MSA module and to the feed-forward module on top of the per-layer ("l") shortcuts of Fig. 4. Taking the MSA module as an example, the MSA module equipped with N enhanced shortcuts may be expressed as:
AugMSA(Z_l) = MSA(Z_l) + Z_l + Σ_{i=1}^{N} T_{l,i}(Z_l; Θ_{l,i})    (7)
where "l" denotes the l-th of the L layers (e.g., the 6 Trm layers), T_{l,i} is the i-th enhanced shortcut connection of the l-th layer, Θ_{l,i} denotes its parameters, L indicates the number of layers (for example, L = 6 when a 6-layer network is used), and N is an integer greater than or equal to 1 indicating the number of added enhanced shortcuts. A simple form of T_{l,i} is a sequence of a linear projection and an activation function, i.e.,
T_{l,i}(Z_l; Θ_{l,i}) = σ(Z_l Θ_{l,i})    (8)
where Θ_{l,i} is a d×d weight matrix (d is the dimension, determined by the dimension of Z) and σ is a nonlinear activation function (e.g., GELU) or another module. In equation (8), T_{l,i} treats each token independently and preserves its specificity, which is complementary to the aggregation of different tokens performed by the MSA module. Note that the identity mapping is a special case of (8), i.e., σ(x) = x with Θ_{l,i} an identity matrix. Comparative experiments show that enhanced shortcuts of the form σ(x) = x are superior to the other shortcuts.
In a preferred embodiment, N = 1, i.e., one enhanced shortcut connection is added.
In addition to the original shortcut, the enhanced shortcuts provide more alternative paths around the attention mechanism; at the same time they strengthen the original features, suppress gradient divergence, and prevent the tokens from gradually acquiring learning bias as the network deepens.
Enhanced shortcuts for the feed-forward module are formed in a similar manner and are not described again.
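A minimal sketch of the enhanced shortcuts of eqs. (7)-(8) follows; it assumes the attention module takes (Q, K, V) as in the sketches above, and the number of shortcuts and the GELU activation are illustrative choices.

```python
import torch
import torch.nn as nn

class AugmentedShortcutMSA(nn.Module):
    """MSA module with N enhanced shortcuts: AugMSA(Z) = MSA(Z) + Z + sum_i T_i(Z), eq. (7)."""
    def __init__(self, msa: nn.Module, d_model: int, n_shortcuts: int = 1):
        super().__init__()
        self.msa = msa  # e.g., the combined attention module sketched after Fig. 4
        # each T_i(Z) = sigma(Z * Theta_i): a linear projection followed by an activation, eq. (8)
        self.shortcuts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
             for _ in range(n_shortcuts)])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.msa(z, z, z) + z + sum(t(z) for t in self.shortcuts)
```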
Fig. 8 shows a schematic diagram of a video retrieval device for implementing an embodiment of the invention. The apparatus includes a processor and a memory. As described above, the processor may be a general purpose processor, a neural network processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any circuit module designed to perform the functions described herein. The processors include various processors located in remote computers or servers or server clusters or clouds.
The memory may be any type of memory capable of storing programs, code, and manipulated data. The memory includes various stores located in remote computers or servers or server clusters or clouds. The memory contains instructions that, when executed by the processor, enable the methods or the apparatus described herein to be implemented. The memory may also store data manipulated by the methods and apparatus described herein, such as video input, text input, intermediate data, various model parameters, neural network parameters, and the like.
The device may also be implemented in a system on a chip (SOC).
The above aspects of the present invention can be summarized as, but are not limited to, the following aspects, which, as will be readily appreciated, may be combined, omitted, or substituted in any manner.
In aspect 1, a method for retrieving video, comprising:
obtaining a video feature representation of an input video using a video encoder that processes a plurality of modalities extracted from the input video with a multi-modal Transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities, for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with transformer encoders (Trm) as its nodes;
obtaining a text feature representation of the input text using a text encoder;
calculating a similarity between the video feature representation and the text feature representation,
wherein the transformer encoder (Trm) comprises a multi-head attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-head attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for joining the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module.
Aspect 2, the method of aspect 1, wherein the self-attention sub-module comprises:
a first, a second and a third linear layer for performing a linear combination of the query (Q), the key (K) and the value (V), respectively;
a first multiplier for tensor-matrix multiplying the outputs of the first and second linear layers;
a scaler for scaling the output of the first multiplier;
a SoftMax unit for applying a SoftMax function to the output of the scaler;
a second multiplier for tensor-matrix multiplying the output of the SoftMax unit and the output of the third linear layer; and
a fourth linear layer that receives the output of the second multiplier, performs a linear combination, and provides the output of the self-attention sub-module.
Aspect 3, the method of aspect 1 or 2, wherein the span-based dynamic convolution sub-module further comprises:
a fifth and a sixth linear layer for performing a linear combination of the query (Q) and the value (V), respectively;
a convolution layer for performing a vector convolution on the key (K) and outputting the convolved key (Ks);
a third multiplier for performing a point-wise tensor-matrix multiplication of the outputs of the fifth linear layer and the convolution layer;
a seventh linear layer for receiving the output of the third multiplier and performing a linear transformation;
a SoftMax unit for applying a SoftMax function to the output of the seventh linear layer;
a lightweight convolution unit for performing a lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer; and
an eighth linear layer for receiving the output of the lightweight convolution unit, performing a linear combination, and providing the output.
Aspect 4, the method of any one of aspects 1-3, wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module to the input of the transformer encoder (Trm);
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module to the input of the first layer normalization module; and
a second layer normalization module for performing layer normalization on the output of the second adder.
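Aspect 4's encoder block wires the multi-headed attention module and the feed-forward module with post-norm residual connections; note that the second adder takes its residual from the input of the first layer normalization module, which the sketch below follows literally. It assumes PyTorch, and the attention argument can be any module with a (q, k, v) interface, such as the sub-modules sketched above.

import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    def __init__(self, d_model: int, attention: nn.Module, d_ff: int = 2048):
        super().__init__()
        self.attn = attention                      # multi-headed attention module
        self.norm1 = nn.LayerNorm(d_model)         # first layer normalization module
        self.ffn = nn.Sequential(                  # feed-forward module
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)         # second layer normalization module

    def forward(self, x):
        a = x + self.attn(x, x, x)                 # first adder: attention output + block input
        h = self.ffn(self.norm1(a))                # first LayerNorm -> feed-forward
        return self.norm2(a + h)                   # second adder -> second LayerNorm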
Aspect 5, the method of any one of aspects 1-3, wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module, the input of the transformer encoder (Trm), and N enhanced shortcuts;
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module, the input of the first layer normalization module, and N enhanced shortcuts; and
a second layer normalization module for performing layer normalization on the output of the second adder.
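Aspect 5 augments both residual sums with N enhanced shortcuts; per aspects 6 and 7 below, each enhanced shortcut is a parameterized projection of the corresponding shortcut connection, i.e. a linear projection followed by an activation function. The sketch below assumes PyTorch; the value of N, the GELU activation, and feeding the second group of shortcuts from the first adder's output are illustrative assumptions.

import torch.nn as nn

class AugmentedShortcutBlockSketch(nn.Module):
    def __init__(self, d_model: int, attention: nn.Module, n_shortcuts: int = 2, d_ff: int = 2048):
        super().__init__()
        self.attn = attention
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # N enhanced shortcuts per adder: each is a linear projection followed by an activation
        self.shortcuts1 = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_shortcuts))
        self.shortcuts2 = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_shortcuts))

    def forward(self, x):
        # first adder: attention output + identity shortcut + N enhanced shortcuts
        a = x + self.attn(x, x, x) + sum(s(x) for s in self.shortcuts1)
        h = self.ffn(self.norm1(a))
        # second adder: feed-forward output + input of first LayerNorm + N enhanced shortcuts
        return self.norm2(a + h + sum(s(a) for s in self.shortcuts2))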
Aspect 6, the method of aspect 5, wherein the enhanced shortcuts are obtained by parameterized projection of the corresponding shortcut connections.
Aspect 7, the method of aspect 6, wherein the parameterized projection is a sequence of linear projections and activation functions.
Aspect 8, the method of any one of aspects 1-7, wherein the video encoder further comprises:
a convolutional neural network for receiving the input video and extracting the plurality of modalities from the input video.
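As an illustration of the modality extraction in aspect 8, the sketch below pulls per-frame appearance features with a pretrained CNN. The patent does not prescribe a specific network; torchvision's ResNet-50 (torchvision >= 0.13 weights API) is used here purely as a stand-in, and other modalities (audio, motion, and so on) would use their own extractors before being embedded into E1-EN.

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class AppearanceExtractorSketch(nn.Module):
    # Per-frame appearance features from a pretrained CNN (illustrative stand-in).
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # keep the convolutional stages and global average pooling, drop the classifier
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    @torch.no_grad()
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224) -> (num_frames, 2048) frame embeddings
        return self.features(frames).flatten(1)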
Aspect 9, a video retrieval apparatus comprising:
a video encoder that obtains a video feature representation of an input video, the video encoder processing a plurality of modalities extracted from the input video with a multi-modal transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with a transformer encoder (Trm) at each node;
a text encoder that obtains a text feature representation of the input text;
a similarity calculation unit that calculates a similarity between the video feature representation and the text feature representation for determining a match between video and text,
wherein the transformer encoder (Trm) comprises a multi-headed attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-headed attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for connecting the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module.
Aspect 10, the apparatus of aspect 9, wherein the self-attention sub-module comprises:
a first, a second, and a third linear layer for applying a linear transformation to the query (Q), the key (K), and the value (V), respectively;
a first multiplier for performing a tensor matrix multiplication of the outputs of the first and second linear layers;
a scaler for scaling the output of the first multiplier;
a SoftMax unit for applying the SoftMax function to the output of the scaler;
a second multiplier for performing a tensor matrix multiplication of the output of the SoftMax unit and the output of the third linear layer; and
a fourth linear layer for receiving the output of the second multiplier, applying a linear transformation, and providing the result as the output of the self-attention sub-module.
Aspect 11, the apparatus of aspect 9 or 10, wherein the span-based dynamic convolution sub-module further comprises:
a fifth and a sixth linear layer for applying a linear transformation to the query (Q) and the value (V), respectively;
a convolution layer for performing vector convolution on the key (K) and outputting the convolved key (Ks);
a third multiplier for performing a tensor matrix point-wise multiplication of the outputs of the fifth linear layer and the convolution layer;
a seventh linear layer for receiving the output of the third multiplier and applying a linear transformation;
a SoftMax unit for applying the SoftMax function to the output of the seventh linear layer;
a lightweight convolution unit for performing lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer; and
an eighth linear layer for receiving the output of the lightweight convolution unit, applying a linear transformation, and providing the result as the output of the span-based dynamic convolution sub-module.
Aspect 12, the apparatus of any one of aspects 9-11, wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module to the input of the transformer encoder (Trm);
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module to the input of the first layer normalization module; and
a second layer normalization module for performing layer normalization on the output of the second adder.
Aspect 13, the apparatus of any one of aspects 9-11, wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module, the input of the transformer encoder (Trm), and N enhanced shortcuts;
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module, the input of the first layer normalization module, and N enhanced shortcuts; and
a second layer normalization module for performing layer normalization on the output of the second adder.
Aspect 14, the apparatus of aspect 13, wherein the enhanced shortcuts are obtained by parameterized projection of the corresponding shortcut connections.
Aspect 15, the apparatus of aspect 14, wherein the parameterized projection is a sequence of linear projections and activation functions.
Aspect 16, the apparatus of any one of aspects 9-15, wherein the video encoder further comprises:
a convolutional neural network for receiving the input video and extracting the plurality of modalities from the input video.
Aspect 17, a computer readable storage medium storing code for performing video retrieval, the code, when executed, being capable of implementing the method of any one of aspects 1-8 or realizing the apparatus of any one of aspects 9-16.
While the foregoing disclosure discusses exemplary aspects and/or embodiments, it should be noted that many changes and modifications could be made herein without departing from the scope of the described aspects and/or embodiments as defined by the appended claims. Furthermore, although elements of the described and/or illustrated embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. Additionally, all or a portion of any aspect and/or embodiment may be utilized in combination with all or a portion of any other aspect and/or embodiment, unless stated to the contrary.

Claims (13)

1. A method for retrieving video, comprising:
obtaining a video feature representation of an input video using a video encoder that processes a plurality of modalities extracted from the input video with a multi-modal transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with a transformer encoder (Trm) at each node;
obtaining a text feature representation of the input text using a text encoder;
calculating a similarity between the video feature representation and the text feature representation,
wherein the transformer encoder (Trm) comprises a multi-headed attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-headed attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for connecting the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module,
wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module, the input of the transformer encoder (Trm), and N enhanced shortcuts;
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module, the input of the first layer normalization module, and N enhanced shortcuts; and
a second layer normalization module for performing layer normalization on the output of the second adder.
2. The method of claim 1, wherein the self-attention sub-module comprises:
a first, a second, and a third linear layer for applying a linear transformation to the query (Q), the key (K), and the value (V), respectively;
a first multiplier for performing a tensor matrix multiplication of the outputs of the first and second linear layers;
a scaler for scaling the output of the first multiplier;
a SoftMax unit for applying the SoftMax function to the output of the scaler;
a second multiplier for performing a tensor matrix multiplication of the output of the SoftMax unit and the output of the third linear layer; and
a fourth linear layer for receiving the output of the second multiplier, applying a linear transformation, and providing the result as the output of the self-attention sub-module.
3. The method of claim 1 or 2, wherein the span-based dynamic convolution sub-module further comprises:
a fifth and a sixth linear layer for applying a linear transformation to the query (Q) and the value (V), respectively;
a convolution layer for performing vector convolution on the key (K) and outputting the convolved key (Ks);
a third multiplier for performing a tensor matrix point-wise multiplication of the outputs of the fifth linear layer and the convolution layer;
a seventh linear layer for receiving the output of the third multiplier and applying a linear transformation;
a SoftMax unit for applying the SoftMax function to the output of the seventh linear layer;
a lightweight convolution unit for performing lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer; and
an eighth linear layer for receiving the output of the lightweight convolution unit, applying a linear transformation, and providing the result as the output of the span-based dynamic convolution sub-module.
4. The method of claim 1, wherein the enhanced shortcuts are obtained by parameterized projection of the corresponding shortcut connections.
5. The method of claim 4, wherein the parameterized projection is a sequence of linear projections and activation functions.
6. The method of claim 1 or 2, wherein the video encoder further comprises:
a convolutional neural network for receiving the input video and extracting the plurality of modalities from the input video.
7. A video retrieval device comprising:
a video encoder that obtains a video feature representation of an input video, the video encoder processing a plurality of modalities extracted from the input video with a multi-modal transformer (MMT), the MMT comprising:
a plurality of inputs corresponding to the plurality of modalities for receiving a plurality of video embeddings (E1-EN) corresponding to the plurality of modalities,
a plurality of outputs for outputting a video feature representation (T1-TN) of said input video, and
a fully connected network between the plurality of inputs and the plurality of outputs, with a transformer encoder (Trm) at each node;
a text encoder that obtains a text feature representation of the input text;
a similarity calculation unit that calculates a similarity between the video feature representation and the text feature representation for determining a match between video and text,
wherein the transformer encoder (Trm) comprises a multi-headed attention module that receives as inputs a query (Q), a key (K), and a value (V), and the multi-headed attention module further comprises:
a self-attention sub-module that receives the query (Q), the key (K), and the value (V), and maps the query and a set of key-value pairs to an output;
a span-based dynamic convolution sub-module that receives the query (Q), the key (K), and the value (V), applies a convolution to the key (K) to obtain a convolved key (Ks), and maps the query and a set of convolved key-value pairs to an output; and
a coupler for connecting the output of the self-attention sub-module and the output of the span-based dynamic convolution sub-module,
wherein the transformer encoder (Trm) comprises:
the multi-headed attention module, which receives an input of the transformer encoder (Trm);
a first adder that adds the output of the multi-headed attention module, the input of the transformer encoder (Trm), and N enhanced shortcuts;
a first layer normalization module for performing layer normalization on the output of the first adder;
a feed-forward module for receiving the output of the first layer normalization module and performing a feed-forward transformation;
a second adder that adds the output of the feed-forward module, the input of the first layer normalization module, and N enhanced shortcuts; and
a second layer normalization module for performing layer normalization on the output of the second adder.
8. The apparatus of claim 7, wherein the self-attention sub-module comprises:
a first, a second, and a third linear layer for applying a linear transformation to the query (Q), the key (K), and the value (V), respectively;
a first multiplier for performing a tensor matrix multiplication of the outputs of the first and second linear layers;
a scaler for scaling the output of the first multiplier;
a SoftMax unit for applying the SoftMax function to the output of the scaler;
a second multiplier for performing a tensor matrix multiplication of the output of the SoftMax unit and the output of the third linear layer; and
a fourth linear layer for receiving the output of the second multiplier, applying a linear transformation, and providing the result as the output of the self-attention sub-module.
9. The apparatus of claim 7 or 8, wherein the span-based dynamic convolution sub-module further comprises:
a fifth and a sixth linear layer for applying a linear transformation to the query (Q) and the value (V), respectively;
a convolution layer for performing vector convolution on the key (K) and outputting the convolved key (Ks);
a third multiplier for performing a tensor matrix point-wise multiplication of the outputs of the fifth linear layer and the convolution layer;
a seventh linear layer for receiving the output of the third multiplier and applying a linear transformation;
a SoftMax unit for applying the SoftMax function to the output of the seventh linear layer;
a lightweight convolution unit for performing lightweight vector convolution on the output of the SoftMax unit and the output of the sixth linear layer; and
an eighth linear layer for receiving the output of the lightweight convolution unit, applying a linear transformation, and providing the result as the output of the span-based dynamic convolution sub-module.
10. The apparatus of claim 7, wherein the enhanced shortcuts are obtained by parameterized projection of the corresponding shortcut connections.
11. The apparatus of claim 10, wherein the parameterized projection is a sequence of linear projections and activation functions.
12. The apparatus of claim 7 or 8, wherein the video encoder further comprises:
a convolutional neural network for receiving the input video and extracting the plurality of modalities from the input video.
13. A computer readable storage medium storing code for performing video retrieval, the code when executed being capable of implementing the method of any one of claims 1-6.
CN202210223064.2A 2022-03-09 2022-03-09 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts Active CN114579803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223064.2A CN114579803B (en) 2022-03-09 2022-03-09 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210223064.2A CN114579803B (en) 2022-03-09 2022-03-09 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts

Publications (2)

Publication Number Publication Date
CN114579803A CN114579803A (en) 2022-06-03
CN114579803B true CN114579803B (en) 2024-04-12

Family

ID=81773787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223064.2A Active CN114579803B (en) 2022-03-09 2022-03-09 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts

Country Status (1)

Country Link
CN (1) CN114579803B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11328172B2 (en) * 2020-08-24 2022-05-10 Huawei Technologies Co. Ltd. Method for fine-grained sketch-based scene image retrieval

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
WO2019179496A1 (en) * 2018-03-22 2019-09-26 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and system for retrieving video temporal segments
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
CN113378649A (en) * 2021-05-19 2021-09-10 北京建筑大学 Identity, position and action recognition method, system, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Review of cross-modal retrieval models and feature extraction based on representation learning; Li Zhiyi; Huang Zifeng; Xu Xiaomian; Journal of the China Society for Scientific and Technical Information (情报学报); 2018-04-24 (No. 04); full text *

Also Published As

Publication number Publication date
CN114579803A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
Zou et al. Object detection in 20 years: A survey
US11687588B2 (en) Weakly supervised natural language localization networks for video proposal prediction based on a text query
Wang et al. Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision
Otani et al. Learning joint representations of videos and sentences with web image search
Yang et al. Collaborative learning of gesture recognition and 3D hand pose estimation with multi-order feature analysis
Wang et al. Exploring hybrid spatio-temporal convolutional networks for human action recognition
Li et al. SDTP: Semantic-aware decoupled transformer pyramid for dense image prediction
Lin et al. Regional maximum activations of convolutions with attention for cross-domain beauty and personal care product retrieval
Basavarajaiah et al. GVSUM: generic video summarization using deep visual features
Li et al. Meta parsing networks: Towards generalized few-shot scene parsing with adaptive metric learning
Chazhoor et al. Intelligent automation of invoice parsing using computer vision techniques
Lee et al. Cross-modality attention and multimodal fusion transformer for pedestrian detection
Gao et al. Learning dual-fused modality-aware representations for RGBD tracking
Subramaniam et al. Co-segmentation inspired attention module for video-based computer vision tasks
Zhong et al. BiTransformer: augmenting semantic context in video captioning via bidirectional decoder
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
CN114579803B (en) Video retrieval method, device and storage medium based on dynamic convolution and shortcuts
Li et al. Cross-Domain Transfer Hashing for Efficient Cross-modal Retrieval
Sharma et al. Performance Analysis of Object Detection Algorithms on YouTube Video Object Dataset.
Rao et al. Deep learning-based image retrieval system with clustering on attention-based representations
Tang et al. Towards training-free open-world segmentation via image prompting foundation models
Hong et al. A Transformer-based multi-modal fusion network for 6D pose estimation
Ganesh et al. A New Ontology Convolutional Neural Network for Extorting Essential Elements in Video Mining
CN114564616A (en) Efficient video retrieval model based on multi-mode feature fusion
Han et al. Unleash the Potential of CLIP for Video Highlight Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant