CN114090823A - Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium - Google Patents

Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium Download PDF

Info

Publication number
CN114090823A
Authority
CN
China
Prior art keywords
video
retrieval
text
encoder
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111055136.9A
Other languages
Chinese (zh)
Inventor
范清
唐大闰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Miaozhen Information Technology Co Ltd
Original Assignee
Miaozhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miaozhen Information Technology Co Ltd filed Critical Miaozhen Information Technology Co Ltd
Priority to CN202111055136.9A priority Critical patent/CN114090823A/en
Publication of CN114090823A publication Critical patent/CN114090823A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video retrieval method, a video retrieval device, electronic equipment and a computer-readable storage medium, relating to the technical field of data processing. When video retrieval is performed, retrieval content is first obtained, the type of the retrieval content including voice or text; the retrieval content is converted into a query vector by a pre-trained text feature encoder; and a target video matching the retrieval content is determined according to the query vector and the visual feature vector of each video in a retrieval library, where each visual feature vector is extracted from the corresponding video in the retrieval library by a self-supervised pre-trained visual encoder. In this way, a user can quickly and accurately retrieve a matching video with a voice or text description, based on a guess at or memory of the content of the video being sought; compared with the conventional video retrieval approach based on manual text annotation, this improves accuracy and efficiency, reduces cost, and enhances the user experience.

Description

Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a video retrieval method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, the domestic video industry has grown explosively, and video has become an important medium through which people are entertained, study and socialize. Education and training videos account for a considerable proportion of the content on online education and video sharing platforms. For such platforms, how to quickly, accurately and conveniently retrieve and recommend relevant videos to users from massive video data according to each user's learning intent has become an important measure of content monetization.
In the traditional video retrieval approach, video frame images are manually annotated with text: the video data are described according to the content of the frame images, forming video tags that describe the video content. At retrieval time, the user provides keywords according to his or her interests, and the database returns retrieval results by matching the video tags against the keywords. However, this approach has the following disadvantages. First, annotation is performed mainly by humans and is therefore strongly affected by the annotators' subjective factors, so the same video may receive different descriptions. Second, a text description is a fixed abstraction of the video scene content, so a given video tag is suitable only for specific retrievals. Third, the amount of video data is large and manual annotation takes great effort, especially given today's ever-growing video volume, making the approach costly and inefficient.
In summary, the existing video retrieval approach suffers from low accuracy, high cost and low working efficiency, resulting in a poor user experience.
Disclosure of Invention
The invention aims to provide a video retrieval method, a video retrieval device, electronic equipment and a computer-readable storage medium, so as to improve the accuracy and the working efficiency, reduce the cost and enhance the user experience.
In a first aspect, an embodiment of the present invention provides a video retrieval method, including:
acquiring retrieval content, wherein the type of the retrieval content comprises voice or text;
converting the retrieval content into a query vector according to a pre-trained text feature encoder;
determining a target video matched with the retrieval content according to the query vector and the visual feature vector of each video in a retrieval library; wherein the visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library through a self-supervised pre-trained visual encoder.
Further, the step of converting the search content into a query vector according to a pre-trained text feature encoder includes:
when the type of the retrieval content is voice, converting the retrieval content into a text form to obtain a retrieval text;
and inputting the retrieval text into a pre-trained text feature encoder to obtain a query vector output by the text feature encoder.
Further, the step of determining a target video matching the search content according to the query vector and the visual feature vector of each video in the search library includes:
respectively calculating cosine similarity between the query vector and the visual feature vector of each video in the retrieval library to obtain a similarity value between the retrieval content and each video;
and determining a target video matched with the retrieval content according to the similarity value of the retrieval content and each video.
Further, the method further comprises:
acquiring a training sample, wherein the training sample comprises a batch of video data and at least one piece of text data;
for each piece of video data and each piece of text data, respectively inputting the video data and the text data into a visual encoder network and a text encoder network to obtain a first feature vector corresponding to the video data and a second feature vector corresponding to the text data;
calculating a model loss according to the first feature vector and the second feature vector;
and updating the network parameters of the visual encoder network and the network parameters of the text encoder network according to the model loss so as to obtain a pre-trained visual encoder and a text feature encoder.
Further, the step of calculating a model loss according to the first feature vector and the second feature vector includes:
inputting the first feature vector and the second feature vector into a nonlinear mapping module and a linear mapping module respectively to obtain a third feature vector and a fourth feature vector of a preset dimension;
and inputting the third feature vector and the fourth feature vector into a cross-modal contrastive loss function to calculate the model loss.
Further, the step of updating the network parameters of the visual encoder network and the network parameters of the text encoder network includes:
performing parameter optimization on the visual encoder network and the text encoder network using an Adam optimizer.
Further, the visual encoder network comprises a 3D ResNet50 network; the text encoder network comprises a Word segmentation module, a Word2vec module, a linear layer and a maximum pooling layer which are sequentially connected.
In a second aspect, an embodiment of the present invention further provides a video retrieval apparatus, including:
the acquisition module is used for acquiring retrieval contents, and the types of the retrieval contents comprise voice or text;
the conversion module is used for converting the retrieval content into a query vector according to a pre-trained text feature encoder;
the determining module is used for determining a target video matched with the retrieval content according to the query vector and the visual feature vector of each video in the retrieval library; wherein the visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library through a self-supervised pre-trained visual encoder.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the video retrieval method according to the first aspect when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the video retrieval method according to the first aspect is executed.
According to the video retrieval method, the video retrieval device, the electronic equipment and the computer-readable storage medium provided by the embodiments of the present invention, when video retrieval is performed, retrieval content is first obtained, the type of the retrieval content including voice or text; the retrieval content is converted into a query vector by a pre-trained text feature encoder; and a target video matching the retrieval content is determined according to the query vector and the visual feature vector of each video in the retrieval library, where each visual feature vector is extracted from the corresponding video in the retrieval library by a self-supervised pre-trained visual encoder. In this way, a user can quickly and accurately retrieve a matching video with a voice or text description, based on a guess at or memory of the content of the video being sought; compared with the conventional video retrieval approach based on manual text annotation, this improves accuracy and efficiency, reduces cost, and enhances the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flowchart of a video retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of encoder training according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a text encoder network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a nonlinear mapping module according to an embodiment of the present invention;
FIG. 5 is another schematic flowchart of encoder training according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a video retrieval apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To overcome the shortcomings of the video retrieval approach based on manual text annotation, a content-based video retrieval approach may be adopted. Content-based video retrieval is a fuzzy query technique built on video content analysis; by introducing spatio-temporal video features, it bridges the semantic gap between low-level visual information and high-level abstract concepts.
There is an unbalanced, complementary relationship between different modalities such as images, videos and texts: the amount of information they carry when describing the same subject is often unequal, yet their semantics intersect. The cross-modal retrieval task aims to find semantically related information across different modalities and data sets; in essence, it requires a computer to understand the different forms in which real-world semantics are expressed and how they interrelate. Based on these characteristics, the video retrieval method, apparatus, electronic equipment and computer-readable storage medium provided by the embodiments of the present invention adopt a cross-modal video retrieval technique oriented to the education and training industry, so that a user can quickly and accurately retrieve a matching video with a voice or text description, based on a guess at or memory of the content of the video being sought.
To facilitate understanding, the video retrieval method disclosed in this embodiment is first described in detail.
An embodiment of the present invention provides a video retrieval method, which is executed by an electronic device with data processing capability. Referring to FIG. 1, a schematic flowchart of a video retrieval method is shown; the method mainly includes the following steps S102 to S106:
step S102, retrieval content is obtained, and the type of the retrieval content comprises voice or text.
The retrieval content may take the form of a voice description or a text description. The user may input the retrieval content in either form based on a guess at, or memory of, the content of the video being sought. For example, suppose a teacher in a video once said: "Computer science and technology is a discipline that studies the theory, principles, methods and techniques of computer design and manufacture, and of information acquisition, representation, storage, processing and control using computers." The user can input at least part of this passage as a voice or text description, and the subsequent steps will quickly retrieve the target video that best matches it.
And step S104, converting the retrieval content into a query vector according to the pre-trained text feature encoder.
When the type of the retrieval content is voice, the retrieval content may first be converted into text form to obtain a retrieval text; the retrieval text is then input into the pre-trained text feature encoder to obtain the query vector output by the encoder. Specifically, teaching speech can be converted into a text sentence using existing automatic speech recognition technology, yielding the retrieval text.
When the type of the retrieval content is a text, the retrieval content can be directly input into a pre-trained text feature encoder to obtain a query vector output by the text feature encoder.
Step S106, determining a target video matching the retrieval content according to the query vector and the visual feature vector of each video in the retrieval library; each visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library with a self-supervised pre-trained visual encoder.
In some possible embodiments, the step S106 may be implemented by the following process: respectively calculating the cosine similarity of the query vector and the visual feature vector of each video in the retrieval library to obtain the similarity value of the retrieval content and each video; and determining a target video matched with the retrieval content according to the similarity value of the retrieval content and each video.
In a specific implementation, the dot product of the query vector and the visual feature vector of each video in the retrieval library can be computed to obtain the similarity value between the retrieval content and each video. Videos whose similarity value exceeds a preset threshold may be determined as target videos matching the retrieval content, or the videos with the preset number of largest similarity values may be so determined. The preset threshold and the preset number can be set according to actual requirements and are not limited here. For example, if the preset threshold is 80%, videos whose similarity value exceeds 80% may be determined as target videos; if the preset number is 3, the videos with the 3 largest similarity values may be determined as target videos.
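As an illustration of this matching step, the following minimal Python sketch ranks the retrieval library by cosine similarity; the function and variable names are illustrative assumptions, not part of the claimed method.

import numpy as np

def retrieve(query_vec: np.ndarray, gallery: np.ndarray, top_k: int = 3):
    """Rank the videos in the retrieval library by cosine similarity.

    query_vec: (D,) query vector from the text feature encoder.
    gallery:   (N, D) matrix of visual feature vectors, one row per video.
    Returns indices and similarity values of the top_k best matches.
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                       # (N,) similarity to every video
    order = np.argsort(-sims)[:top_k]  # indices of the top_k most similar
    return order, sims[order]

For the threshold variant described above, one would instead keep np.where(sims > 0.8)[0], i.e., every video whose similarity value exceeds the preset 80% threshold.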
According to the video retrieval method provided by the embodiment of the present invention, when video retrieval is performed, retrieval content is first obtained, the type of the retrieval content including voice or text; the retrieval content is converted into a query vector by the pre-trained text feature encoder; and a target video matching the retrieval content is determined according to the query vector and the visual feature vector of each video in the retrieval library, where each visual feature vector is extracted from the corresponding video in the retrieval library by the self-supervised pre-trained visual encoder. In this way, a user can quickly and accurately retrieve a matching video with a voice or text description, based on a guess at or memory of the content of the video being sought; compared with the conventional video retrieval approach based on manual text annotation, this improves accuracy and efficiency, reduces cost, and enhances the user experience.
The embodiment of the present invention further provides a process for training the encoders. Referring to the schematic flowchart of encoder training shown in FIG. 2, the visual encoder and the text feature encoder are trained through the following steps:
step S202, a training sample is obtained, where the training sample includes a batch of video data and at least one piece of text data.
Preparation of the training data set: a large number of learning and training videos can be collected from online education platforms, Internet search engines, social media and video sharing platforms, with a data scale on the order of 100,000 to 10,000,000 videos. When learning the visual encoder and the text feature encoder through self-supervised feature-fusion pre-training, one batch of video data (e.g., 1024 videos) is loaded from the training data set for each pre-training iteration.
The text data may be directly input or converted from voice data.
Step S204, for each piece of video data and each piece of text data, respectively inputting the video data and the text data into a visual encoder network and a text encoder network to obtain a first feature vector corresponding to the video data and a second feature vector corresponding to the text data.
Optionally, the visual encoder network may employ a 3D ResNet50 network. Referring to the schematic structural diagram of a text encoder network shown in FIG. 3, the text encoder network may include a word segmentation module, a Word2vec module, a linear layer and a maximum pooling layer connected in sequence. The text encoder network may process the text data as follows: for a text sentence (i.e., text data) preprocessed to delete punctuation and stop words, the word segmentation module cuts (or pads) the sentence to 16 words using a Chinese word segmentation tool; the pre-trained Word2vec module extracts a 300-dimensional feature vector for each of the 16 words; the 300-dimensional word vectors then pass through the linear layer and the maximum pooling layer in sequence, and a 2048-dimensional text feature vector is output. For alignment with the frame features of the visual modality, the text within, for example, 3.2 seconds can be concatenated in time order, word vectors obtained by segmentation for the words present in the dictionary, and finally averaged to yield a 300-dimensional text feature vector representing the text information within those 3.2 seconds.
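The text encoder just described can be sketched in PyTorch as follows; the word segmentation tool and the Word2vec lookup are stand-ins (e.g., jieba and a pre-trained gensim KeyedVectors object could fill these roles), and only the dimensions follow the text above.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: segment -> Word2vec -> linear layer -> max pooling."""

    def __init__(self, word_vectors, segment, max_words: int = 16):
        super().__init__()
        self.word_vectors = word_vectors   # dict-like: word -> 300-d vector
        self.segment = segment             # callable: sentence -> list of words
        self.max_words = max_words
        self.linear = nn.Linear(300, 2048) # lift word vectors to 2048-d

    def forward(self, sentence: str) -> torch.Tensor:
        words = self.segment(sentence)[: self.max_words]   # cut to 16 words
        vecs = [torch.as_tensor(self.word_vectors[w], dtype=torch.float32)
                for w in words if w in self.word_vectors]
        while len(vecs) < self.max_words:                  # pad with zero vectors
            vecs.append(torch.zeros(300))
        x = torch.stack(vecs[: self.max_words])            # (16, 300)
        x = self.linear(x)                                 # (16, 2048)
        return x.max(dim=0).values                         # max pool -> (2048,)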
Step S206, calculating the model loss according to the first feature vector and the second feature vector.
In some possible embodiments, the above step S206 may be implemented as follows: the first feature vector and the second feature vector are input into a nonlinear mapping module and a linear mapping module respectively, obtaining a third feature vector and a fourth feature vector of a preset dimension; the third feature vector and the fourth feature vector are then input into a cross-modal contrastive loss function to calculate the model loss. The preset dimension can be set according to actual requirements, for example, 521 dimensions.
Optionally, referring to the schematic structural diagram of a nonlinear mapping module shown in FIG. 4, the nonlinear mapping module may include a first linear layer, a first BN (Batch Normalization) layer, a ReLU (Rectified Linear Unit, an activation function commonly used in artificial neural networks) layer, a second linear layer and a second BN layer connected in sequence.
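A minimal PyTorch rendering of this module follows, with the layer order taken from FIG. 4; the width of the first linear layer is our assumption, since the patent does not specify it.

import torch.nn as nn

class NonLinearMapping(nn.Module):
    """Linear -> BN -> ReLU -> Linear -> BN, embedding 2048-d encoder
    features into the preset-dimension (here 521-d) fusion subspace."""

    def __init__(self, in_dim: int = 2048, out_dim: int = 521):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),   # hidden width assumed equal to in_dim
            nn.BatchNorm1d(in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
        )

    def forward(self, x):
        return self.net(x)

The linear mapping module for the text branch can analogously be a single nn.Linear(2048, 521) layer.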
Step S208, updating the network parameters of the visual encoder network and the network parameters of the text encoder network according to the model loss, so as to obtain the pre-trained visual encoder and text feature encoder.
Parameter optimization of the visual encoder network and the text encoder network may be performed using an Adam optimizer.
For ease of understanding, referring to another schematic flowchart of encoder training shown in FIG. 5: video data and text data are input into the 3D ResNet50 network and the text encoder network respectively, generating 2048-dimensional visual feature vectors and text feature vectors; the visual feature vector and the text feature vector are input into the nonlinear mapping module and the linear mapping module respectively, embedding the video and text modalities into a 521-dimensional feature fusion subspace; the visual features and text features in the feature fusion subspace are input into the cross-modal contrastive loss function to calculate the model loss, and the network parameters are updated by a back-propagation algorithm. An Adam optimizer with parameters β1 = 0.9, β2 = 0.999 and ε = 10^-8 can be used to optimize the 3D ResNet50 network and the text encoder network; training can run for 500,000 iterations with an initial learning rate of 0.001.
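The patent names a cross-modal contrastive loss without spelling out its exact form; the symmetric, InfoNCE-style loss below is one standard instantiation consistent with the description, and the commented Adam call uses the hyper-parameters given above (the encoder objects are placeholders).

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb, text_emb: (B, 521) features from the fusion subspace.
    Matched (video_i, text_i) pairs are positives; every other in-batch
    pairing is a negative. The temperature value is an assumption.
    """
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.T / temperature                 # (B, B) pairwise similarities
    labels = torch.arange(v.size(0), device=v.device)
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

# Optimizer with the hyper-parameters stated above; visual_encoder and
# text_encoder stand for the 3D ResNet50 and text encoder networks.
# optimizer = torch.optim.Adam(
#     list(visual_encoder.parameters()) + list(text_encoder.parameters()),
#     lr=0.001, betas=(0.9, 0.999), eps=1e-8)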
After training is finished, a video is input into the 3D ResNet50 visual encoder to obtain a 2048-dimensional visual feature vector; a text is input into the text encoder to obtain a 2048-dimensional text feature vector; and the similarity between the text and the video is obtained as the dot product of the two vectors.
When performing video retrieval, a retrieval library is first constructed: the visual feature vectors of all videos in the retrieval library are extracted with the self-supervised pre-trained visual encoder, and the extracted visual feature vectors are stored in a database. During online retrieval, if the user inputs voice, the voice is converted into text using an existing automatic speech recognition tool, and the text is converted into a 2048-dimensional query vector by the pre-trained text feature encoder; if the user inputs text, it is converted into a 2048-dimensional query vector directly by the pre-trained text feature encoder. Finally, the cosine similarity between the query vector and each visual feature vector in the retrieval library is calculated, and the target video matching the voice or text is returned according to the similarity.
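The offline indexing and online retrieval flow just described might look like the sketch below; asr, text_encoder and visual_encoder are placeholders for the tools named above (any existing automatic speech recognition tool and the pre-trained encoders), not fixed APIs.

import numpy as np

def build_index(videos, visual_encoder):
    """Offline stage: one L2-normalized 2048-d visual feature vector per
    video in the retrieval library, stored as an (N, 2048) matrix."""
    feats = np.stack([visual_encoder(v) for v in videos])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def search(user_input, text_encoder, index, asr=None, top_k=3):
    """Online stage: voice or text -> 2048-d query vector -> cosine ranking."""
    text = asr(user_input) if asr is not None else user_input  # voice branch
    q = text_encoder(text)                 # 2048-d query vector
    q = q / np.linalg.norm(q)
    sims = index @ q                       # cosine similarity to every video
    return np.argsort(-sims)[:top_k]       # indices of best-matching videos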
The embodiment of the present invention is based on a self-supervised contrastive learning framework: the co-occurrence of video and spoken teaching narration is used as a pretext task, visual features and text features are embedded into a low-dimensional shared subspace (i.e., the feature fusion subspace) for deep fusion, and a multi-modal contrastive loss is used to optimize the network parameters. After self-supervised contrastive learning is completed, a video is input into the 3D ResNet50 visual encoder to obtain a visual feature vector, the text converted from the teaching speech is input to obtain a text feature vector, and the similarity between the text and the video is given by the dot product of the two vectors.
Additionally, for video data, temporal and spatial average pooling can be applied at the last layer of the 3D ResNet50 network to obtain a single 2048-dimensional vector. During training, 32 frames may be sampled from the video at a rate of 10 fps, with an input resolution of 224 x 224. The following standard data augmentations can be used during training: random cropping, horizontal flipping, temporal sampling, scale jittering and color enhancement.
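Under the stated settings (32 frames at 10 fps, 224 x 224 inputs), the training-time clip sampling and augmentation could be sketched as follows; the torchvision pipeline is our assumption, chosen to match the augmentations listed above.

import torch
import torchvision.transforms as T

# Spatial augmentations: random crop with scale jittering to 224 x 224,
# horizontal flip, and color enhancement (jitter values are assumptions).
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
])

def sample_clip(frames: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """Temporal sampling: pick a random 32-frame window from a video that is
    assumed to be decoded at 10 fps into a (T, C, H, W) tensor, T >= 32."""
    start = torch.randint(0, max(1, frames.size(0) - num_frames + 1), (1,)).item()
    return train_transform(frames[start:start + num_frames])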
In summary, the video retrieval method provided by the embodiment of the present invention adopts a cross-modal video retrieval technique based on self-supervised learning and tailored to education and training videos, so that matching target videos can be retrieved quickly and accurately in the education and training industry; moreover, retrieval can be performed directly with a voice or text description, enhancing the user experience.
Corresponding to the above video retrieval method, an embodiment of the present invention further provides a video retrieval apparatus. Referring to the schematic structural diagram of the video retrieval apparatus shown in FIG. 6, the apparatus includes:
an obtaining module 62, configured to obtain retrieval content, where the type of the retrieval content includes voice or text;
a conversion module 64, configured to convert the retrieval content into a query vector according to the pre-trained text feature encoder;
a determining module 66, configured to determine, according to the query vector and the visual feature vector of each video in the retrieval library, a target video matching the retrieval content; each visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library with the self-supervised pre-trained visual encoder.
When performing video retrieval, the video retrieval apparatus provided by the embodiment of the present invention first acquires the retrieval content, the type of the retrieval content including voice or text; converts the retrieval content into a query vector by the pre-trained text feature encoder; and determines a target video matching the retrieval content according to the query vector and the visual feature vector of each video in the retrieval library, where each visual feature vector is extracted from the corresponding video in the retrieval library by the self-supervised pre-trained visual encoder. In this way, a user can quickly and accurately retrieve a matching video with a voice or text description, based on a guess at or memory of the content of the video being sought; compared with the conventional video retrieval approach based on manual text annotation, this improves accuracy and efficiency, reduces cost, and enhances the user experience.
Further, the conversion module 64 is specifically configured to: when the type of the retrieval content is voice, converting the retrieval content into a text form to obtain a retrieval text; and inputting the retrieval text into a pre-trained text feature encoder to obtain a query vector output by the text feature encoder.
Further, the determining module 66 is specifically configured to: respectively calculating the cosine similarity of the query vector and the visual feature vector of each video in the retrieval library to obtain the similarity value of the retrieval content and each video; and determining a target video matched with the retrieval content according to the similarity value of the retrieval content and each video.
Further, the apparatus further comprises a training module, wherein the training module is configured to: acquire a training sample, wherein the training sample comprises a batch of video data and at least one piece of text data; for each piece of video data and each piece of text data, respectively input the video data and the text data into a visual encoder network and a text encoder network to obtain a first feature vector corresponding to the video data and a second feature vector corresponding to the text data; calculate a model loss according to the first feature vector and the second feature vector; and update the network parameters of the visual encoder network and the network parameters of the text encoder network according to the model loss, so as to obtain the pre-trained visual encoder and text feature encoder.
Further, the training module is specifically configured to: inputting the first feature vector and the second feature vector into a nonlinear mapping module and a linear mapping module respectively to obtain a third feature vector and a fourth feature vector of a preset dimension; and inputting the third feature vector and the fourth feature vector into a cross-modal contrast loss function, and calculating to obtain model loss.
Further, the training module is specifically configured to: parameter optimization of the visual encoder network and the text encoder network was performed using an Adam optimizer.
Further, the visual encoder network comprises a 3D ResNet50 network; the text encoder network comprises a Word segmentation module, a Word2vec module, a linear layer and a maximum pooling layer which are connected in sequence.
The implementation principle and technical effects of the apparatus provided by this embodiment are the same as those of the foregoing method embodiments; for brevity, where the apparatus embodiment is silent, reference may be made to the corresponding content in the method embodiments.
Referring to FIG. 7, an embodiment of the present invention further provides an electronic device 100, including: a processor 70, a memory 71, a bus 72 and a communication interface 73, where the processor 70, the communication interface 73 and the memory 71 are connected through the bus 72; the processor 70 is configured to execute executable modules, such as computer programs, stored in the memory 71.
The memory 71 may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 73 (wired or wireless), which may use the Internet, a wide area network, a local area network, a metropolitan area network or the like.
The bus 72 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The memory 71 is configured to store a program, and the processor 70 executes the program after receiving an execution instruction, and the method executed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 70, or implemented by the processor 70.
The processor 70 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits in hardware or by instructions in the form of software in the processor 70. The processor 70 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may thereby be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers or other storage media well known in the art. The storage medium is located in the memory 71, and the processor 70 reads the information in the memory 71 and completes the steps of the above method in combination with its hardware.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored; when the computer program is executed by a processor, the video retrieval method described in the foregoing method embodiments is performed. The computer-readable storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk or an optical disk.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for video retrieval, comprising:
acquiring retrieval content, wherein the type of the retrieval content comprises voice or text;
converting the retrieval content into a query vector according to a pre-trained text feature encoder;
determining a target video matched with the retrieval content according to the query vector and the visual feature vector of each video in a retrieval library; wherein the visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library through a self-supervised pre-trained visual encoder.
2. The video retrieval method of claim 1, wherein the step of converting the retrieval content into a query vector according to a pre-trained text feature encoder comprises:
when the type of the retrieval content is voice, converting the retrieval content into a text form to obtain a retrieval text;
and inputting the retrieval text into a pre-trained text feature encoder to obtain a query vector output by the text feature encoder.
3. The video retrieval method of claim 1, wherein the step of determining a target video matched with the retrieval content according to the query vector and the visual feature vector of each video in the retrieval library comprises:
respectively calculating cosine similarity between the query vector and the visual feature vector of each video in the retrieval library to obtain a similarity value between the retrieval content and each video;
and determining a target video matched with the retrieval content according to the similarity value of the retrieval content and each video.
4. The video retrieval method of claim 1, wherein the method further comprises:
acquiring a training sample, wherein the training sample comprises a batch of video data and at least one piece of text data;
for each piece of video data and each piece of text data, respectively inputting the video data and the text data into a visual encoder network and a text encoder network to obtain a first feature vector corresponding to the video data and a second feature vector corresponding to the text data;
calculating a model loss according to the first feature vector and the second feature vector;
and updating the network parameters of the visual encoder network and the network parameters of the text encoder network according to the model loss so as to obtain a pre-trained visual encoder and a text feature encoder.
5. The video retrieval method of claim 4, wherein the step of calculating a model loss based on the first feature vector and the second feature vector comprises:
inputting the first feature vector and the second feature vector into a nonlinear mapping module and a linear mapping module respectively to obtain a third feature vector and a fourth feature vector of a preset dimension;
and inputting the third feature vector and the fourth feature vector into a cross-modal contrastive loss function to calculate the model loss.
6. The video retrieval method of claim 4, wherein the step of updating the network parameters of the visual encoder network and the network parameters of the text encoder network comprises:
performing parameter optimization on the visual encoder network and the text encoder network using an Adam optimizer.
7. The video retrieval method of claim 4, wherein the visual encoder network comprises a 3D ResNet50 network; the text encoder network comprises a Word segmentation module, a Word2vec module, a linear layer and a maximum pooling layer which are sequentially connected.
8. A video retrieval apparatus, comprising:
the acquisition module is used for acquiring retrieval contents, and the types of the retrieval contents comprise voice or text;
the conversion module is used for converting the retrieval content into a query vector according to a pre-trained text feature encoder;
the determining module is used for determining a target video matched with the retrieval content according to the query vector and the visual feature vector of each video in the retrieval library; wherein the visual feature vector is obtained by performing feature extraction on the corresponding video in the retrieval library through a self-supervised pre-trained visual encoder.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, is adapted to carry out the method of any one of claims 1-7.
CN202111055136.9A 2021-09-09 2021-09-09 Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium Pending CN114090823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111055136.9A CN114090823A (en) 2021-09-09 2021-09-09 Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111055136.9A CN114090823A (en) 2021-09-09 2021-09-09 Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114090823A true CN114090823A (en) 2022-02-25

Family

ID=80296392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111055136.9A Pending CN114090823A (en) 2021-09-09 2021-09-09 Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114090823A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550098A (en) * 2022-02-28 2022-05-27 山东大学 Examination room monitoring video abnormal behavior detection method and system based on contrast learning
CN114880517A (en) * 2022-05-27 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for video retrieval
CN116166843A (en) * 2023-03-02 2023-05-26 北京中科闻歌科技股份有限公司 Text video cross-modal retrieval method and device based on fine granularity perception
CN116186330A (en) * 2023-04-23 2023-05-30 之江实验室 Video deduplication method and device based on multi-mode learning
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN117576785A (en) * 2024-01-15 2024-02-20 杭州巨岩欣成科技有限公司 Swim guest behavior detection method and device, computer equipment and storage medium
CN117612215A (en) * 2024-01-23 2024-02-27 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval
WO2024051730A1 (en) * 2022-09-07 2024-03-14 华为技术有限公司 Cross-modal retrieval method and apparatus, device, storage medium, and computer program

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550098A (en) * 2022-02-28 2022-05-27 山东大学 Examination room monitoring video abnormal behavior detection method and system based on contrast learning
CN114880517A (en) * 2022-05-27 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for video retrieval
WO2024051730A1 (en) * 2022-09-07 2024-03-14 华为技术有限公司 Cross-modal retrieval method and apparatus, device, storage medium, and computer program
CN116166843B (en) * 2023-03-02 2023-11-07 北京中科闻歌科技股份有限公司 Text video cross-modal retrieval method and device based on fine granularity perception
CN116166843A (en) * 2023-03-02 2023-05-26 北京中科闻歌科技股份有限公司 Text video cross-modal retrieval method and device based on fine granularity perception
CN116186330B (en) * 2023-04-23 2023-07-11 之江实验室 Video deduplication method and device based on multi-mode learning
CN116186330A (en) * 2023-04-23 2023-05-30 之江实验室 Video deduplication method and device based on multi-mode learning
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN117576785A (en) * 2024-01-15 2024-02-20 杭州巨岩欣成科技有限公司 Swim guest behavior detection method and device, computer equipment and storage medium
CN117576785B (en) * 2024-01-15 2024-04-16 杭州巨岩欣成科技有限公司 Swim guest behavior detection method and device, computer equipment and storage medium
CN117612215A (en) * 2024-01-23 2024-02-27 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval
CN117612215B (en) * 2024-01-23 2024-04-26 南京中孚信息技术有限公司 Identity recognition method, device and medium based on video retrieval

Similar Documents

Publication Publication Date Title
CN114090823A (en) Video retrieval method, video retrieval device, electronic equipment and computer-readable storage medium
CN110119786B (en) Text topic classification method and device
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN107391505B (en) Image processing method and system
CN108108426B (en) Understanding method and device for natural language question and electronic equipment
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN111368049A (en) Information acquisition method and device, electronic equipment and computer readable storage medium
CN110083729B (en) Image searching method and system
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN107679070B (en) Intelligent reading recommendation method and device and electronic equipment
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN114595327A (en) Data enhancement method and device, electronic equipment and storage medium
CN113806588A (en) Method and device for searching video
CN111651674A (en) Bidirectional searching method and device and electronic equipment
CN111897953A (en) Novel network media platform comment text classification annotation data correction method
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN117131155A (en) Multi-category identification method, device, electronic equipment and storage medium
CN116662495A (en) Question-answering processing method, and method and device for training question-answering processing model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination