CN113656639A - Video retrieval method and device, computer-readable storage medium and electronic equipment

Info

Publication number
CN113656639A
CN113656639A (application CN202110962699.XA)
Authority
CN
China
Prior art keywords
video
retrieved
key frame
feature
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110962699.XA
Other languages
Chinese (zh)
Inventor
Zhou Mi
Tang Jingqun
Jiang Bo
Hu Guanglong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202110962699.XA priority Critical patent/CN113656639A/en
Publication of CN113656639A publication Critical patent/CN113656639A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/735 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/40 - Image enhancement or restoration using histogram techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/90 - Determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure relate to the field of computer technologies, and in particular to a video retrieval method and apparatus, a computer-readable storage medium, and an electronic device. The method comprises the following steps: acquiring a video to be retrieved, extracting the key frames included in the video to be retrieved, and obtaining a plurality of key frame images from those key frames; inputting the plurality of key frame images into a video fingerprint extraction model to obtain the video level features and video segment level features corresponding to the video to be retrieved, and obtaining the video fingerprint of the video to be retrieved from the video level features and the video segment level features; and retrieving with the video fingerprint to obtain videos similar to the video to be retrieved. The method can reduce the storage cost of video fingerprints and accelerate retrieval of the video to be retrieved.

Description

Video retrieval method and device, computer-readable storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a video retrieval method and apparatus, a computer-readable storage medium, and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims and the description herein is not admitted to be prior art by inclusion in this section.
In recent years, Internet-related industries have developed rapidly. As a vivid and intuitive way of expressing information, video is favored by more and more people, and video application scenarios on the Internet keep multiplying. However, the flourishing of the video industry also poses a huge challenge to the protection of original videos: as the scale of video resources gradually expands, it becomes impractical to find duplicate or infringing videos among massive videos manually. In current technology, video retrieval may include: extracting video key frame images, taking the features of each key frame image as the video fingerprint, and retrieving by means of the video fingerprint. During retrieval, the video fingerprint of the video to be queried is compared with the video fingerprint of each video in the video library, so as to determine the videos similar to the video to be retrieved.
Disclosure of Invention
However, in the current video retrieval method, on the one hand, the features of all key frame images of the video to be retrieved are used as the video fingerprint, so the fingerprint is monotonous and contains a large number of features that must all be stored and compared against the features of the videos in the video database; on the other hand, during retrieval the features of all key frame images of the video to be retrieved need to be compared with all the features of all the videos in the video database, which reduces the retrieval speed.
Therefore, an improved video retrieval method and apparatus, a computer-readable storage medium and an electronic device are needed, so as to provide a video retrieval method that can reduce the storage cost of video fingerprints and increase the retrieval speed for the video to be retrieved.
In this context, embodiments of the present disclosure are intended to provide a video retrieval method and apparatus, a computer-readable storage medium, and an electronic device.
According to an aspect of the present disclosure, there is provided a video retrieval method, including:
acquiring a video to be retrieved, extracting key frames included in the video to be retrieved, and obtaining a plurality of key frame images according to the key frames;
inputting the plurality of key frame images into a video fingerprint extraction model to obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtaining the video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics;
and retrieving by using the video fingerprints to obtain a video similar to the video to be retrieved.
In an exemplary embodiment of the present disclosure, the video fingerprint extraction model includes at least a shot boundary prediction model.
In an exemplary embodiment of the present disclosure, the video fingerprint extraction model further includes a key frame image feature extraction model, a time sequence model, and a video feature extraction model.
In an exemplary embodiment of the present disclosure, inputting the key frame image into the video fingerprint extraction model to obtain a video level feature and a video segment level feature corresponding to a video to be retrieved, includes:
inputting the plurality of key frame images into the key frame image feature extraction model, and obtaining key frame image features corresponding to the key frame images through a depth feature network in the key frame image feature extraction model;
inputting the key frame image features into the time sequence model, and learning the key frame image features to obtain time sequence features corresponding to the key frame image features;
inputting the time sequence characteristics into the video characteristic extraction model to obtain video level characteristics corresponding to the video to be retrieved;
and inputting the time sequence characteristics into the shot boundary prediction model to obtain segment sub-characteristics, and inputting the segment sub-characteristics into the video characteristic extraction model to obtain video segment level characteristics corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, the video retrieval method further includes:
calculating color histograms of the plurality of key frame images to obtain color histogram features corresponding to the key frame images;
splicing the color histogram feature of each key frame image and the key frame image feature to obtain a splicing feature corresponding to each key frame image;
inputting the splicing features into the time sequence model, and learning the splicing features to obtain first time sequence features corresponding to the splicing features;
inputting the first time sequence feature into the video feature extraction model to obtain a video level feature corresponding to the video to be retrieved;
and inputting the first time sequence feature into the shot boundary prediction model to obtain a first segment sub-feature, and inputting the first segment sub-feature into the video feature extraction model to obtain a video segment level feature corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, inputting the key frame image features into the time sequence model, and learning the key frame image features to obtain time sequence features corresponding to the key frame image features, includes:
multiplying each key frame image feature by the initialization matrices of the time sequence model to obtain a query vector, a key vector and a value vector corresponding to each key frame image feature, wherein the initialization matrices comprise three matrices;
calculating the correlation between each query vector corresponding to the key frame image features and the key vector through a similarity function;
and obtaining the time sequence characteristics corresponding to the key frame image characteristics according to the correlation and the value vector corresponding to each key frame image characteristic.
In an exemplary embodiment of the present disclosure, inputting the time-series feature into the video feature extraction model to obtain a video-level feature corresponding to the video to be retrieved includes:
splicing the time sequence characteristics corresponding to the key frame images to obtain first spliced time sequence characteristics;
and inputting the first spliced time sequence feature into the video feature extraction model, and pooling and normalizing the first spliced time sequence feature to obtain a video level feature corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, inputting the time sequence feature into the shot boundary prediction model to obtain a segment sub-feature, and inputting the segment sub-feature into the video feature extraction model to obtain a video segment-level feature corresponding to the video to be retrieved, includes:
converting the plurality of time sequence features into a plurality of first probability values through an activation function in the shot boundary prediction model, and, when any one of the first probability values is determined to be greater than a preset probability value, taking the key frame image corresponding to that first probability value as a shot boundary of the video to be retrieved;
converting the time sequence characteristics into segment sub-characteristics according to the shot boundary of the video to be retrieved;
and inputting the sub-features of the segments into the video feature extraction model to obtain video segment-level features corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, converting the time-series feature into a segment sub-feature according to a shot boundary of the video to be retrieved includes:
forming a characteristic sequence by the time sequence characteristics corresponding to the key frame images;
acquiring a shot boundary of the video to be retrieved, which is included in a key frame image of the video to be retrieved, and a first position of a time sequence feature corresponding to the shot boundary in the feature sequence;
and acquiring the time sequence characteristics of the non-shot boundary included before the first position, and splicing the time sequence characteristics of the non-shot boundary and the time sequence characteristics corresponding to the shot boundary to obtain the sub-characteristics of the segment.
In an exemplary embodiment of the present disclosure, the video retrieval method further includes:
analyzing the video to be retrieved through an encoding and decoding protocol to obtain a key frame included in the video to be retrieved;
and extracting a key frame image included in the video to be retrieved according to the key frame.
In an exemplary embodiment of the present disclosure, retrieving by using the video fingerprint to obtain a video similar to the video to be retrieved, includes:
acquiring video level characteristics corresponding to videos included in a video database, and constructing a search tree according to the video level characteristics included in the video database;
searching in the search tree by using the video level features included in the video fingerprint to obtain a plurality of candidate videos similar to the video to be retrieved;
and acquiring the video segment level characteristics of the candidate video, and acquiring the video and/or the video segment most similar to the video to be retrieved through the video segment level characteristics corresponding to the video to be retrieved and the video segment level characteristics of the candidate video.
According to an aspect of the present disclosure, there is provided a video retrieval apparatus including:
the key frame image acquisition module is used for acquiring a video to be retrieved, extracting key frames included in the video to be retrieved and obtaining a plurality of key frame images according to the key frames;
the video fingerprint acquisition module is used for inputting the plurality of key frame images into a video fingerprint extraction model to obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtaining the video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics;
and the video retrieval module is used for retrieving by utilizing the video fingerprints to obtain videos similar to the videos to be retrieved.
In an exemplary embodiment of the present disclosure, the video fingerprint extraction model includes at least a shot boundary prediction model.
In an exemplary embodiment of the present disclosure, the video fingerprint extraction model further includes a key frame image feature extraction model, a time sequence model, and a video feature extraction model.
In an exemplary embodiment of the present disclosure, the video fingerprint acquisition module includes:
a key frame image feature obtaining module, configured to input the plurality of key frame images into the key frame image feature extraction model, and obtain, through a depth feature network in the key frame image feature extraction model, a key frame image feature corresponding to each of the key frame images;
the time sequence characteristic acquisition module is used for inputting the key frame image characteristics into the time sequence model and learning the key frame image characteristics to obtain time sequence characteristics corresponding to the key frame image characteristics;
the video level characteristic acquisition module is used for inputting the time sequence characteristics into the video characteristic extraction model to obtain video level characteristics corresponding to the video to be retrieved;
and the video segment level feature acquisition module is used for inputting the time sequence features into the shot boundary prediction model to obtain segment sub-features, and inputting the segment sub-features into the video feature extraction model to obtain video segment level features corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, the video retrieval apparatus further includes:
the color histogram feature extraction module is used for calculating color histograms of the plurality of key frame images to obtain color histogram features corresponding to the key frame images;
the splicing feature acquisition module is used for splicing the color histogram feature of each key frame image and the key frame image feature to obtain a splicing feature corresponding to each key frame image;
the first time sequence feature acquisition module is used for inputting the splicing features into the time sequence model, learning the splicing features and obtaining first time sequence features corresponding to the splicing features;
the video level characteristic acquisition module is used for inputting the first time sequence characteristic into the video characteristic extraction model to obtain a video level characteristic corresponding to the video to be retrieved;
and the video segment level feature acquisition module is used for inputting the first time sequence feature into the shot boundary prediction model to obtain a first segment sub-feature, and inputting the first segment sub-feature into the video feature extraction model to obtain a video segment level feature corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, the timing characteristic obtaining module includes:
the key frame image characteristic initial processing module is used for multiplying each key frame image feature by the initialization matrices of the time sequence model to obtain a query vector, a key vector and a value vector corresponding to each key frame image feature, wherein the initialization matrices comprise three matrices;
a correlation calculation module, configured to calculate, through a similarity function, a correlation between each of the query vectors corresponding to the key frame image features and the key vector;
and the time sequence characteristic acquisition module is used for acquiring the time sequence characteristics corresponding to the key frame image characteristics according to the correlation and the value vector corresponding to each key frame image characteristic.
In an exemplary embodiment of the disclosure, the video level feature obtaining module includes:
the first splicing time sequence characteristic determining module is used for splicing the time sequence characteristics corresponding to the key frame images to obtain first splicing time sequence characteristics;
and the video level characteristic acquisition module is used for inputting the first spliced time sequence feature into the video feature extraction model, and pooling and normalizing the first spliced time sequence feature to obtain a video level feature corresponding to the video to be retrieved.
In an exemplary embodiment of the present disclosure, the video segment level feature obtaining module includes:
a shot boundary determining module, configured to convert the plurality of time sequence features into a plurality of first probability values through an activation function in the shot boundary prediction model, and when it is determined that any one of the first probability values is greater than a preset probability value, take a key frame image corresponding to any one of the first probability values as a shot boundary of the video to be retrieved;
the segment sub-feature acquisition module is used for converting the time sequence feature into segment sub-features according to the shot boundary of the video to be retrieved;
and the video segment level feature determination module is used for inputting the segment sub-features into the video feature extraction model to obtain the video segment level features corresponding to the video to be retrieved.
In an exemplary embodiment of the disclosure, the segment sub-feature obtaining module includes:
the characteristic sequence construction module is used for constructing the time sequence characteristics corresponding to the key frame images into a characteristic sequence;
the first position determining module is used for acquiring a shot boundary of the video to be retrieved, which is included in a key frame image of the video to be retrieved, and a first position of a time sequence feature corresponding to the shot boundary in the feature sequence;
and the segment sub-feature acquisition module is used for acquiring the time sequence feature of the non-shot boundary included before the first position, and splicing the time sequence feature of the non-shot boundary and the time sequence feature corresponding to the shot boundary to obtain the segment sub-feature.
In an exemplary embodiment of the present disclosure, the video retrieval apparatus further includes:
the key frame determining module is used for analyzing the video to be retrieved through an encoding and decoding protocol to obtain key frames included in the video to be retrieved;
and the key frame image extraction module is used for extracting the key frame images included in the video to be retrieved according to the key frames.
In an exemplary embodiment of the present disclosure, the video retrieval module includes:
the search tree construction module is used for acquiring video level characteristics corresponding to videos in a video database and constructing a search tree according to the video level characteristics in the video database;
the candidate video determining module is used for searching in the search tree by using the video level features included in the video fingerprint to obtain a plurality of candidate videos similar to the video to be retrieved;
and the video retrieval module is used for acquiring the video segment level characteristics of the candidate video and obtaining the video and/or the video segment most similar to the video to be retrieved through the video segment level characteristics corresponding to the video to be retrieved and the video segment level characteristics of the candidate video.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the above-described video retrieval method.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the video retrieval methods described above via execution of the executable instructions.
According to the video retrieval method of the embodiments of the present disclosure, after the key frame images of the video to be retrieved are acquired, the key frame images are input into the video fingerprint extraction model to obtain the video level feature and the video segment level features of the video to be retrieved, and retrieval with these features yields videos similar to the video to be retrieved. Using the video level feature and the video segment level features of the video to be retrieved as its video fingerprint increases the diversity of the video fingerprint; at the same time, since retrieval is performed only with the video level feature and the video segment level features, comparing the features of every key frame image of the video to be retrieved with all the features of all the videos in the video fingerprint database is avoided, which increases the retrieval speed.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically shows a flow chart of a video retrieval method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of deriving a plurality of key frame images from key frames according to an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of a video fingerprint extraction model according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for inputting key frame images into a video fingerprint extraction model to obtain video level features and video segment level features corresponding to a video to be retrieved according to an embodiment of the disclosure;
FIG. 5 schematically shows a flow chart of a video retrieval method according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flowchart of a method for inputting key frame image features into a timing model to obtain timing features corresponding to the key frame image features, according to an embodiment of the disclosure;
FIG. 7 is a flow chart schematically illustrating a method for inputting temporal features into a video feature extraction model to obtain video-level features corresponding to a video to be retrieved according to an embodiment of the present disclosure;
FIG. 8 is a flow chart schematically illustrating a method for inputting segment sub-features into a video feature extraction model to obtain video segment-level features corresponding to a video to be retrieved, according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a method for converting temporal features into segment sub-features according to shot boundaries of a video to be retrieved, according to an embodiment of the present disclosure;
FIG. 10 is a flow chart of a method for retrieving videos similar to a video to be retrieved by using video fingerprints according to an embodiment of the disclosure;
fig. 11 schematically shows a block diagram of a video retrieval apparatus according to an embodiment of the present disclosure;
FIG. 12 shows a schematic diagram of a computer-readable storage medium in accordance with embodiments of the present disclosure;
FIG. 13 schematically illustrates a block diagram of an electronic device in accordance with the disclosed embodiments.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, a video retrieval method, a video retrieval apparatus, a computer-readable storage medium, and an electronic device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of the Invention
In the related art, feature extraction from key frame images can be roughly divided into two categories: traditional image algorithms and deep learning algorithms. Feature extraction based on traditional image algorithms can further be divided into methods based on global features and methods based on local features. In global-feature methods, the global feature of the video to be retrieved is obtained by compressing the global features of the images; in local-feature methods, the local features of the video to be retrieved are obtained by extracting salient interest points and boundary-region features from the video to be retrieved, where such features may include Harris corners, scale-invariant features (SIFT) and speeded-up robust features (SURF). In deep learning algorithms, the features of the key frame images are extracted by a neural network. After the features of the key frame images are extracted, they are compared with the video fingerprints of the videos in the video library to obtain the videos similar to the video to be retrieved. However, compared with deep learning algorithms, the features extracted by traditional image algorithms have poor robustness and generality, while the features extracted by existing deep learning algorithms lose the temporal relationship between the key frame images, so their representation capability is insufficient; moreover, because this retrieval method compares features frame by frame, its retrieval speed is very slow.
In view of the above, the basic idea of the present disclosure is: according to the video retrieval method and the video retrieval device, after the key frame image of the video to be retrieved is acquired, the key frame image is input into the video fingerprint extraction model, the video level feature and the video segment level feature corresponding to the video to be retrieved are obtained, retrieval is carried out through the video level feature and the video segment level feature, and then the video similar to the video to be retrieved is obtained.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Exemplary method
A video retrieval method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1.
Referring to fig. 1, the video retrieval method may include the steps of:
s1, acquiring a video to be retrieved, extracting key frames included in the video to be retrieved, and obtaining a plurality of key frame images according to the key frames;
s2, inputting the plurality of key frame images into a video fingerprint extraction model to obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtaining the video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics;
and S3, retrieving by using the video fingerprints to obtain a video similar to the video to be retrieved.
In the video retrieval method of the embodiments of the present disclosure, after the key frame images of the video to be retrieved are acquired, the key frame images are input into the video fingerprint extraction model to obtain the video level feature and the video segment level features of the video to be retrieved, and retrieval with these features yields videos similar to the video to be retrieved. Because retrieval uses only the video level feature and the video segment level features, the number of features included in the video fingerprint is reduced, which lowers the storage cost of video fingerprints; at the same time, comparing the features of every key frame image of the video to be retrieved with all the features in the video fingerprint library is avoided, which increases the retrieval speed.
In step S1, a video to be retrieved is acquired, a key frame included in the video to be retrieved is extracted, and a plurality of key frame images are obtained according to the key frame.
In an exemplary embodiment of the present disclosure, the above video retrieval method may be applied on a server side that performs video retrieval. The video to be retrieved may be a variety-show clip, an entertainment video, or a self-media video, which is not specifically limited in this exemplary embodiment. A key frame, i.e. an I-frame, of the video to be retrieved carries a large amount of information and is the most important frame type in inter-frame compression coding. A frame may be a key frame when the image background changes compared with the previous frame, or when an entity included in the image changes compared with the previous frame. The entity may be a stationary object, an animal, or a game character, and is not specifically limited in this exemplary embodiment; a change of entity may be a change in the number, position, or type of the entities compared with the previous frame, which is likewise not specifically limited here. The key frames of the video to be retrieved may be extracted by an open-source video codec tool such as FFmpeg or OpenCV, and the choice of tool is not specifically limited in this exemplary embodiment. Specifically, referring to fig. 2, obtaining a plurality of key frame images from the key frames may include steps S21 and S22:
s21, analyzing the video to be retrieved through an encoding and decoding protocol to obtain a key frame included in the video to be retrieved;
and S22, extracting a key frame image included in the video to be retrieved according to the key frame.
Hereinafter, steps S21 and S22 will be explained. Specifically, the video to be retrieved is first parsed according to its codec protocol to obtain the key frames it includes and the time points at which those key frames appear in the video, and the key frame images are then extracted according to the key frames and their time points.
For example, when the video to be retrieved is a sports video, parsing it according to the codec protocol may yield key frames in which the posture of the moving subject changes compared with the previous frame, the relative position of the moving subject changes, or the background changes. After the key frames are obtained, the time points at which they appear in the video are determined, and the key frame images are extracted according to those time points.
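As a concrete illustration of steps S21 and S22, below is a minimal Python sketch of I-frame extraction with FFmpeg, one of the open-source codec tools named above. It is a sketch under assumptions: the ffprobe JSON fields shown, the output file pattern, and the helper name extract_key_frames are illustrative and not part of the disclosure.

```python
# Minimal sketch; assumes the ffmpeg/ffprobe binaries are installed.
import json
import subprocess

def extract_key_frames(video_path: str, out_pattern: str = "keyframe_%04d.png"):
    # Parse the video stream and report the picture type and timestamp of
    # every frame (step S21: obtain key frames and their time points).
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_frames", "-show_entries", "frame=pict_type,pts_time",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True)
    frames = json.loads(probe.stdout)["frames"]
    # Time points at which key frames (I-frames) appear in the video.
    key_times = [f.get("pts_time") for f in frames if f.get("pict_type") == "I"]
    # Decode only the I-frames into key frame images (step S22).
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "select='eq(pict_type,I)'",
         "-vsync", "vfr", out_pattern],
        check=True)
    return key_times
```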
In step S2, the plurality of key frame images are input to a video fingerprint extraction model, so as to obtain video level features and video segment level features corresponding to a video to be retrieved, and obtain video fingerprints of the video to be retrieved according to the video level features and the video segment level features.
In an exemplary embodiment of the present disclosure, the video fingerprint extraction model includes at least a shot boundary prediction model. The shot boundary prediction model performs shot boundary detection on the video to be retrieved, identifying which of the key frame images input into the video fingerprint extraction model belong to shot boundaries. An image may be a shot boundary when the background image of the current frame changes compared with the previous frame, or when the subject in the image of the current frame changes compared with the previous frame; the background image may be a game scene or another scene, and the subject may be a game character in a game scene or a subject in another scene, neither of which is specifically limited in this exemplary embodiment. Shot boundary detection divides the video to be retrieved into a plurality of sub-segments: the features input into the shot boundary prediction model are re-spliced according to the detected boundaries to obtain segment sub-features. In this way the shot boundary prediction model reduces the number of features and lowers their storage cost.
Based on this, in the exemplary embodiment of the present disclosure, as shown in fig. 3, the video fingerprint extraction model may further include a key frame image feature extraction model 301, a time sequence model 302, and a video feature extraction model 303. The key frame image feature extraction model 301 is a deep feature network composed of multiple convolutional layers; the backbone may be ResNet or VGG, which is not specifically limited in this exemplary embodiment. It extracts the bottom-layer features and high-layer semantic features of the key frame images input into the video fingerprint extraction model: its input is the key frame images and its output is the key frame image features. The time sequence model 302 is a self-attention model whose input is the key frame image features and whose output is the time sequence features; it models the key frame image features over time, improving the robustness of the features. The video feature extraction model 303 obtains the video fingerprint of the video to be retrieved, i.e. the video level feature and the video segment level features: when its output is the video level feature, its input is the time sequence features output by the time sequence model; when its output is the video segment level features, its input is the segment sub-features output by the shot boundary prediction model.
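To make the wiring of fig. 3 concrete, below is a schematic PyTorch sketch of how the sub-models could be connected. Every class name, dimension and layer choice is an illustrative assumption (the disclosure fixes only the roles of the sub-models and mentions ResNet or VGG as possible backbones); this is not the patented implementation.

```python
# Schematic sketch of the fig. 3 pipeline; all sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as tvm

class VideoFingerprintExtractor(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Key frame image feature extraction model 301: a deep feature
        # network (ResNet-18 backbone chosen here for illustration).
        backbone = tvm.resnet18(weights=None)
        self.frame_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Time sequence model 302: self-attention over the frame features.
        self.temporal = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Shot boundary prediction model: fully connected layer whose output
        # is turned into a per-frame boundary probability.
        self.boundary_head = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor):
        # frames: (N, 3, H, W), the key frame images of one video.
        x = self.frame_encoder(frames).flatten(1).unsqueeze(0)  # (1, N, dim)
        ts, _ = self.temporal(x, x, x)  # time sequence features, (1, N, dim)
        # Video level feature: pool the time sequence features and normalize
        # (the role of the video feature extraction model 303).
        video_feat = nn.functional.normalize(ts.mean(dim=1), dim=-1)
        # Per-frame shot boundary probabilities via an activation function.
        boundary_prob = torch.sigmoid(self.boundary_head(ts)).squeeze(-1)
        return video_feat, boundary_prob
```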
In an exemplary implementation method of the present disclosure, referring to fig. 4, inputting the key frame image into the video fingerprint extraction model to obtain a video level feature and a video clip level feature corresponding to a video to be retrieved, may include steps S41-S44:
s41, inputting the plurality of key frame images into the key frame image feature extraction model, and obtaining key frame image features corresponding to the key frame images through a depth feature network in the key frame image feature extraction model;
s42, inputting the key frame image characteristics into the time sequence model, and learning the key frame image characteristics to obtain time sequence characteristics corresponding to the key frame image characteristics;
s43, inputting the time sequence characteristics into the video characteristic extraction model to obtain video level characteristics corresponding to the video to be retrieved;
and S44, inputting the time sequence characteristics into the shot boundary prediction model to obtain segment sub-characteristics, and inputting the segment sub-characteristics into the video characteristic extraction model to obtain video segment level characteristics corresponding to the video to be retrieved.
Hereinafter, steps S41 to S44 will be explained and explained. Specifically, firstly, inputting a key frame image into a key frame image feature extraction model in a video fingerprint extraction model, and extracting key frame image features of the key frame image through a depth feature network included in the key frame image feature extraction model; then, inputting the characteristics of the key frame images into a time sequence model, and learning the characteristics of the key frame images through the time sequence model to obtain the time sequence characteristics corresponding to the key frame images; finally, splicing a plurality of time sequence characteristics into one characteristic and inputting the characteristic into a video characteristic extraction model to obtain the video level characteristics of the video to be retrieved; and inputting the time sequence characteristics into the shot boundary prediction model to obtain segment sub-characteristics of the video to be retrieved, and inputting the obtained segment sub-characteristics into the video characteristic extraction model to obtain video segment level characteristics of the video to be retrieved. After the key frame image features are extracted, the key frame image features are modeled in a time sequence through a time sequence model, the robustness of the features is improved, the time sequence features are spliced into one feature, the video level features which can represent the whole video to be retrieved are obtained through the video feature extraction model, and the diversity of video fingerprints is improved.
In an exemplary embodiment of the present disclosure, the bottom-layer features and high-layer features of a key frame image may be obtained through the key frame image feature extraction model; these are the semantic features extracted by the deep feature network. The bottom-layer features may include contour, edge, color, texture and shape; they carry little semantic information but locate targets accurately. The high-layer features are rich in semantic information and highly discriminative, but locate targets only coarsely. To improve the robustness of the video fingerprint representation, a global feature of each key frame image, namely the color histogram feature, may additionally be acquired before the time sequence features are learned. Specifically, as shown in fig. 5, the video retrieval method further includes steps S51-S55:
s51, calculating color histograms of the plurality of key frame images to obtain color histogram features corresponding to the key frame images;
s52, splicing the color histogram characteristics of each key frame image and the key frame image characteristics to obtain splicing characteristics corresponding to each key frame image;
s53, inputting the splicing characteristics into the time sequence model, and learning the splicing characteristics to obtain first time sequence characteristics corresponding to the splicing characteristics;
s54, inputting the first time sequence feature into the video feature extraction model to obtain a video level feature corresponding to the video to be retrieved;
and S55, inputting the first time sequence feature into the shot boundary prediction model to obtain a first segment sub-feature, and inputting the first segment sub-feature into the video feature extraction model to obtain a video segment level feature corresponding to the video to be retrieved.
Hereinafter, steps S51 to S55 will be explained. Specifically, the color histogram of each key frame image is first calculated to obtain its color histogram feature; the color histogram feature of each key frame image is then spliced with its key frame image feature to obtain the splicing feature corresponding to that key frame image. The splicing features are input into the time sequence model and learned to obtain the first time sequence features corresponding to the splicing features. Finally, the first time sequence features are input into the video feature extraction model to obtain the video level feature corresponding to the video to be retrieved; and the first time sequence features are input into the shot boundary prediction model to obtain first segment sub-features, which are input into the video feature extraction model to obtain the video segment level features of the video to be retrieved. Splicing the color histogram feature with the key frame image feature further improves the scale invariance of the key frame image features.
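A small sketch of steps S51 and S52 follows, assuming OpenCV for the histogram; the bin count and the reading of "splicing" as plain concatenation are assumptions, not specified by the disclosure.

```python
# Sketch: color histogram feature of one key frame and its splicing with
# the CNN feature. Bin count (16 per channel) is an illustrative assumption.
import cv2
import numpy as np

def color_histogram_feature(image_bgr: np.ndarray, bins: int = 16) -> np.ndarray:
    # One histogram per color channel, flattened into a single vector (S51).
    chans = [cv2.calcHist([image_bgr], [c], None, [bins], [0, 256])
             for c in range(3)]
    hist = np.concatenate([h.ravel() for h in chans])
    return hist / (hist.sum() + 1e-8)  # normalize to be scale-independent

def splice_features(cnn_feat: np.ndarray, hist_feat: np.ndarray) -> np.ndarray:
    # "Splicing" (S52) is read here as plain concatenation of the two vectors.
    return np.concatenate([cnn_feat, hist_feat])
```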
In an exemplary embodiment of the disclosure, referring to fig. 6, inputting the key frame image features into the time sequence model, and learning the key frame image features to obtain time sequence features corresponding to the key frame image features may include steps S61-S63:
s61, multiplying each key frame image feature by the initialization matrix of the time sequence model to obtain a query vector, a key vector and a value vector corresponding to each key frame image feature; the initialization matrix comprises three matrixes;
s62, calculating the correlation between each query vector corresponding to the key frame image characteristics and the key vector through a similarity function;
and S63, obtaining time sequence characteristics corresponding to the key frame image characteristics according to the correlation and the value vector corresponding to each key frame image characteristic.
Hereinafter, steps S61 to S63 will be explained. Specifically, after the key frame image features of the video to be retrieved are obtained, in order to generate a more discriminative feature representation for each key frame image and improve the robustness of the video fingerprint representation, the key frame image features are learned through a self-attention model to obtain the time sequence features of the key frame images. The computation of the self-attention model may include: calculating weight coefficients from the query vectors and key vectors, and computing a weighted sum of the value vectors according to these weight coefficients to obtain the time sequence features. Specifically, each key frame image feature is first multiplied by the three initialization matrices of the time sequence model to obtain its query vector, key vector and value vector. Then, the similarity between each query vector and the key vectors is calculated through a similarity function and normalized with a softmax function; the similarity function may be a dot product, concatenation, or a perceptron, and is not specifically limited in this exemplary embodiment. Finally, a weighted sum of the similarities and the value vectors corresponding to each key frame image feature yields the time sequence feature corresponding to that key frame image feature. The time sequence model combines the bottom-layer features and high-layer semantic features of the key frame images with the temporal relationships between the key frame images, improving the robustness of the video fingerprint representation.
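The self-attention computation of steps S61-S63 can be written out directly. The sketch below uses the dot-product similarity with softmax normalization described above; the feature dimension, the scaling by the square root of the key dimension, and the random matrices are illustrative assumptions.

```python
# Worked sketch of steps S61-S63 with NumPy.
import numpy as np

def self_attention(X: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    # X: (N, d) key frame image features; Wq/Wk/Wv: the three
    # initialization matrices of the time sequence model (S61).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Correlation between each query vector and every key vector via a
    # dot-product similarity function (S62); scaling is an assumption.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax normalization
    # Weighted sum of the value vectors gives the time sequence features (S63).
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))                      # 5 key frames, 64-dim features
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
timing_features = self_attention(X, Wq, Wk, Wv)   # shape (5, 64)
```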
After the time sequence features of the key frame images are obtained, the time sequence feature corresponding to each key frame image is input into the video feature extraction model to obtain the video level feature of the video to be retrieved. As shown in fig. 7, inputting the time sequence features into the video feature extraction model to obtain the video level feature corresponding to the video to be retrieved may include steps S71 and S72:
s71, splicing the time sequence characteristics corresponding to the key frame images to obtain first spliced time sequence characteristics;
and S72, inputting the first spliced time sequence feature into the video feature extraction model, and pooling and normalizing the first spliced time sequence feature to obtain the video level feature corresponding to the video to be retrieved.
Hereinafter, steps S71 and S72 will be explained. Specifically, the time sequence features corresponding to the key frame images are obtained through the time sequence model, and the multiple time sequence features are spliced into a single feature, namely the first spliced time sequence feature; the first spliced time sequence feature is then input into the video feature extraction model, where it is pooled and normalized to obtain the video level feature of the video to be retrieved.
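Steps S71 and S72 reduce the per-frame time sequence features to one video level feature. A minimal sketch, assuming mean pooling and L2 normalization (the disclosure says only "pooling and normalizing", so both choices are assumptions):

```python
# Sketch of steps S71-S72.
import numpy as np

def video_level_feature(timing_features: np.ndarray) -> np.ndarray:
    # timing_features: (N, d), one row per key frame image.
    pooled = timing_features.mean(axis=0)              # pooling over time
    return pooled / (np.linalg.norm(pooled) + 1e-8)    # normalization
```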
After the time sequence features are obtained, they may further be input into the shot boundary prediction model to obtain the video segment level features of the video to be retrieved. In an exemplary embodiment of the present disclosure, referring to fig. 8, inputting the time sequence features into the shot boundary prediction model to obtain segment sub-features, and inputting the segment sub-features into the video feature extraction model to obtain the video segment level features corresponding to the video to be retrieved, may include steps S81-S83:
s81, converting the time sequence characteristics into a plurality of first probability values through an activation function in the shot boundary prediction model, and taking a key frame image corresponding to any one first probability value as a shot boundary of the video to be retrieved when any one first probability value is determined to be greater than a preset probability value;
s82, converting the time sequence characteristics into segment sub-characteristics according to the shot boundary of the video to be retrieved;
and S83, inputting the sub-features of the segments into the video feature extraction model to obtain the video segment level features corresponding to the video to be retrieved.
Hereinafter, steps S81 to S83 will be explained. Specifically, the shot boundary prediction model is composed of several fully-connected layers. After the time sequence features are input into the shot boundary prediction model, the fully-connected layers convert them into a 1 x N dimensional feature, where N is the number of key frame images input into the video fingerprint extraction model. An activation function then converts this feature into N first probability values; when any first probability value is greater than the preset probability value, the corresponding key frame image is considered a shot boundary of the video to be retrieved. Next, the time sequence features are converted into a plurality of segment sub-features according to the shot boundaries of the video to be retrieved. Finally, the segment sub-features are input into the video feature extraction model to obtain the video segment level features of the video to be retrieved. In this way the shot boundary prediction model predicts shot boundaries from the input time sequence features and converts the time sequence features into segment sub-features according to the predicted boundaries.
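A minimal sketch of the boundary decision in step S81 follows. The sigmoid activation and the 0.5 preset probability value are assumptions; the disclosure fixes only that an activation function yields first probability values that are compared against a preset value.

```python
# Sketch of step S81: per-frame boundary probabilities and thresholding.
import numpy as np

def predict_shot_boundaries(logits: np.ndarray, threshold: float = 0.5):
    # logits: (N,) outputs of the fully-connected layers, one per key frame.
    probs = 1.0 / (1.0 + np.exp(-logits))  # activation function (sigmoid)
    # A key frame whose probability exceeds the preset value is a boundary.
    return [i for i, p in enumerate(probs) if p > threshold]
```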
Further, in an exemplary embodiment of the present disclosure, referring to fig. 9, converting the time-series feature into a segment sub-feature according to a shot boundary of the video to be retrieved may include steps S91 to S93:
s91, forming a characteristic sequence by using the time sequence characteristics corresponding to the key frame image;
s92, acquiring a shot boundary of the video to be retrieved, which is included in a key frame image of the video to be retrieved, and a first position of a time sequence feature corresponding to the shot boundary in the feature sequence;
and S93, acquiring the time sequence characteristics of the non-shot boundary included before the first position, and splicing the time sequence characteristics of the non-shot boundary and the time sequence characteristics corresponding to the shot boundary to obtain the sub-characteristics of the segment.
Hereinafter, steps S91 to S93 will be explained. Specifically, the time sequence features of the key frame images first form a feature sequence, which may be denoted {X1, X2, X3, …, Xn}, where n is a positive integer. Then, the shot boundaries of the video to be retrieved, the time sequence features corresponding to those boundaries, and the first positions of these features in the feature sequence are obtained. Finally, the non-shot-boundary time sequence features preceding each first position are acquired and spliced with the time sequence feature of the corresponding shot boundary to obtain the segment sub-features.
For example, when the number of time sequence features is 5, the time sequence features corresponding to the 5 key frame images may be represented by the feature sequence {X1, X2, X3, X4, X5}. If key frame image 2 and key frame image 5 are shot boundaries of the video to be retrieved, the positions of the boundary time sequence features in the feature sequence are determined first: the position of X2 is 2 and the position of X5 is 5. Then the non-shot-boundary time sequence feature X1 before position 2 is obtained, as are the non-shot-boundary time sequence features X3 and X4 before position 5. Finally, the time sequence feature X2 at position 2 is spliced with the non-shot-boundary time sequence feature X1 before it to obtain a first segment sub-feature, and the time sequence feature X5 at position 5 is spliced with the non-shot-boundary time sequence features X3 and X4 before it to obtain a second segment sub-feature; the first and second segment sub-features are then input into the video feature extraction model to obtain 2 video segment level features of the video to be retrieved. In this way, the number of features is reduced compared with the number input to the shot boundary prediction model; the segment sub-features output by the shot boundary prediction model are input into the video feature extraction model to obtain the video segment level features corresponding to the video to be retrieved.
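As a minimal sketch of this splitting step, reproducing the example above (the function name and the 1-based positions are illustrative only):

```python
from typing import List

def split_into_segments(features: List[str], boundaries: List[int]) -> List[List[str]]:
    """Splice each shot-boundary feature with the non-shot-boundary
    features that precede it; boundary positions are 1-based, matching
    the worked example above."""
    segments, start = [], 0
    for pos in sorted(boundaries):
        segments.append(features[start:pos])  # e.g. [X1, X2] for pos 2
        start = pos
    return segments

print(split_into_segments(["X1", "X2", "X3", "X4", "X5"], [2, 5]))
# [['X1', 'X2'], ['X3', 'X4', 'X5']] -> two segment sub-features
```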
In step S3, the video fingerprint is used to perform retrieval, and a video similar to the video to be retrieved is obtained.
In an exemplary embodiment of the present disclosure, after the video fingerprint of the video to be retrieved, that is, its video level features and video segment level features, is obtained, a two-stage retrieval may be performed according to the video fingerprint to obtain videos and/or video segments similar to the video to be retrieved. Referring to fig. 10, retrieving by using the video fingerprint to obtain a video similar to the video to be retrieved may include steps S1001 to S1003:
S1001, acquiring video level features corresponding to the videos included in a video database, and constructing a search tree according to the video level features included in the video database;
S1002, searching in the search tree by using the video level features included in the video fingerprint set to obtain a plurality of candidate videos similar to the video to be retrieved;
and S1003, acquiring the video segment level characteristics of the candidate video, and acquiring the video and/or the video segment most similar to the video to be retrieved through the video segment level characteristics corresponding to the video to be retrieved and the video segment level characteristics of the candidate video.
Steps S1001 to S1003 will now be explained. Specifically, retrieval is first performed through the video level features of the video to be retrieved: the video database is searched according to these features to determine a plurality of candidate videos most similar to the video to be retrieved. When searching the video database by video level features, NNS (nearest neighbor search, also called nearest point search, i.e. the optimization problem of finding the closest point in a metric space) may be used to improve retrieval efficiency; other search methods may also be used, which is not specifically limited in this exemplary embodiment. After the plurality of candidate videos similar to the video to be retrieved are obtained, videos and/or video segments similar to the video to be retrieved can be obtained by comparing the video segment level features of the video to be retrieved with those of each candidate video one by one.
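A minimal sketch of the two-stage retrieval could look as follows; the use of scikit-learn's kd-tree for the nearest neighbor search and cosine similarity for the segment-level comparison are assumptions, since the disclosure only requires a search tree and an NNS-style first stage:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stage 1: build a search tree over the video level features of the
# database and find the x candidate videos nearest to the query.
db_video_feats = np.random.rand(1000, 512).astype("float32")
tree = NearestNeighbors(n_neighbors=10, algorithm="kd_tree").fit(db_video_feats)
query_feat = np.random.rand(1, 512).astype("float32")
_, idx = tree.kneighbors(query_feat)
candidate_ids = idx[0]                                  # x candidate video ids

# Stage 2: compare video segment level features one by one and keep
# the candidate whose best segment similarity is highest.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

query_segments = [np.random.rand(512) for _ in range(3)]
db_segments = {vid: [np.random.rand(512) for _ in range(3)] for vid in candidate_ids}

best_video = max(
    candidate_ids,
    key=lambda vid: max(cosine(q, c) for q in query_segments for c in db_segments[vid]),
)
```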
Through the two-stage retrieval, comparing every key frame image feature of the video to be retrieved against the key frame image features of every video in the video library is avoided, which improves retrieval speed. In the related art, if the video fingerprint of the video to be retrieved comprises y key frame image features and the video database contains m videos, each comprising y features, then m × y comparisons are needed. In the present disclosure, if the video fingerprint of the video to be retrieved comprises n video segment level features, the number of candidate videos similar to the video to be retrieved is x, and each candidate video comprises n features, then the two-stage retrieval performs x + n × n comparisons. Since the number n of video segment level features included in a video fingerprint of the present disclosure is smaller than the number y of features included in a video fingerprint of the related art, the retrieval speed of the present disclosure is increased by approximately (m × y) / (x + n × n) times compared with the related art.
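As a purely illustrative calculation with hypothetical values: if the database contains m = 10,000 videos with y = 50 key frame image features each, the related art performs 10,000 × 50 = 500,000 comparisons; with x = 100 candidate videos and n = 10 video segment level features per video, the two-stage retrieval performs 100 + 10 × 10 = 200 comparisons, i.e. roughly a 500,000 / 200 = 2,500-fold reduction.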
In summary, the method provided by the present disclosure may be applied to retrieve videos and/or video segments similar to a video to be retrieved from a massive video database. When retrieving, the key frame images of the video to be retrieved are first acquired and input into the key frame image feature extraction model of the video fingerprint extraction model to obtain key frame image features. The color histogram features of the key frame images are then calculated, the key frame image features and the color histogram features are spliced to obtain splicing features, and the splicing features are input into the time sequence model, which learns them to obtain time sequence features. The time sequence features are input into the video feature extraction model to obtain the video level features of the video to be retrieved; they are also input into the shot boundary prediction model and converted into a plurality of segment sub-features, which are in turn input into the video feature extraction model to obtain a plurality of video segment level features of the video to be retrieved. Finally, the video database is searched according to the video level features of the video to be retrieved to obtain a plurality of candidate videos similar to the video to be retrieved, and the plurality of video segment level features of the video to be retrieved are compared with the video segment level features of the candidate videos to obtain the videos and/or video segments similar to the video to be retrieved. First, the bottom-layer features, high-layer semantic features, color histogram features and time sequence features of the key frame images are fused during feature extraction, which improves the robustness of the video fingerprint representation; second, the number of video fingerprint features is reduced, which lowers the storage cost of the video fingerprints; third, the two-stage retrieval reduces the number of comparisons between the video fingerprint of the video to be retrieved and the video database, which improves retrieval speed.
Exemplary devices
Having introduced the video retrieval method of the exemplary embodiment of the present disclosure, next, a video retrieval apparatus of the exemplary embodiment of the present disclosure is described with reference to fig. 11.
Referring to fig. 11, the video retrieval apparatus 11 of the exemplary embodiment of the present disclosure may include: a key frame image acquisition module 1101, a video fingerprint acquisition module 1102 and a video retrieval module 1103; wherein:
the key frame image obtaining module 1101 may be configured to obtain a video to be retrieved, extract a key frame included in the video to be retrieved, and obtain a plurality of key frame images according to the key frame.
The video fingerprint obtaining module 1102 may be configured to input the plurality of key frame images into a video fingerprint extraction model, obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtain a video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics.
The video retrieval module 1103 may be configured to perform retrieval by using the video fingerprint to obtain a video similar to the video to be retrieved.
According to an exemplary embodiment of the present disclosure, the video fingerprint extraction model includes at least a shot boundary prediction model.
According to an exemplary embodiment of the present disclosure, the video fingerprint extraction model further includes a key frame image feature extraction model, a timing model, and a video feature extraction model.
According to an exemplary embodiment of the present disclosure, the video fingerprint acquisition module includes:
a key frame image feature obtaining module, configured to input the plurality of key frame images into the key frame image feature extraction model, and obtain, through a depth feature network in the key frame image feature extraction model, a key frame image feature corresponding to each of the key frame images;
the time sequence characteristic acquisition module is used for inputting the key frame image characteristics into the time sequence model and learning the key frame image characteristics to obtain time sequence characteristics corresponding to the key frame image characteristics;
the video level characteristic acquisition module is used for inputting the time sequence characteristics into the video characteristic extraction model to obtain video level characteristics corresponding to the video to be retrieved;
and the video segment level feature acquisition module is used for inputting the time sequence features into the shot boundary prediction model to obtain segment sub-features, and inputting the segment sub-features into the video feature extraction model to obtain video segment level features corresponding to the video to be retrieved.
According to an exemplary embodiment of the present disclosure, the video retrieval apparatus further includes:
the color histogram feature extraction module is used for calculating color histograms of the plurality of key frame images to obtain color histogram features corresponding to the key frame images;
the splicing feature acquisition module is used for splicing the color histogram feature of each key frame image and the key frame image feature to obtain a splicing feature corresponding to each key frame image;
the first time sequence feature acquisition module is used for inputting the splicing features into the time sequence model, learning the splicing features and obtaining first time sequence features corresponding to the splicing features;
the video level characteristic acquisition module is used for inputting the first time sequence characteristic into the video characteristic extraction model to obtain a video level characteristic corresponding to the video to be retrieved;
and the video segment level feature acquisition module is used for inputting the first time sequence feature into the shot boundary prediction model to obtain a first segment sub-feature, and inputting the first segment sub-feature into the video feature extraction model to obtain a video segment level feature corresponding to the video to be retrieved.
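As a concrete illustration of the color histogram feature extraction and splicing modules above, the following sketch computes a per-channel histogram with OpenCV and concatenates it with a key frame image feature; the bin count and feature dimensions are illustrative assumptions:

```python
import cv2
import numpy as np

def color_histogram_feature(image_bgr: np.ndarray, bins: int = 32) -> np.ndarray:
    """Per-channel color histogram, normalized and flattened;
    the bin count is an illustrative choice."""
    hists = [
        cv2.calcHist([image_bgr], [c], None, [bins], [0, 256]).flatten()
        for c in range(3)  # B, G, R channels
    ]
    hist = np.concatenate(hists)
    return hist / (hist.sum() + 1e-8)

def splice_features(cnn_feat: np.ndarray, hist_feat: np.ndarray) -> np.ndarray:
    # splicing feature = key frame image feature + color histogram feature
    return np.concatenate([cnn_feat, hist_feat])

frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
spliced = splice_features(np.random.rand(512).astype("float32"),
                          color_histogram_feature(frame))
```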
According to an exemplary embodiment of the present disclosure, the timing characteristic acquisition module includes:
the key frame image feature initial processing module is used for multiplying each key frame image feature by the initialization matrix of the time sequence model to obtain a query vector, a key vector and a value vector corresponding to each key frame image feature; wherein the initialization matrix comprises three matrices;
a correlation calculation module, configured to calculate, through a similarity function, a correlation between each of the query vectors corresponding to the key frame image features and the key vector;
and the time sequence characteristic acquisition module is used for acquiring the time sequence characteristics corresponding to the key frame image characteristics according to the correlation and the value vector corresponding to each key frame image characteristic.
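The query/key/value construction described above matches standard scaled dot-product self-attention; the following sketch, which uses a softmax-normalized dot product as the similarity function (an assumption, since the disclosure does not name the similarity function), shows how the timing features could be derived:

```python
import numpy as np

def timing_features(key_frame_feats: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Each key frame image feature is multiplied by three initialization
    matrices to obtain query, key and value vectors; the correlation
    between queries and keys then weights the value vectors to give the
    timing features."""
    Q, K, V = key_frame_feats @ Wq, key_frame_feats @ Wk, key_frame_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # similarity function
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # correlations
    return weights @ V

d = 64
rng = np.random.default_rng(0)
feats = rng.random((5, d))                                  # five key frame features
out = timing_features(feats, *(rng.standard_normal((d, d)) for _ in range(3)))
```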
According to an exemplary embodiment of the present disclosure, the video level feature obtaining module includes:
the first splicing time sequence characteristic determining module is used for splicing the time sequence characteristics corresponding to the key frame images to obtain first splicing time sequence characteristics;
and the video level characteristic acquisition module is used for inputting the first splicing time sequence characteristic into the characteristic extraction model, and pooling and normalizing the first splicing time sequence characteristic to obtain a video level characteristic corresponding to the video to be retrieved.
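A minimal sketch of this module, assuming mean pooling and L2 normalization (the disclosure says only "pooling and normalizing"):

```python
import numpy as np

def video_level_feature(timing_feats: np.ndarray) -> np.ndarray:
    """Stack the spliced timing features, pool across key frames, then
    normalize; mean pooling and L2 normalization are assumptions."""
    pooled = timing_feats.mean(axis=0)                  # pooling over frames
    return pooled / (np.linalg.norm(pooled) + 1e-8)     # normalization

feat = video_level_feature(np.random.rand(5, 512))      # one 512-d video level feature
```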
According to an exemplary embodiment of the present disclosure, the video clip level feature acquisition module includes:
a shot boundary determining module, configured to convert the plurality of time sequence features into a plurality of first probability values through an activation function in the shot boundary prediction model, and when it is determined that any one of the first probability values is greater than a preset probability value, take a key frame image corresponding to any one of the first probability values as a shot boundary of the video to be retrieved;
the segment sub-feature acquisition module is used for converting the time sequence feature into segment sub-features according to the shot boundary of the video to be retrieved;
and the video segment level feature determination module is used for inputting the segment sub-features into the video feature extraction model to obtain the video segment level features corresponding to the video to be retrieved.
According to an exemplary embodiment of the present disclosure, the segment sub-feature obtaining module includes:
the characteristic sequence construction module is used for constructing the time sequence characteristics corresponding to the key frame images into a characteristic sequence;
the first position determining module is used for acquiring a shot boundary of the video to be retrieved, which is included in a key frame image of the video to be retrieved, and a first position of a time sequence feature corresponding to the shot boundary in the feature sequence;
and the segment sub-feature acquisition module is used for acquiring the time sequence feature of the non-shot boundary included before the first position, and splicing the time sequence feature of the non-shot boundary and the time sequence feature corresponding to the shot boundary to obtain the segment sub-feature.
According to an exemplary embodiment of the present disclosure, the video retrieval apparatus further includes:
the key frame determining module is used for analyzing the video to be retrieved through an encoding and decoding protocol to obtain key frames included in the video to be retrieved;
and the key frame image extraction module is used for extracting the key frame images included in the video to be retrieved according to the key frames.
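The disclosure does not name a specific codec tool for this parsing step; as one possible sketch, the ffmpeg CLI can select the codec-level key frames (I-frames) and dump them as images:

```python
import subprocess

def extract_key_frame_images(video_path: str, out_pattern: str = "kf_%04d.png") -> None:
    """Dump the codec-level key frames (I-frames) of a video as images.
    Using the ffmpeg CLI here is an assumption; the disclosure only
    refers to parsing via an encoding and decoding protocol."""
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", "select='eq(pict_type,I)'",  # keep only key frames
            "-vsync", "vfr",                    # one image per selected frame
            out_pattern,
        ],
        check=True,
    )
```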
According to an exemplary embodiment of the present disclosure, the video retrieval module includes:
the search tree construction module is used for acquiring video level characteristics corresponding to videos in a video database and constructing a search tree according to the video level characteristics in the video database;
the candidate video determining module is used for searching in the search tree by using the video level characteristics included in the video fingerprint set to obtain a plurality of candidate videos similar to the video to be retrieved;
and the video retrieval module is used for acquiring the video segment level characteristics of the candidate video and obtaining the video and/or the video segment most similar to the video to be retrieved through the video segment level characteristics corresponding to the video to be retrieved and the video segment level characteristics of the candidate video.
Since each functional module of the video retrieval apparatus according to the embodiment of the present disclosure is the same as in the embodiment of the video retrieval method above, it is not described herein again.
Exemplary storage Medium
Having described the video retrieval method and apparatus of the exemplary embodiments of the present disclosure, a computer-readable storage medium of the exemplary embodiments of the present disclosure is explained next with reference to fig. 12.
Referring to fig. 12, a program product 1200 for implementing the above method according to an embodiment of the present disclosure is described. It may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary electronic device
Having described the storage medium of the exemplary embodiment of the present disclosure, next, an electronic device of the exemplary embodiment of the present disclosure is explained with reference to fig. 13.
The electronic device 1300 shown in fig. 13 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 13, the electronic device 1300 is in the form of a general purpose computing device. The components of the electronic device 1300 may include, but are not limited to: the at least one processing unit 1310, the at least one memory unit 1320, the bus 1330 connecting the various system components (including the memory unit 1320 and the processing unit 1310), the display unit 1340.
Wherein the memory unit stores program code that is executable by the processing unit 1310 to cause the processing unit 1310 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 1310 may perform steps S1 through S3 as shown in fig. 1.
The memory unit 1320 may include volatile memory units, such as a random access memory unit (RAM) 13201 and/or a cache memory unit 13202, and may further include a read-only memory unit (ROM) 13203.
Storage unit 1320 may also include a program/utility 13204 having a set (at least one) of program modules 13205, such program modules 13205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1330 may include a data bus, an address bus, and a control bus.
The electronic device 1300 may also communicate with one or more external devices 1400 (e.g., keyboard, pointing device, Bluetooth device, etc.) via an input/output (I/O) interface 1350. The electronic device 1300 also includes a display unit 1340 connected to the input/output (I/O) interface 1350 for display. Also, the electronic device 1300 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the internet) through the network adapter 1360. As shown, the network adapter 1360 communicates with other modules of the electronic device 1300 via the bus 1330. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several modules or sub-modules of the video retrieval apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided into a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for video retrieval, comprising:
acquiring a video to be retrieved, extracting key frames included in the video to be retrieved, and obtaining a plurality of key frame images according to the key frames;
inputting the plurality of key frame images into a video fingerprint extraction model to obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtaining the video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics;
and retrieving by using the video fingerprints to obtain a video similar to the video to be retrieved.
2. The video retrieval method of claim 1, wherein the video fingerprint extraction model comprises at least a shot boundary prediction model.
3. The video retrieval method of claim 2, wherein the video fingerprint extraction model further comprises a key frame image feature extraction model, a time sequence model and a video feature extraction model.
4. The method according to claim 3, wherein inputting the plurality of key frame images into the video fingerprint extraction model to obtain video level features and video segment level features corresponding to the video to be retrieved comprises:
inputting the plurality of key frame images into the key frame image feature extraction model, and obtaining key frame image features corresponding to the key frame images through a depth feature network in the key frame image feature extraction model;
inputting the key frame image features into the time sequence model, and learning the key frame image features to obtain time sequence features corresponding to the key frame image features;
inputting the time sequence characteristics into the video characteristic extraction model to obtain video level characteristics corresponding to the video to be retrieved;
and inputting the time sequence characteristics into the shot boundary prediction model to obtain segment sub-characteristics, and inputting the segment sub-characteristics into the video characteristic extraction model to obtain video segment level characteristics corresponding to the video to be retrieved.
5. The video retrieval method of claim 4, wherein inputting the time-series feature into the video feature extraction model to obtain a video-level feature corresponding to the video to be retrieved comprises:
splicing the time sequence characteristics corresponding to the key frame images to obtain first spliced time sequence characteristics;
and inputting the first splicing time sequence feature into the feature extraction model, and pooling and normalizing the first splicing time sequence feature to obtain a video-level feature corresponding to the video to be retrieved.
6. The video retrieval method of claim 5, wherein inputting the time sequence features into the shot boundary prediction model to obtain segment sub-features, and inputting the segment sub-features into the video feature extraction model to obtain video segment-level features corresponding to the video to be retrieved, comprises:
converting the plurality of time sequence characteristics into a plurality of first probability values through an activation function in the shot boundary prediction model, and taking a key frame image corresponding to any one first probability value as a shot boundary of the video to be retrieved when the fact that any one first probability value is larger than a preset probability value is determined;
converting the time sequence characteristics into segment sub-characteristics according to the shot boundary of the video to be retrieved;
and inputting the sub-features of the segments into the video feature extraction model to obtain video segment-level features corresponding to the video to be retrieved.
7. The video retrieval method of claim 6, wherein converting the time sequence features into segment sub-features according to the shot boundaries of the video to be retrieved comprises:
forming a characteristic sequence by the time sequence characteristics corresponding to the key frame images;
acquiring a shot boundary of the video to be retrieved, which is included in a key frame image of the video to be retrieved, and a first position of a time sequence feature corresponding to the shot boundary in the feature sequence;
and acquiring the time sequence characteristics of the non-shot boundary included before the first position, and splicing the time sequence characteristics of the non-shot boundary and the time sequence characteristics corresponding to the shot boundary to obtain the sub-characteristics of the segment.
8. A video retrieval apparatus, comprising:
the key frame image acquisition module is used for acquiring a video to be retrieved, extracting key frames included in the video to be retrieved and obtaining a plurality of key frame images according to the key frames;
the video fingerprint acquisition module is used for inputting the plurality of key frame images into a video fingerprint extraction model to obtain video level characteristics and video segment level characteristics corresponding to a video to be retrieved, and obtaining the video fingerprint of the video to be retrieved according to the video level characteristics and the video segment level characteristics;
and the video retrieval module is used for retrieving by utilizing the video fingerprints to obtain videos similar to the videos to be retrieved.
9. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the video retrieval method of any of claims 1-7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the video retrieval method of any of claims 1-7 via execution of the executable instructions.
CN202110962699.XA 2021-08-20 2021-08-20 Video retrieval method and device, computer-readable storage medium and electronic equipment Pending CN113656639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110962699.XA CN113656639A (en) 2021-08-20 2021-08-20 Video retrieval method and device, computer-readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110962699.XA CN113656639A (en) 2021-08-20 2021-08-20 Video retrieval method and device, computer-readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113656639A true CN113656639A (en) 2021-11-16

Family

ID=78480625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110962699.XA Pending CN113656639A (en) 2021-08-20 2021-08-20 Video retrieval method and device, computer-readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113656639A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115760728A (en) * 2022-11-07 2023-03-07 广东祥利塑料有限公司 Performance analysis method and system of irradiation-resistant rubber material based on data processing


Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
Biswas et al. Linear support tensor machine with LSK channels: Pedestrian detection in thermal infrared images
Fernando et al. Rank pooling for action recognition
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
JP7091468B2 (en) Methods and systems for searching video time segments
JP5774985B2 (en) Image similarity search system and method
KR20200015610A (en) Pedestrian re-identification methods, devices, electronic devices and storage media
CN116431847B (en) Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN110046660B (en) Product quantization method based on semi-supervised learning
CN115443490A (en) Image auditing method and device, equipment and storage medium
Nian et al. Efficient near-duplicate image detection with a local-based binary representation
CN112214623A (en) Image-text sample-oriented efficient supervised image embedding cross-media Hash retrieval method
CN112801068A (en) Video multi-target tracking and segmenting system and method
CN114937285B (en) Dynamic gesture recognition method, device, equipment and storage medium
CN111382620A (en) Video tag adding method, computer storage medium and electronic device
CN114663957A (en) Face detection method, and training method and device of face detection model
Chang et al. Unsupervised video shot detection using clustering ensemble with a color global scale-invariant feature transform descriptor
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
Liang et al. Human action segmentation and classification based on the Isomap algorithm
CN113656639A (en) Video retrieval method and device, computer-readable storage medium and electronic equipment
CN117058595B (en) Video semantic feature and extensible granularity perception time sequence action detection method and device
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
bibi Joolee et al. Video retrieval based on image queries using THOG for augmented reality environments
CN107291813B (en) Example searching method based on semantic segmentation scene
Hettiarachchi Analysis of different face detection andrecognition models for Android

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination