CN116361507A - Video retrieval model construction method, device, equipment and storage medium - Google Patents

Video retrieval model construction method, device, equipment and storage medium

Info

Publication number
CN116361507A
Authority
CN
China
Prior art keywords
video
constructing
videos
model
query
Prior art date
Legal status
Pending
Application number
CN202111618713.0A
Other languages
Chinese (zh)
Inventor
敖吉
胡传锐
江大山
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN202111618713.0A
Publication of CN116361507A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for constructing a video retrieval model, belonging to the technical field of video retrieval. The method comprises the steps of: screening a first video set and a second video set from a collected video set; constructing a video feature extraction model from the first video set; determining the display videos and query videos of a video library from the second video set; and constructing a video retrieval model from the video feature extraction model, the display videos and the query videos. Building the retrieval model from the feature extraction model, the display videos and the query videos greatly simplifies the preprocessing required during model construction, ensures high model accuracy and improves retrieval speed.

Description

Video retrieval model construction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of video retrieval technologies, and in particular, to a method, an apparatus, a device, and a storage medium for constructing a video retrieval model.
Background
The purpose of video retrieval is, given a target video, to find the most similar results among a large number of videos; it is commonly used in fields such as video information acquisition and video de-duplication.
In recent years, newer video retrieval methods such as NetVLAD and NeXtVLAD have been proposed. NetVLAD was originally used to aggregate spatial features in place recognition, and was found to be more effective and faster than conventional temporal models (LSTM/GRU) for aggregating visual and auditory features. One of the main drawbacks of NetVLAD is its high feature dimension, so a large classification model built on such features requires millions of parameters. Inspired by ResNeXt, a new network architecture, NeXtVLAD, was developed. Unlike NetVLAD, the input features are decomposed into a group of relatively low-dimensional vectors with attention before aggregation and encoding. The underlying assumption is that a video frame may contain multiple objects, so decomposing the frame-level features before encoding yields a simpler video representation. The NeXtVLAD model converges faster and is less prone to overfitting. However, the preprocessing these retrieval algorithms require when constructing a model is very complex and computationally expensive, and high model accuracy cannot be achieved together with fast retrieval.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the present invention is to provide a video retrieval model construction method, apparatus, device and storage medium, aiming to solve the technical problem that retrieval algorithms in the prior art require complex preprocessing during model construction and cannot achieve both high model accuracy and high retrieval speed.
In order to achieve the above object, the present invention provides a method for constructing a video retrieval model, comprising the following steps:
screening a first video set and a second video set from the collected video sets;
constructing a video feature extraction model according to the first video set;
determining display videos and query videos corresponding to a video library according to the second video set;
and constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
Optionally, the screening the first video set and the second video set from the collected video sets includes:
acquiring video duration of each video in the acquired video set;
comparing the video duration of each video with a preset duration respectively;
and screening the first video set and the second video set from the video sets according to the video duration comparison result.
Optionally, the screening the first video set and the second video set from the video sets according to the video duration comparison result includes:
determining, according to the duration comparison, the videos in the video set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration;
and taking the set of videos whose duration is greater than the preset duration as the first video set, and the set of videos whose duration is less than or equal to the preset duration as the second video set.
Optionally, the constructing a video feature extraction model according to the first video set includes:
constructing a pre-training model according to a preset image data set;
acquiring a plurality of video frames corresponding to the first video set;
and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
Optionally, the acquiring a plurality of video frames corresponding to the first video set includes:
and respectively extracting a preset number of video frames from each video in the first video set to obtain a plurality of video frames corresponding to the first video set.
Optionally, the determining, according to the second video set, the display video and the query video corresponding to the video library includes:
taking each video in the second video set as a display video corresponding to a video library;
screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set;
randomly intercepting video clips from each reference video according to the target duration;
and taking the video clips corresponding to the reference videos as the query videos of the video library.
Optionally, the constructing a video retrieval model according to the video feature extraction model, the presentation video and the query video includes:
extracting display video features corresponding to the display video and query video features corresponding to the query video through the video feature extraction model;
constructing feature vectors corresponding to all videos in a video library according to the display video features and the query video features;
and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
Optionally, the constructing feature vectors corresponding to each video in the video library according to the display video features and the query video features includes:
performing feature fusion on the display video features and the query video features to obtain target video features;
and constructing feature vectors of preset dimensions corresponding to each video in the video library according to the target video features.
Optionally, after the video retrieval model is constructed according to the video feature extraction model, the presentation video and the query video, the method further includes:
acquiring the video features to be searched of the video to be searched through the video retrieval model;
constructing a video feature vector to be searched corresponding to the video to be searched according to the video feature to be searched;
inquiring a target video corresponding to the video to be searched from the video library according to the video feature vector to be searched.
Optionally, the querying, from the video library, the target video corresponding to the video to be retrieved according to the feature vector of the video to be retrieved includes:
acquiring feature vectors corresponding to all videos in the video library;
determining video similarity between the video to be searched and each video in the video library according to the video feature vector to be searched and a plurality of feature vectors corresponding to the video library;
inquiring a target video corresponding to the video to be retrieved from the video library according to the video similarity.
Optionally, the querying, from the video library, the target video corresponding to the video to be retrieved according to the video similarity includes:
ordering the video similarity between the video to be retrieved and each video in the video library;
and taking the video corresponding to the maximum video similarity in the video library as a target video based on the video similarity sorting result.
In addition, to achieve the above object, the present invention also proposes a video retrieval model construction apparatus including:
the screening module is used for screening the first video set and the second video set from the collected video sets;
the construction module is used for constructing a video feature extraction model according to the first video set;
the creation module is used for determining a display video and a query video corresponding to the video library according to the second video set;
and the fusion module is used for constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
Optionally, the screening module is further configured to obtain a video duration of each video in the collected video set; compare the video duration of each video with a preset duration respectively; and screen the first video set and the second video set from the video set according to the video duration comparison result.
Optionally, the screening module is further configured to determine, according to the video duration comparison, the videos in the video set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration; and to take the set of videos whose duration is greater than the preset duration as the first video set, and the set of videos whose duration is less than or equal to the preset duration as the second video set.
Optionally, the construction module is further configured to construct a pre-training model according to a preset image dataset; acquiring a plurality of video frames corresponding to the first video set; and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
Optionally, the construction module is further configured to extract a preset number of video frames from each video in the first video set, so as to obtain a plurality of video frames corresponding to the first video set.
Optionally, the creating module is further configured to use each video in the second video set as a display video corresponding to a video library; screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set; randomly intercepting video clips from each reference video according to the target duration; and taking the video clips corresponding to the reference videos as query videos of the video library.
Optionally, the building module is further configured to extract, by using the video feature extraction model, a display video feature corresponding to the display video and a query video feature corresponding to the query video; constructing feature vectors corresponding to all videos in a video library according to the display video features and the query video features; and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
In addition, in order to achieve the above object, the present invention also proposes a video retrieval model construction apparatus including: a memory, a processor, and a video retrieval model construction program stored on the memory and executable on the processor, the video retrieval model construction program configured to implement the video retrieval model construction method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a video retrieval model construction program which, when executed by a processor, implements the video retrieval model construction method as described above.
The method comprises the steps of: screening a first video set and a second video set from a collected video set; constructing a video feature extraction model from the first video set; determining the display videos and query videos of a video library from the second video set; and constructing a video retrieval model from the video feature extraction model, the display videos and the query videos. Building the retrieval model in this way greatly simplifies the preprocessing required during model construction, ensures high model accuracy and improves retrieval speed.
Drawings
FIG. 1 is a schematic diagram of a video retrieval model construction device of a hardware running environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of the video retrieval model construction method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of the video retrieval model construction method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of the video retrieval model construction method according to the present invention;
FIG. 5 is a block diagram of a first embodiment of the video retrieval model construction apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video retrieval model construction device of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the video retrieval model construction apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) Memory or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation on the video retrieval model construction device, and may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a video retrieval model construction program may be included in the memory 1005 as one type of storage medium.
In the video retrieval model construction device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 may be provided in the video retrieval model construction device, which invokes the video retrieval model construction program stored in the memory 1005 through the processor 1001 and executes the video retrieval model construction method provided by the embodiments of the present invention.
The embodiment of the invention provides a method for constructing a video retrieval model, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the method for constructing a video retrieval model.
In this embodiment, the method for constructing a video retrieval model includes the following steps:
step S10: and screening the first video set and the second video set from the collected video sets.
In this embodiment, the execution body may be a video retrieval model construction device, which may be an electronic device such as a personal computer or a server, or another device capable of implementing the same or similar functions; this is not limited in this embodiment. In this embodiment and the following embodiments, the video retrieval model construction method of the present invention is described taking the video retrieval model construction device as an example.
It should be noted that, with the development of computer technology, more and more video retrieval methods have been proposed, such as NetVLAD and NeXtVLAD. NetVLAD was originally used to aggregate spatial features in place recognition, and was found to be more effective and faster than conventional temporal models (LSTM/GRU) for aggregating visual and auditory features. One of the main drawbacks of NetVLAD is its high feature dimension, so a large classification model built on such features requires millions of parameters. Inspired by ResNeXt, a new network architecture, NeXtVLAD, was developed. Unlike NetVLAD, the input features are decomposed into a group of relatively low-dimensional vectors with attention before aggregation and encoding. The underlying assumption is that a video frame may contain multiple objects, so decomposing the frame-level features before encoding yields a simpler video representation. The NeXtVLAD model converges faster and is less prone to overfitting.
These retrieval algorithms are essentially models for video retrieval. Building such a model requires sample data, and the final model is constructed from the preprocessed sample data. To guarantee retrieval accuracy, however, these algorithms need sample data of very large scale, and preprocessing that data is complicated, so video retrieval at a later stage takes a long time; if the scale of the sample data is reduced, retrieval efficiency can be guaranteed, but the accuracy of the model drops, which affects the accuracy of video retrieval.
In order to solve the above technical problems, in this embodiment, the video search model is optimized, so that the video search speed when the video search model is applied can be improved while the accuracy of the video search model is ensured.
In a specific implementation, this embodiment first collects a large amount of video data to obtain a video set, and then screens the first video set and the second video set from it. Specifically, the screening of the first and second video sets may be performed according to the duration of the videos, or according to the storage size of the videos.
Further, in this embodiment, the screening process of the first video set and the second video set is described by taking the video duration as an example, and the specific process may be implemented as follows.
In a specific implementation, this embodiment first obtains the video duration of each video in the collected video set, and then divides the videos in the set based on these durations to obtain the first video set and the second video set. The division is based on a preset duration: the duration of each video is compared with the preset duration, and the videos are divided according to the comparison results. The preset duration may be set to 90 s or to another value; the specific setting can be adjusted according to the actual situation, which is not limited in this embodiment.
Further, after the duration of each video is compared with the preset duration, the videos in the set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration can be determined.
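As a minimal illustrative sketch of this screening step (the 90 s threshold and the duration field are assumptions taken from the example above, not a prescribed implementation), the split could be written as follows:

```python
# Illustrative sketch only: split a collected video set into a first set
# (duration > threshold) and a second set (duration <= threshold).
# The 90 s threshold and the `duration` field are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VideoItem:
    path: str
    duration: float  # video duration in seconds


def split_by_duration(videos: List[VideoItem],
                      preset_duration: float = 90.0) -> Tuple[List[VideoItem], List[VideoItem]]:
    """Return (first_set, second_set) according to the duration comparison."""
    first_set = [v for v in videos if v.duration > preset_duration]
    second_set = [v for v in videos if v.duration <= preset_duration]
    return first_set, second_set
```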
Step S20: and constructing a video feature extraction model according to the first video set.
In a specific implementation, in this embodiment, after the first video set is screened from the video sets, a video feature extraction model may be constructed according to the first video set, where the video feature extraction model is used for extracting video features later.
Specifically, when constructing the video feature extraction model in this embodiment, a feature extraction model is first trained by self-supervised contrastive learning on the ImageNet dataset using the MoCo v2 algorithm, which yields a pre-trained model. Further training is then performed on top of the pre-trained model with the VCLR algorithm, finally producing the video feature extraction model.
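As a rough, self-contained sketch of the contrastive objective that underlies this kind of self-supervised training (using simple in-batch negatives rather than MoCo's momentum queue, with illustrative dimensions and temperature), the core loss might look like this:

```python
# Self-contained sketch of an InfoNCE-style contrastive loss, the kind of
# objective used in MoCo v2 / VCLR training. This simplified version uses
# in-batch negatives rather than MoCo's momentum queue; the encoder, batch
# construction and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def info_nce_loss(query_feats: torch.Tensor,
                  key_feats: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """query_feats, key_feats: (N, D) embeddings of two augmented views of the
    same samples; matching rows are positives, all other rows act as negatives."""
    q = F.normalize(query_feats, dim=1)
    k = F.normalize(key_feats, dim=1)
    logits = q @ k.t() / temperature                   # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


# Usage: feats_a / feats_b would come from the backbone applied to two
# augmented views of the same sampled video frames.
feats_a = torch.randn(8, 128)
feats_b = torch.randn(8, 128)
loss = info_nce_loss(feats_a, feats_b)
```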
Step S30: and determining the display video and the query video corresponding to the video library according to the second video set.
In a specific implementation, after the second video set has been screened out, this embodiment constructs the display videos and query videos of the video library from the second video set. The video library in this embodiment is used for video retrieval: after a user obtains a certain video clip, the complete video corresponding to that clip can be queried from the video library.
It should be noted that in this embodiment the display videos represent the videos stored in the video library; the more display videos there are, the more videos are stored in the library and available for retrieval. A query video is a shortened counterpart of a display video, namely a segment of that display video; each display video corresponds to one query video, and the duration of a display video is longer than that of its corresponding query video.
Step S40: and constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
In a specific implementation, after the video feature extraction model is built, it can extract the video features of the video a user wants to retrieve; based on the extracted features, comparison against the display videos and query videos determines the target video in the video library that is similar to the video the user wants to retrieve. In this embodiment, the video feature extraction model together with the display videos and query videos of the video library is taken as the video retrieval model.
It should be noted that, the video feature extraction model constructed in this embodiment not only can extract the video features corresponding to the video that the user wants to search, but also can extract the corresponding video features from the obtained display video and query video, that is, the display video features and the query video features.
In a specific implementation, after the display video features and the query video features are obtained, the feature vector of each video in the video library is constructed by feature fusion. In this embodiment a feature vector of a preset dimension can be constructed, although feature vectors of other dimensions are also possible; the preset dimension may be set to one dimension and can be adjusted according to the actual vector construction requirements, which is not limited in this embodiment. In the same manner, this embodiment can also construct the video feature vector to be searched corresponding to the video to be searched from the video features to be searched.
It should be emphasized that, before the feature vectors of the preset dimension corresponding to the videos in the video library are constructed, this embodiment first performs feature fusion on the display video features and the query video features to obtain the target video features, which are the video features corresponding to the videos in the video library; the feature vector of each video in the library is then constructed from these target video features. In this embodiment the display video feature and the query video feature may be spliced together in a pre-fusion manner to obtain the target video feature, or the two features may be mapped through the same matrix to obtain the target video feature.
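A minimal sketch of the splicing-style pre-fusion described above is given below; the feature dimensions and the random projection standing in for a learned or predefined mapping matrix are illustrative assumptions, not the patent's exact construction:

```python
# Sketch of the "pre-fusion" splicing described above: concatenate the display
# video feature and the query video feature, then map them to one fixed-length
# feature vector. Dimensions and the random projection (a stand-in for a
# learned or predefined mapping matrix) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)


def fuse_features(display_feat: np.ndarray,
                  query_feat: np.ndarray,
                  out_dim: int = 512) -> np.ndarray:
    """Splice the two features and map them to a single fixed-length vector."""
    target_feat = np.concatenate([display_feat, query_feat])       # pre-fusion by splicing
    projection = rng.standard_normal((out_dim, target_feat.size))  # stand-in mapping matrix
    vec = projection @ target_feat
    return vec / (np.linalg.norm(vec) + 1e-12)                     # normalised feature vector


# Usage with illustrative 2048-dimensional backbone features.
library_vector = fuse_features(np.random.rand(2048), np.random.rand(2048))
```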
In this embodiment, the first video set and the second video set are screened from the collected video set; a video feature extraction model is constructed from the first video set; the display videos and query videos of the video library are determined from the second video set; and the video retrieval model is constructed from the video feature extraction model, the display videos and the query videos. Building the retrieval model in this way greatly simplifies the preprocessing required during model construction, ensures high model accuracy and improves retrieval speed.
Referring to fig. 3, fig. 3 is a flowchart of a second embodiment of a video retrieval model construction method according to the present invention.
Based on the above first embodiment, in the method for constructing a video search model of this embodiment, the step S20 specifically includes:
step S201: and constructing a pre-training model according to the preset image data set.
In a specific implementation, the preset image dataset in this embodiment may be an ImageNet dataset, and then the feature extraction model training is performed through the MoCov2 algorithm by self-supervised contrast learning to obtain the pre-training model.
Step S202: and acquiring a plurality of video frames corresponding to the first video set.
Step S203: and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
It should be noted that further training is performed on top of the pre-training model through the VCLR algorithm, finally yielding the video feature extraction model. The video data required for this further training comes from the first video set: specifically, in this embodiment a plurality of video frames may be obtained from the first video set, the obtained video frames are input as training data into the pre-training model, and training with the VCLR algorithm then produces the video feature extraction model.
In a specific implementation, the plurality of video frames corresponding to the first video set may be obtained by frame extraction: an equal number of video frames is extracted from each video in the first video set. In this embodiment the number of extracted video frames may be set to 300; of course, it may also be set to another number according to actual needs, which is not limited in this embodiment.
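As an illustrative sketch (the patent does not prescribe a sampling strategy or library; uniform sampling with OpenCV is an assumption), extracting a fixed number of frames from one video could look like this:

```python
# Illustrative frame-extraction sketch using OpenCV: uniformly sample a fixed
# number of frames (300 in the example above) from one video file. Uniform
# sampling and the use of OpenCV are assumptions, not the patent's prescription.
import cv2
import numpy as np


def extract_frames(video_path: str, num_frames: int = 300):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    indices = np.linspace(0, total - 1, num=min(num_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```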
Further, in this embodiment, the step S30 specifically includes:
step S301: and taking each video in the second video set as a display video corresponding to a video library.
In a specific implementation, after the second video set is screened, in this embodiment, each video in the second video set is used as a display video corresponding to the video library, and one video in the second video set corresponds to one display video of the video library.
Step S302: and screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set.
It should be noted that the duration of every video in the second video set is less than or equal to the preset duration, the preset duration being the threshold used in the first screening. In this embodiment the videos in the second video set are screened a second time, this time against a preset video duration range: specifically, several reference videos whose durations fall within the preset video duration range are screened from the second video set. The preset video duration range may be set to 30 s to 60 s or to another range; the specific setting can be adjusted according to actual requirements, which is not limited in this embodiment.
Step S303: and randomly intercepting video clips from each reference video according to the target duration.
Step S304: and taking the video clips corresponding to the reference videos as query videos of the video library.
In a specific implementation, after the reference videos are obtained, this embodiment intercepts one clip from each reference video. The clip is intercepted at random, meaning that the start and end positions of the clip within the video are not restricted.
Further, although the video clips are intercepted at random, the duration of each clip must equal the target duration. The target duration in this embodiment may be set to 10 s, that is, a clip of 10 s is intercepted at random from each reference video; the target duration may also be set to another value and adjusted according to the actual situation, which is not limited in this embodiment.
It should be noted that the intercepted video clip is the query video.
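A small sketch of this query-video construction step is shown below; the 30-60 s range and the 10 s target duration follow the example values above, while the use of ffmpeg and the (path, duration) input format are illustrative assumptions:

```python
# Sketch of the query-video construction: keep reference videos whose duration
# falls within the preset range (30-60 s here), then cut one random clip of the
# target duration (10 s) from each. The ffmpeg command line and the
# (path, duration) input format are illustrative assumptions.
import random
import subprocess


def make_query_clips(candidate_videos, target_duration=10.0,
                     min_len=30.0, max_len=60.0):
    """candidate_videos: iterable of (file_path, duration_in_seconds) pairs."""
    query_clips = []
    for path, duration in candidate_videos:
        if not (min_len <= duration <= max_len):
            continue                                  # second screening by duration range
        start = random.uniform(0.0, duration - target_duration)
        out_path = path.rsplit(".", 1)[0] + "_query.mp4"
        subprocess.run([
            "ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", path,
            "-t", f"{target_duration:.2f}", "-c", "copy", out_path,
        ], check=True)
        query_clips.append(out_path)
    return query_clips
```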
This embodiment constructs a pre-training model from a preset image dataset, acquires a plurality of video frames corresponding to the first video set, and inputs those video frames into the pre-training model for training to obtain the video feature extraction model; extracting multiple video frames from the first video set improves the accuracy of the constructed video feature model. At the same time, each video in the second video set is taken as a display video of the video library; several reference videos whose durations fall within the preset video duration range are screened from the second video set; a video clip is intercepted at random from each reference video according to the target duration; and the clips corresponding to the reference videos are taken as the query videos of the video library, so that the display videos and query videos of the video library can be constructed accurately, facilitating subsequent video retrieval.
Referring to fig. 4, fig. 4 is a flowchart of a third embodiment of a video retrieval model construction method according to the present invention.
Based on the first embodiment or the second embodiment described above, a third embodiment of a video retrieval model construction method of the present invention is proposed.
Taking the first embodiment as an example, in this embodiment, after the step S40, the method further includes:
step S50: and acquiring the video characteristics to be searched of the video to be searched through the video searching model.
Step S60: and constructing a video feature vector to be searched corresponding to the video to be searched according to the video feature to be searched.
Step S70: inquiring a target video corresponding to the video to be searched from the video library according to the video feature vector to be searched.
In a specific implementation, after the video retrieval model is built, video retrieval can be performed based on the video retrieval model in this embodiment, and a specific process of video retrieval can be implemented as follows.
In a specific implementation, the video feature extraction model constructed from the first video set can extract not only the features of the video a user wants to search for, i.e. the video features to be searched corresponding to the video to be searched, but also the display video features of the display videos and the query video features of the query videos. Based on the video features to be searched, the display video features and the query video features, the target video corresponding to the video to be searched can be found in the video library; the target video in this embodiment is the video in the library that is similar to the video to be searched.
In a specific implementation, after the display video features and the query video features are obtained, feature vectors corresponding to all videos in the video library are constructed in a feature fusion mode, wherein the feature vectors which can be constructed in the embodiment are one-dimensional feature vectors, and of course, feature vectors of other dimensions can be constructed according to actual requirements, and the embodiment is not limited to the feature vectors. In the same manner, in this embodiment, a video feature vector to be searched corresponding to the video to be searched may also be constructed according to the video feature to be searched.
Further, after the feature vector of the video to be searched and the feature vectors corresponding to the videos in the video library are constructed, this embodiment can determine, from the feature vector of the video to be searched and the feature vectors of the videos in the library, the video similarity between the video to be searched and each video in the video library; the larger the similarity, the more similar the two videos are. Specifically, in this embodiment the cosine similarity between the video to be searched and each video in the video library may be calculated from the corresponding feature vectors, and the calculated cosine similarity is used as the video similarity.
Further, after the video similarities are determined, they are sorted and the maximum similarity is determined; each video in the video library corresponds to one similarity value, and the video in the library corresponding to the maximum similarity is the target video.
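A minimal sketch of this similarity-and-ranking step is given below; the use of NumPy, the vector shapes and the video_ids list are illustrative assumptions:

```python
# Minimal sketch of the retrieval step: cosine similarity between the feature
# vector of the video to be searched and every feature vector in the video
# library, then the video with the maximum similarity is taken as the target.
# The vector shapes and the video_ids list are illustrative assumptions.
import numpy as np


def retrieve_target(query_vec: np.ndarray, library_vecs: np.ndarray, video_ids):
    """query_vec: (D,) vector; library_vecs: (N, D) matrix, one row per library video."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    lib = library_vecs / (np.linalg.norm(library_vecs, axis=1, keepdims=True) + 1e-12)
    sims = lib @ q                  # cosine similarity of the query with every library video
    order = np.argsort(-sims)       # sort video similarities in descending order
    best = order[0]                 # index of the maximum similarity
    return video_ids[best], float(sims[best])
```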
The embodiment obtains the video characteristics to be searched of the video to be searched through the video searching model; constructing a video feature vector to be searched corresponding to the video to be searched according to the video feature to be searched; and inquiring the target video corresponding to the video to be searched from the video library according to the video feature vector to be searched, and searching the video similar to the video to be searched from the video library through the feature vector, thereby improving the accuracy of video searching.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a video retrieval model construction program, and the video retrieval model construction program realizes the steps of the video retrieval model construction method when being executed by a processor.
Because the storage medium adopts all the technical schemes of all the embodiments, the storage medium has at least all the beneficial effects brought by the technical schemes of the embodiments, and the description is omitted here.
Referring to fig. 5, fig. 5 is a block diagram showing the construction of a first embodiment of the video retrieval model construction device of the present invention.
As shown in fig. 5, the video retrieval model construction device provided by the embodiment of the invention includes:
and the screening module 10 is used for screening the first video set and the second video set from the collected video sets.
A construction module 20 is configured to construct a video feature extraction model from the first video set.
The creating module 30 is configured to determine, according to the second video set, a presentation video and a query video corresponding to the video library.
And the fusion module 40 is used for constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
In this embodiment, the first video set and the second video set are screened from the collected video set; a video feature extraction model is constructed from the first video set; the display videos and query videos of the video library are determined from the second video set; and the video retrieval model is constructed from the video feature extraction model, the display videos and the query videos. Building the retrieval model in this way greatly simplifies the preprocessing required during model construction, ensures high model accuracy and improves retrieval speed.
In an embodiment, the filtering module 10 is further configured to obtain a video duration of each video in the collected video set; comparing the video duration of each video with a preset duration respectively; and screening the first video set and the second video set from the video sets according to the video duration comparison result.
In an embodiment, the filtering module 10 is further configured to determine, according to the video duration comparison, the videos in the video set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration; and to take the set of videos whose duration is greater than the preset duration as the first video set, and the set of videos whose duration is less than or equal to the preset duration as the second video set.
In an embodiment, the construction module 20 is further configured to construct a pre-training model according to a preset image dataset; acquiring a plurality of video frames corresponding to the first video set; and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
In an embodiment, the construction module 20 is further configured to extract a preset number of video frames from each video in the first video set, so as to obtain a plurality of video frames corresponding to the first video set.
In an embodiment, the creating module 30 is further configured to use each video in the second video set as a presentation video corresponding to a video library; screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set; randomly intercepting video clips from each reference video according to the target duration; and taking the video clips corresponding to the reference videos as query videos of the video library.
In an embodiment, the fusion module 40 is further configured to extract, through the video feature extraction model, a display video feature corresponding to the display video and a query video feature corresponding to the query video; constructing feature vectors corresponding to all videos in a video library according to the display video features and the query video features; and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
In an embodiment, the fusion module 40 is further configured to perform feature fusion on the display video feature and the query video feature to obtain a target video feature; and constructing feature vectors of preset dimensions corresponding to each video in the video library according to the target video features.
In an embodiment, the video retrieval model construction device further includes: a retrieval module;
the retrieval module is used for acquiring the video characteristics to be retrieved of the video to be retrieved through the video retrieval model; constructing a video feature vector to be searched corresponding to the video to be searched according to the video feature to be searched; inquiring a target video corresponding to the video to be searched from the video library according to the video feature vector to be searched.
In an embodiment, the retrieving module is further configured to obtain feature vectors corresponding to each video in the video library; determining video similarity between the video to be searched and each video in the video library according to the video feature vector to be searched and a plurality of feature vectors corresponding to the video library; inquiring a target video corresponding to the video to be retrieved from the video library according to the video similarity.
In an embodiment, the searching module is further configured to sort video similarities between the video to be searched and each video in the video library; and taking the video corresponding to the maximum video similarity in the video library as a target video based on the video similarity sorting result.
The invention discloses A1, a video retrieval model construction method, which comprises the following steps:
screening a first video set and a second video set from the collected video sets;
constructing a video feature extraction model according to the first video set;
determining display videos and query videos corresponding to a video library according to the second video set;
and constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
A2, the method for constructing the video retrieval model according to A1, wherein the step of screening the first video set and the second video set from the collected video sets comprises the following steps:
acquiring video duration of each video in the acquired video set;
comparing the video duration of each video with a preset duration respectively;
and screening the first video set and the second video set from the video sets according to the video duration comparison result.
A3, the method for constructing the video retrieval model according to A2, wherein the step of screening the first video set and the second video set from the video sets according to the video duration comparison result comprises the following steps:
determining, according to the duration comparison, the videos in the video set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration;
and taking the set of videos whose duration is greater than the preset duration as the first video set, and the set of videos whose duration is less than or equal to the preset duration as the second video set.
A4, constructing a video feature extraction model according to the first video set by the video retrieval model construction method of A1, including:
constructing a pre-training model according to a preset image data set;
acquiring a plurality of video frames corresponding to the first video set;
and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
A5, the method for constructing a video retrieval model according to A4, wherein the step of obtaining a plurality of video frames corresponding to the first video set includes:
and respectively extracting a preset number of video frames from each video in the first video set to obtain a plurality of video frames corresponding to the first video set.
A6, the method for constructing the video retrieval model according to A1, wherein the step of determining the display video and the query video corresponding to the video library according to the second video set comprises the following steps:
taking each video in the second video set as a display video corresponding to a video library;
screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set;
randomly intercepting video clips from each reference video according to the target duration;
and taking the video clips corresponding to the reference videos as query videos of the video library.
A7, the method for constructing a video retrieval model according to A1, wherein the method for constructing a video retrieval model according to the video feature extraction model, the display video and the query video comprises the following steps:
extracting display video features corresponding to the display video and query video features corresponding to the query video through the video feature extraction model;
constructing feature vectors corresponding to all videos in a video library according to the display video features and the query video features;
and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
A8, constructing a video retrieval model according to the method of A7, wherein the constructing feature vectors corresponding to each video in a video library according to the display video features and the query video features comprises the following steps:
performing feature fusion on the display video features and the query video features to obtain target video features;
and constructing feature vectors of preset dimensions corresponding to each video in the video library according to the target video features.
A9, the method for constructing a video retrieval model according to any one of A1 to A8, wherein after the video retrieval model is constructed according to the video feature extraction model, the presentation video and the query video, the method further comprises:
acquiring the video characteristics to be searched of the video to be searched through the video searching model;
constructing a video feature vector to be searched corresponding to the video to be searched according to the video feature to be searched;
inquiring a target video corresponding to the video to be searched from the video library according to the video feature vector to be searched.
A10, a video retrieval model construction method as set forth in A9, wherein the querying the target video corresponding to the video to be retrieved from the video library according to the video feature vector to be retrieved includes:
acquiring feature vectors corresponding to all videos in the video library;
determining video similarity between the video to be searched and each video in the video library according to the video feature vector to be searched and a plurality of feature vectors corresponding to the video library;
inquiring a target video corresponding to the video to be retrieved from the video library according to the video similarity.
A11, a video retrieval model construction method as set forth in A10, wherein the querying the target video corresponding to the video to be retrieved from the video library according to the video similarity includes:
ordering the video similarity between the video to be retrieved and each video in the video library;
and taking the video corresponding to the maximum video similarity in the video library as a target video based on the video similarity sorting result.
The invention also discloses a B12 and a video retrieval model construction device, wherein the video retrieval model construction device comprises:
the screening module is used for screening the first video set and the second video set from the collected video sets;
the construction module is used for constructing a video feature extraction model according to the first video set;
the creation module is used for determining a display video and a query video corresponding to the video library according to the second video set;
and the fusion module is used for constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
B13, the video retrieval model construction device as described in the B12, wherein the screening module is further used for obtaining video duration of each video in the collected video set; comparing the video duration of each video with a preset duration respectively; and screening the first video set and the second video set from the video sets according to the video duration comparison result.
B14, the video retrieval model construction device as described in B13, wherein the screening module is further used for determining, according to the video duration comparison, the videos in the video set whose duration is greater than the preset duration and the videos whose duration is less than or equal to the preset duration; and taking the set of videos whose duration is greater than the preset duration as the first video set, and the set of videos whose duration is less than or equal to the preset duration as the second video set.
B15, the video retrieval model construction device as described in the B12, wherein the construction module is further used for constructing a pre-training model according to a preset image dataset; acquiring a plurality of video frames corresponding to the first video set; and inputting a plurality of video frames into the pre-training model for training to obtain a video feature extraction model.
B16, the video search model construction device of B15, where the construction module is further configured to extract a preset number of video frames from each video in the first video set, to obtain a plurality of video frames corresponding to the first video set.
B17, the video retrieval model construction device as described in B12, wherein the creation module is further configured to use each video in the second video set as a display video corresponding to a video library; screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set; randomly intercepting video clips from each reference video according to the target duration; and taking the video clips corresponding to the reference videos as query videos of the video library.
B18, the video retrieval model construction device as described in B12, wherein the construction module is further configured to extract, through the video feature extraction model, a display video feature corresponding to the display video and a query video feature corresponding to the query video; constructing feature vectors corresponding to all videos in a video library according to the display video features and the query video features; and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
The invention also discloses C19, a video retrieval model construction device, which is characterized in that the video retrieval model construction device comprises: a memory, a processor, and a video retrieval model construction program stored on the memory and executable on the processor, the video retrieval model construction program configured to implement the video retrieval model construction method as described above.
The invention also discloses D20 and a storage medium, which is characterized in that the storage medium is stored with a video retrieval model construction program, and the video retrieval model construction program realizes the video retrieval model construction method when being executed by a processor.
It should be understood that the foregoing is illustrative only and is not limiting; in specific applications, a person skilled in the art may make settings as needed, and the invention is not limited thereto.
It should be noted that the above-described working procedure is merely illustrative and does not limit the scope of the present invention; in practical applications, a person skilled in the art may select some or all of the steps according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, for technical details not described in detail in this embodiment, reference may be made to the video retrieval model construction method provided in any embodiment of the present invention, which is not repeated here.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the disclosure herein, or any direct or indirect application in other related technical fields, is likewise included within the scope of the invention.

Claims (10)

1. A method for constructing a video retrieval model, characterized by comprising the following steps:
screening a first video set and a second video set from a collected video set;
constructing a video feature extraction model according to the first video set;
determining display videos and query videos corresponding to a video library according to the second video set;
and constructing a video retrieval model according to the video feature extraction model, the display video and the query video.
2. The method for constructing a video retrieval model according to claim 1, wherein the step of screening the first video set and the second video set from the collected video set comprises:
acquiring the video duration of each video in the collected video set;
comparing the video duration of each video with a preset duration respectively;
and screening the first video set and the second video set from the video set according to the video duration comparison result.
3. The method for constructing a video retrieval model according to claim 2, wherein the step of screening the first video set and the second video set from the video set according to the video duration comparison result comprises:
determining, according to the video duration comparison result, the videos in the video set whose video duration is longer than the preset duration and the videos whose video duration is less than or equal to the preset duration;
and taking the video set formed by the videos whose video duration is longer than the preset duration as the first video set, and the video set formed by the videos whose video duration is less than or equal to the preset duration as the second video set.
4. The method for constructing a video retrieval model according to claim 1, wherein said constructing a video feature extraction model from said first video set comprises:
constructing a pre-training model according to a preset image data set;
acquiring a plurality of video frames corresponding to the first video set;
and inputting the plurality of video frames into the pre-training model for training to obtain the video feature extraction model.
5. The method for constructing a video retrieval model according to claim 4, wherein said acquiring a plurality of video frames corresponding to said first video set comprises:
and respectively extracting a preset number of video frames from each video in the first video set to obtain a plurality of video frames corresponding to the first video set.
6. The method for constructing a video retrieval model according to claim 1, wherein the determining, according to the second video set, the display video and the query video corresponding to the video library comprises:
taking each video in the second video set as a display video corresponding to the video library;
screening a plurality of reference videos with video duration conforming to a preset video duration range from the second video set;
randomly intercepting video clips from each reference video according to the target duration;
and taking the video clips corresponding to the reference videos as query videos of the video library.
7. The method for constructing a video retrieval model according to claim 1, wherein the constructing a video retrieval model according to the video feature extraction model, the display video, and the query video comprises:
extracting display video features corresponding to the display video and query video features corresponding to the query video through the video feature extraction model;
constructing feature vectors corresponding to all videos in the video library according to the display video features and the query video features;
and constructing a video retrieval model according to the video feature extraction model and feature vectors corresponding to each video in the video library.
8. A video retrieval model construction apparatus, characterized in that the video retrieval model construction apparatus comprises:
a screening module, configured to screen a first video set and a second video set from a collected video set;
a construction module, configured to construct a video feature extraction model according to the first video set;
a creation module, configured to determine a display video and a query video corresponding to a video library according to the second video set;
and a fusion module, configured to construct a video retrieval model according to the video feature extraction model, the display video, and the query video.
9. A video retrieval model construction apparatus, characterized in that the video retrieval model construction apparatus comprises: a memory, a processor, and a video retrieval model construction program stored on the memory and executable on the processor, the video retrieval model construction program configured to implement the video retrieval model construction method of any one of claims 1 to 7.
10. A storage medium having stored thereon a video retrieval model construction program which, when executed by a processor, implements the video retrieval model construction method of any one of claims 1 to 7.
CN202111618713.0A 2021-12-27 2021-12-27 Video retrieval model construction method, device, equipment and storage medium Pending CN116361507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111618713.0A CN116361507A (en) 2021-12-27 2021-12-27 Video retrieval model construction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111618713.0A CN116361507A (en) 2021-12-27 2021-12-27 Video retrieval model construction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116361507A true CN116361507A (en) 2023-06-30

Family

ID=86937991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111618713.0A Pending CN116361507A (en) 2021-12-27 2021-12-27 Video retrieval model construction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116361507A (en)

Similar Documents

Publication Publication Date Title
CN114416927B (en) Intelligent question-answering method, device, equipment and storage medium
JP7343568B2 (en) Identifying and applying hyperparameters for machine learning
CN108108499B (en) Face retrieval method, device, storage medium and equipment
US8849030B2 (en) Image retrieval using spatial bag-of-features
CN110866491B (en) Target retrieval method, apparatus, computer-readable storage medium, and computer device
WO2021143267A1 (en) Image detection-based fine-grained classification model processing method, and related devices
CN108021708B (en) Content recommendation method and device and computer readable storage medium
CN110880006B (en) User classification method, apparatus, computer device and storage medium
CN107291825A (en) With the search method and system of money commodity in a kind of video
CN109086386B (en) Data processing method, device, computer equipment and storage medium
CN111078849A (en) Method and apparatus for outputting information
CN114049463A (en) Binary tree data gridding and grid point data obtaining method and device
TW202201249A (en) Device, method and program of generating two-dimensional mapping wherein the device includes a two-dimensional processing unit and a two-dimensional mapping generation unit
CN111191065A (en) Homologous image determining method and device
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN107622048B (en) Text mode recognition method and system
CN116361507A (en) Video retrieval model construction method, device, equipment and storage medium
CN113407702B (en) Employee cooperation relationship intensity quantization method, system, computer and storage medium
CN115082999A (en) Group photo image person analysis method and device, computer equipment and storage medium
CN111797765B (en) Image processing method, device, server and storage medium
CN114647785A (en) Short video praise quantity prediction method based on emotion analysis
CN114821140A (en) Image clustering method based on Manhattan distance, terminal device and storage medium
CN118332008A (en) Answer screening method, device, computer equipment and storage medium
CN111563159B (en) Text sorting method and device
CN113392124B (en) Structured language-based data query method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination