CN111639230B - Similar video screening method, device, equipment and storage medium - Google Patents

Similar video screening method, device, equipment and storage medium

Info

Publication number
CN111639230B
CN111639230B
Authority
CN
China
Prior art keywords
video
similar
library
frame
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010478656.XA
Other languages
Chinese (zh)
Other versions
CN111639230A (en)
Inventor
罗雄文
刘振强
卢江虎
项伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Pte Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd filed Critical Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202010478656.XA priority Critical patent/CN111639230B/en
Publication of CN111639230A publication Critical patent/CN111639230A/en
Application granted granted Critical
Publication of CN111639230B publication Critical patent/CN111639230B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a similar video screening method, device, equipment and storage medium. The method comprises the following steps: constructing, according to the video-level features of each video to be selected in a video library to be selected, a similar video library of each video to be selected under a specified similar scale; searching the similar video library of each video to be selected for similar videos whose similarity exceeds a preset similarity threshold, so as to obtain a corresponding video pair candidate library; and screening the corresponding similar video pairs out of the video pair candidate library based on the frame-level features, extracted under the corresponding similarity weights, of the two videos of each candidate video pair in the candidate library. With the technical solution provided by the embodiments of the invention, the dual-screening scheme improves the comprehensiveness and accuracy of similar video screening while limiting the amount of frame-level feature extraction computation for the two videos of each pair, thereby reducing the cost of frame-level feature extraction and improving the efficiency of similar video screening.

Description

Similar video screening method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of video processing, in particular to a method, a device, equipment and a storage medium for screening similar videos.
Background
With the rapid development of the short-video and live-streaming industry, the volume of online video keeps growing, and the number of short videos added per day can reach tens of millions. Newly added videos with repeated content introduce a large amount of redundancy into content review and also pose a great risk to the copyright protection of video creators. Newly added videos therefore need to be deduplicated by screening out similar videos with repeated content, so as to keep video review efficient and safeguard the copyright of video creators.
At present, the similarity between different videos is generally calculated by extracting features of each video at video-level granularity or picture-level granularity and then comparing the videos using those features. For features at video-level granularity, each video only needs one feature extraction step: the features of individual video frames obtained under sparse sampling are fused and reduced in dimension to yield the video-level features. As a result, the feature dimension at video-level granularity is low and cannot accurately represent the similarity between different videos. For features at picture-level granularity, the spatial details of the video within individual frames are taken into account, but the feature extraction step must be executed for a large number of video frames, so the amount of feature extraction computation is excessive; moreover, the temporal association between different video frames of a video is split apart, which greatly reduces the efficiency of similarity calculation between different videos.
Disclosure of Invention
The embodiments of the invention provide a method, a device, equipment and a storage medium for screening similar videos, which judge the similarity between different videos to be selected in a video library by combining video-level features and frame-level features, thereby improving the comprehensiveness and accuracy of similar video screening.
In a first aspect, an embodiment of the present invention provides a method for screening similar videos, where the method includes:
according to video level characteristics of each video to be selected in the video library to be selected, constructing a similar video library of each video to be selected under a specified similar scale;
searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video library of each video to be selected respectively to obtain a corresponding video pair candidate library;
and based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight, selecting the corresponding similar video pair from the video pair candidate library.
In a second aspect, an embodiment of the present invention provides a screening apparatus for similar videos, including:
the similarity library construction module is used for constructing a similarity video library of each video to be selected under a specified similarity scale according to the video level characteristics of each video to be selected in the video library to be selected;
The candidate library generation module is used for respectively searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected to obtain corresponding video pair candidate libraries;
and the similar video screening module is used for screening the corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
In a third aspect, an embodiment of the present invention provides an apparatus, including:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for screening similar videos according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the program when executed by a processor implements the method for screening similar videos according to any embodiment of the present invention.
According to the method, device, equipment and storage medium for screening similar videos, a similar video library of each video to be selected under a specified similar scale is first constructed from the video-level features of each video to be selected in the video library to be selected, which performs the first screening of similar videos using the specified similar scale; then similar videos whose similarity to the video to be selected exceeds a preset similarity threshold are searched out of the similar video library of each video to be selected, and each video to be selected together with the similar videos found for it forms the corresponding video pair candidate library, which performs the second screening of similar videos within the similar video library of each video to be selected using the preset similarity threshold. This dual screening reduces the number of candidate video pairs in the candidate library. Finally, the corresponding similar video pairs are screened out of the video pair candidate library based on the frame-level features of the two videos of each candidate video pair extracted under the corresponding similarity weights. In this way, the similarity between different videos to be selected is judged under the combination of video-level features and frame-level features, which improves the comprehensiveness and accuracy of similar video screening; at the same time, frame-level features only need to be extracted for the two videos of each candidate video pair rather than for every video in the library, so the amount of frame-level feature extraction computation is controlled and reduced, and the efficiency of similar video screening is improved.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
fig. 1A is a flowchart of a method for screening similar videos according to a first embodiment of the present invention;
fig. 1B is a schematic diagram of a screening process of similar videos according to an embodiment of the invention;
fig. 2A is a flowchart of a method for screening similar videos according to a second embodiment of the present invention;
fig. 2B is a schematic diagram of a video level feature extraction process in the method according to the second embodiment of the present invention;
fig. 2C is a schematic structural diagram of a depth separation residual network in the method according to the second embodiment of the present invention;
fig. 2D is a schematic diagram of a convolution operation performed by a depth separation residual network in the method according to the second embodiment of the present invention;
fig. 2E is a schematic structural diagram of a space-time separation residual network in the method according to the second embodiment of the present invention;
fig. 2F is a schematic structural diagram of each space-time convolution layer in a space-time separation residual network in the method according to the second embodiment of the present invention;
fig. 3A is a flowchart of a method for screening similar videos according to a third embodiment of the present invention;
fig. 3B is a schematic diagram of an extraction process of frame-level features in the method according to the third embodiment of the present invention;
fig. 3C is a schematic structural diagram of a multi-scale attention residual network in the method according to the third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a screening device for similar videos according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Fig. 1A is a flowchart of a method for screening similar videos according to an embodiment of the present invention, which is applicable to any case of screening similar videos from a large volume of video data. The method for screening the similar videos provided by the embodiment of the invention can be implemented by the device for screening the similar videos provided by the embodiment of the invention, the device can be implemented in a software and/or hardware mode, and the device is integrated in a device for implementing the method, and the device can be a background server specially responsible for storing the uploaded video data.
Specifically, referring to fig. 1A, the method may include the steps of:
s110, constructing a similar video library of each video to be selected under a specified similar scale according to video level characteristics of each video to be selected in the video library to be selected.
In this embodiment, the video library to be selected refers to a video set that currently contains a large amount of video data and whose videos need similarity analysis, for example the set of online videos newly added each day at a video server, from which reviewers need to identify non-compliant videos and take them offline. The video-level features of a video to be selected are features extracted at whole-video granularity that can represent its picture content; feature extraction only needs to be performed once per video, so the feature dimension is low, and the video-level features can only represent part of the picture characteristics of the video to be selected and cannot comprehensively represent its picture details in different video frames.
Specifically, the video library to be selected for similar video screening is obtained first, and video-level features are then extracted for each video to be selected in it; in this embodiment, any existing video-level feature extraction network may be used. The video-level features of the videos to be selected are then used to analyse the similarity between every two videos in the library. A specified similar scale is preset in this embodiment, and it may be an upper limit, obtained from empirical analysis of past similar video screening, on the number of similar videos that guarantees the real similar videos of each video to be selected are covered. For each video to be selected, the other videos in the library are ranked by their similarity to it, the other videos ranked within the specified similar scale are found, and the videos falling outside the specified similar scale are directly filtered out as dissimilar videos, which performs the first screening of similar videos in the library to be selected; the found videos then form the similar video library of that video to be selected under the specified similar scale, which preliminarily filters out dissimilar videos. The same steps are executed for every video to be selected, so that the similar video library of each video to be selected under the specified similar scale is generated; videos with higher similarity can subsequently be searched for within each similar video library, which improves the accuracy of similar video screening.
After extracting the video level features of each video to be selected in the video library, the embodiment establishes a hash index corresponding to the video level features of each video to be selected, and establishes a feature distance set of each two videos to be selected in the video library quickly by using the hash index and the set distance measure for representing the similarity between different videos, wherein it is to be noted that the distance measure in the embodiment may be a cosine distance; then, the existing K neighbor searching algorithm is adopted to find out topK other videos to be selected, which have the highest similarity with each video to be selected, in the video library to be selected respectively, topK is the designated similarity scale in the embodiment, and then the topK other videos to be selected found out for each video to be selected are combined to form a similar video library of each video to be selected under the designated similarity scale, so that the first re-filtering of dissimilar videos of each video to be selected under the low similarity is realized, and then similar videos with higher similarity are further found out from the similar video library of each video to be selected, and the accuracy of similar video screening is guaranteed.
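For illustration, the first screening described above can be sketched as follows. The sketch assumes the video-level features are plain vectors and uses a brute-force cosine-similarity ranking in Python/NumPy; the hash index and the particular K-nearest-neighbour search algorithm mentioned above are not reproduced, and the function and parameter names (build_similar_library, top_k) are illustrative rather than taken from the patent.

```python
import numpy as np

def build_similar_library(video_ids, features, top_k=50):
    """Build, for every video to be selected, a similar video library holding the
    top_k most similar other videos under cosine similarity.

    features: (N, D) array of video-level features, one row per video.
    Returns {video_id: [(other_id, similarity), ...]} sorted by decreasing similarity.
    """
    feats = np.array(features, dtype=np.float32)
    feats /= (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)  # L2-normalise rows
    sims = feats @ feats.T                                           # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)                                  # a video is not its own neighbour

    library = {}
    for i, vid in enumerate(video_ids):
        order = np.argsort(-sims[i])[:top_k]                         # topK under the specified similar scale
        library[vid] = [(video_ids[j], float(sims[i, j])) for j in order]
    return library
```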
S120, respectively searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected, and obtaining corresponding video pair candidate libraries.
Optionally, after the similar video library of each video to be selected under the specified similar scale is constructed, because the video level features of the video to be selected cannot comprehensively represent the picture details of the video to be selected under different video frames, the similar video library of each video to be selected constructed by the video level features of each video to be selected may have a video dissimilar to the video to be selected, but all the similar videos of each video to be selected can be covered in the video library to be selected, so that the comprehensiveness of similar video screening is improved; at this time, in order to further ensure accuracy of the similar video screening, the embodiment also needs to perform further similar video screening on the similar video library of each video to be selected, that is, the second screening process of the similar video is performed on the similar video library of each video to be selected, as shown in fig. 1B, a preset similarity threshold is preset for the second screening, where the preset similarity threshold is obtained by the existing video similarity evaluation service according to the historical experience of judging the video similarity by using the video level features, so that whether the videos are similar or not can be accurately distinguished according to the video level features of different videos to be selected; at this time, firstly, calculating the similarity between each video to be selected and each other video to be selected contained in the similar video library of the video to be selected, so as to find out other video to be selected, of which the similarity between the video to be selected and the similar video library of each video to be selected exceeds a preset similarity threshold, as the similar video of the video to be selected under the second screening, further respectively forming corresponding video pairs by each video to be selected and each similar video of the video to be selected under the second screening, adding the video pairs to a pre-established video pair candidate library, and further executing similarity judgment of each candidate video pair by adopting frame-level characteristics for each candidate video pair in the video pair candidate library, thereby ensuring the accuracy of the similar video screening.
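Continuing the sketch above, the second screening that forms the video pair candidate library might look as follows; the threshold value is a placeholder, not a value disclosed in the patent.

```python
def build_candidate_pairs(similar_library, sim_threshold=0.8):
    """Second screening: keep only neighbours whose video-level similarity exceeds
    the preset similarity threshold, and collect the unordered candidate pairs."""
    candidate_pairs = set()
    for vid, neighbours in similar_library.items():
        for other, sim in neighbours:
            if sim > sim_threshold:
                candidate_pairs.add(tuple(sorted((vid, other))))  # avoid (a, b) / (b, a) duplicates
    return candidate_pairs
```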
S130, based on the frame-level characteristics of the two video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight, the corresponding similar video pair is screened out from the video pair candidate library.
Optionally, because each candidate video pair in the candidate library of video pairs is a similar video pair determined by the video level features of the video to be selected, and the video level features cannot comprehensively represent the picture details of the video to be selected under different video frames, there may be video pairs with dissimilar actual picture contents in the candidate library of video pairs; meanwhile, the video pair candidate library is obtained by double screening of similar videos of the video library to be selected, so that the number of candidate video pairs contained in the video pair candidate library is small, and even if the video frames are used as granularity to extract frame-level features of both video sides of the candidate video pair, the embodiment does not excessively increase the operation amount of feature extraction; therefore, in order to further improve the similarity of each candidate video pair in the candidate library, the embodiment further adopts the existing frame-level feature extraction network to further extract the frame-level features of both video sides of the candidate video pair for each candidate video pair in the candidate library, and subsequently adopts the frame-level features of both video sides of the candidate video pair to further judge the similarity between both video sides of the candidate video pair, thereby improving the comprehensiveness and the accuracy of the similar video screening on the basis of ensuring the high efficiency of the similar video screening.
Specifically, considering that the similarity of the video to different candidate video pairs in the candidate library is different, if one video in a certain candidate video pair exists in a plurality of other candidate video pairs, the video will have the similarity with a plurality of other videos in the candidate library, if the similarity between the video and the other videos is higher, only a small number of video frames need to be extracted from the video to extract corresponding frame-level features to judge the similarity between the video and the other videos, but if the similarity between the video and the other videos is lower, only a large number of video frames need to be extracted from the video to extract corresponding frame-level features, and at this time, the frame-level features need to show detailed pictures of the video so as to carefully judge the similarity between the video and the other videos; therefore, in this embodiment, for both video parties of each candidate video pair in the video pair candidate library, a corresponding similarity weight is set, where the similarity weight is used to indicate the number of video frames that need to be extracted when extracting frame-level features, so as to reduce feature extraction in unnecessary video frames and improve feature extraction efficiency in the similar video screening process.
Further, for each candidate video pair in the candidate library, the similarity weights of its two videos are used first to extract a corresponding number of video frames from each of the two videos, and the corresponding frame-level features are then extracted from those video frames. Using the frame-level features of the two videos of each candidate video pair, the similarity within the candidate video pair is recalculated, which guarantees the accuracy of the similarity within each candidate pair. According to the similarity of each candidate video pair calculated from the frame-level features, candidate video pairs whose similarity exceeds a specified similarity threshold are then selected from the video pair candidate library as the similar video pairs finally screened out in this embodiment, as sketched below.
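The patent does not prescribe exactly how the frame-level features of the two videos are compared, so the following sketch assumes one simple scheme, purely to show where the final threshold decision sits: each frame of one video is matched to its most similar frame in the other and the match scores are averaged. The function names and the threshold are illustrative.

```python
import numpy as np

def frame_level_similarity(frames_a, frames_b):
    """frames_a, frames_b: (Fa, D) and (Fb, D) frame-level feature matrices.
    Assumed scheme: mean over frames_a of the best cosine match in frames_b."""
    a = frames_a / (np.linalg.norm(frames_a, axis=1, keepdims=True) + 1e-12)
    b = frames_b / (np.linalg.norm(frames_b, axis=1, keepdims=True) + 1e-12)
    return float((a @ b.T).max(axis=1).mean())

def select_similar_pairs(candidate_pairs, frame_features, final_threshold=0.85):
    """Keep candidate pairs whose frame-level similarity exceeds the specified threshold."""
    return [
        (a, b) for a, b in candidate_pairs
        if frame_level_similarity(frame_features[a], frame_features[b]) > final_threshold
    ]
```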
In addition, in this embodiment, a similar video pair library is pre-established, and is used for storing a large number of similar video pairs screened in the candidate video library, and in order to ensure the comprehensiveness of similar video screening, after constructing a similar video library of each candidate video under a specified similar scale, this embodiment may further include: and submitting each video pair formed by the video to be selected and the similar videos with the similarity exceeding a preset submitting threshold value in a similar video library of the video to be selected to a pre-created similar video pair library.
Specifically, when the video-level features of each video to be selected in the library are computed, a preset submitting threshold is also set for the video-level similarity, and this preset submitting threshold is higher than the preset similarity threshold used in the second screening performed after the similar video library of each video to be selected under the specified similar scale is constructed. If the similarity between a video to be selected and a video in its similar video library exceeds the preset submitting threshold, it indicates that the two videos are very likely to be truly similar, so this embodiment submits that video pair to the pre-created similar video pair library. Since the video pair candidate library, which is formed from video pairs whose similarity exceeds the preset similarity threshold, also contains the video pairs submitted to the similar video pair library, the similar video pair library further needs to be updated according to the screening result after the corresponding similar video pairs are screened out of the video pair candidate library: the similar video pairs screened from the candidate library are stored into the similar video pair library, and for the video pairs that were pre-submitted on the basis of video-level features, the screening result is used to judge whether they are truly similar, so that pairs which were stored in the similar video pair library but are determined to be dissimilar by the frame-level features can be removed, guaranteeing the accuracy of the similar video pair library.
Meanwhile, after the similar video pairs in the video library to be selected are submitted to the similar video pair library, duplicate elimination can be carried out on the video library to be selected according to the similar video pairs in the similar video pair library, and the duplicate eliminated video library to be selected is pushed to the corresponding auditing server, so that the auditing server only needs to audit once for different videos with repeated contents, and redundancy of video auditing is reduced; in addition, the embodiment can push the updated similar video pair library to the copyright checking server side to perform copyright checking on the similar video, so that the copyright safety of a video creator is ensured.
According to the technical solution provided by this embodiment, a similar video library of each video to be selected under a specified similar scale is first constructed from the video-level features of each video to be selected in the video library to be selected, which performs the first screening of similar videos using the specified similar scale; similar videos whose similarity exceeds a preset similarity threshold are then searched out of the similar video library of each video to be selected, and each video to be selected together with the similar videos found for it forms the corresponding video pair candidate library, which performs the second screening of similar videos within the similar video library of each video to be selected using the preset similarity threshold. This dual screening reduces the number of candidate video pairs in the candidate library. The corresponding similar video pairs are then screened out of the video pair candidate library based on the frame-level features of the two videos of each candidate video pair under the corresponding similarity weights. In this way, the similarity between different videos to be selected is judged under the combination of video-level features and frame-level features, which improves the comprehensiveness and accuracy of similar video screening, while frame-level features only need to be extracted for the two videos of each candidate video pair, so the amount of frame-level feature extraction computation is reduced and the efficiency of similar video screening is improved.
Example two
Fig. 2A is a flowchart of a method for screening similar videos according to a second embodiment of the present invention, and fig. 2B is a schematic diagram of a process for extracting video level features in the method according to the second embodiment of the present invention. This embodiment is optimized based on the above embodiment. Specifically, as shown in fig. 2A, the embodiment mainly explains the specific extraction process of the video level feature of each video to be selected in the video library in detail.
Optionally, as shown in fig. 2A, the present embodiment may include the following steps:
s210, extracting a corresponding sparse video frame from each video to be selected in the video library to be selected.
Optionally, for each video to be selected in the video library to be selected, when the video level feature of the video to be selected is extracted, firstly, setting a sparse sampling density, respectively extracting corresponding sparse video frames from each video to be selected by adopting the sampling density, subsequently, extracting the frame features of each sparse video frame, and fusing the frame features of each sparse video frame extracted under each video to be selected to obtain the video level feature of the video to be selected.
S220, according to the time sequence information of the sparse video frames in the video to be selected, the depth separation characteristics of each sparse video frame in the video to be selected are fused, and the two-dimensional spatial characteristics of the video to be selected are obtained.
Optionally, the embodiment may use a corresponding depth separation convolution operation to extract a depth separation feature of each sparse video frame in the video to be selected, where the extraction of the depth separation feature can effectively avoid redundant feature mapping calculation in a common convolution operation, so as to greatly reduce the cost of convolution operation; meanwhile, the sparse video frames have a certain time sequence before and after the video to be selected, according to the time sequence information of the sparse video frames in the video to be selected, the depth separation characteristics of each sparse video frame in the video to be selected can be sequenced in sequence, and further, the depth separation characteristics of each sparse video frame in each characteristic dimension are fused by combining the time sequence information, so that the two-dimensional spatial characteristics of the video to be selected are obtained.
As shown in fig. 2B, in this embodiment, according to the time sequence information of the sparse video frames in the video to be selected, the depth separation feature of each sparse video frame in the video to be selected is fused to obtain the two-dimensional spatial feature of the video to be selected, which may specifically include: and extracting the depth separation characteristic of each sparse video frame in the video to be selected by adopting a pre-constructed depth separation residual error network, and fusing the depth separation characteristics of each sparse video frame under different characteristic dimensions by adopting a preset pseudo-time sequence convolution kernel to obtain the two-dimensional spatial characteristic of the video to be selected.
Specifically, each of the depth separation residual error networks in this embodiment is composed of a single-channel convolution kernel and a full-channel convolution kernel, where the overall structure of the depth separation residual network is shown in fig. 2C, the single-channel convolution kernel may be an m×m convolution kernel under a single channel, and the full-channel convolution kernel may be a 1*1 convolution kernel under multiple channels; after each sparse video frame in the video to be selected is sequentially input into a depth separation residual error network according to time sequence information, convolution operation is continuously carried out on the sparse video frame through each spatial convolution layer (stage) cascaded in the depth separation residual error network, when the convolution operation is carried out on each spatial convolution layer cascaded in the depth separation residual error network, as shown in fig. 2D, conventional M x M convolution operation is split into convolution operation of a single channel convolution kernel and a full channel convolution kernel which are sequentially carried out, under each spatial convolution layer (stage), first, corresponding convolution operation is carried out on feature graphs of a result of the previous spatial convolution layer after the convolution operation is carried out on the sparse video frame under each channel, and in this embodiment, channels can consider corresponding feature dimensions, so that feature graph groups with the same channel number are obtained; then, corresponding convolution operations are respectively executed again on the characteristic image groups under different channels by adopting 1*1 convolution kernels of N full channels, characteristic output results under N channels are obtained, the characteristic output results are sequentially input into a next space convolution layer, and the convolution operations are executed again; at this time, the number of channels to be finally output, that is, the feature dimension of the extracted features, can be controlled by setting the number of full-channel convolution kernels in each spatial convolution layer, so that the parameter number of the depth separation convolution kernels in the embodiment is far smaller than that of the conventional convolution kernels; the depth separation residual error network in the embodiment reduces the feature loss and offset after convolution mapping by using a reverse residual error mechanism (channel expansion is performed before channel compression is performed) and linear transformation output, and removes bypass convolution in a conventional residual error network, so that the feature extraction precision is ensured not to be excessively reduced when the parameter quantity of the depth separation residual error network is reduced.
In this embodiment, after the sparse video frames of the video to be selected are sequentially input to the depth separation residual network, each cascaded spatial convolution layer (stage) performs the depth separation operation with its single-channel convolution kernel and full-channel convolution kernel; after the convolution operations have been applied in succession to a sparse video frame, a feature map group of the frame under multiple channels is output. Feature summarization is then performed on the feature map group of each channel through a channel-by-channel global average pooling (Global Average Pooling, GAP) operation, yielding the depth separation feature of the sparse video frame. A preset pseudo-time-sequence convolution kernel is then used to perform a weighted summation of the depth separation feature values of the sparse video frames in each feature dimension, which reduces the interference of channel noise on the feature values; the preset pseudo-time-sequence convolution kernel may be an F*1 convolution kernel, where F is the number of sparse video frames in the video to be selected. The depth separation features of the sparse video frames are thereby fused to obtain the two-dimensional spatial feature of the video to be selected, and the accuracy of feature extraction for the video to be selected is ensured by the depth separation feature extraction operation.
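The following PyTorch sketch illustrates, under assumed layer sizes, the two ideas in this step: a depth-separation block built from a single-channel (depthwise) M*M convolution and a full-channel 1*1 convolution with the reverse-residual (expand-then-compress), linear-output structure described above, and a pseudo-time-sequence F*1 convolution that fuses the F per-frame depth separation features into one two-dimensional spatial feature. It is a simplified illustration, not the patent's exact network.

```python
import torch
import torch.nn as nn

class DepthSeparationBlock(nn.Module):
    """Reverse-residual block: expand channels, depthwise MxM conv, linear 1x1 compression."""
    def __init__(self, in_ch, out_ch, kernel=3, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # single-channel (depthwise) MxM convolution: one kernel per channel
            nn.Conv2d(mid, mid, kernel, padding=kernel // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # full-channel 1x1 convolution controls the output channel number; linear output
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.use_res = in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

class PseudoTemporalFusion(nn.Module):
    """Fuse the depth separation features of F sparse frames with an F x 1 convolution."""
    def __init__(self, num_frames):
        super().__init__()
        self.fuse = nn.Conv1d(1, 1, kernel_size=num_frames)  # weighted sum over the frame axis

    def forward(self, frame_feats):          # frame_feats: (B, F, D), ordered by timestamp
        b, f, d = frame_feats.shape
        x = frame_feats.permute(0, 2, 1).reshape(b * d, 1, f)
        out = self.fuse(x)                   # (B*D, 1, 1): one fused value per feature dimension
        return out.reshape(b, d)             # two-dimensional spatial feature of the video
```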
S230, determining three-dimensional space-time separation characteristics of the video to be selected according to time sequence information of the sparse video frames in the video to be selected and depth separation characteristics of each sparse video frame.
Optionally, after extracting the depth separation feature of each sparse video frame in the video to be selected, the depth separation feature of each sparse video frame has a corresponding feature map, at this time, the feature maps corresponding to the depth separation feature of each sparse video frame can be combined into a three-dimensional feature map set according to the time sequence information of the sparse video frame in the video to be selected, at this time, the three-dimensional feature map set is further added with a time feature on the spatial feature, at this time, the three-dimensional space-time separation feature of the three-dimensional feature map set under the video to be selected is extracted by adopting a corresponding space-time separation convolution operation, at this time, the separation of the spatial convolution and the time convolution avoids the mutual interference of the time feature and the spatial feature.
As shown in fig. 2B, in this embodiment, determining the three-dimensional space-time separation characteristic of the video to be selected according to the time sequence information of the sparse video frame and the depth separation characteristic of each sparse video frame may specifically include: determining an intermediate convolution feature map of each sparse video frame in the video to be selected according to the depth separation feature of each sparse video frame in the video to be selected, and combining the intermediate convolution feature maps of the sparse video frames into a three-dimensional reference feature map set of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected; and extracting the three-dimensional space-time separation characteristics of the three-dimensional reference characteristic image group through a pre-constructed space-time separation residual error network.
The main residual branch of each space-time convolution layer in the space-time separation residual network is composed of a space convolution kernel and a time convolution kernel which are cascaded, and the bypass residual branch of the space-time convolution layer is composed of a space convolution kernel and a time convolution kernel which are connected in parallel, so that the whole structure of the space-time separation residual network is shown in fig. 2E.
Specifically, each spatial convolution kernel cascaded in the depth separation residual error network in this embodiment is adopted to continuously perform convolution operation on each sparse video frame in the video to be selected, when the depth separation characteristic of each sparse video frame is extracted, after the corresponding convolution operation is performed on the last spatial convolution kernel in the depth separation residual error network, an intermediate convolution characteristic image of the sparse video frame is output, according to the time sequence information of the sparse video frame in the video to be selected, the intermediate convolution characteristic images of the sparse video frames are combined into a three-dimensional reference characteristic image group of the video to be selected, the three-dimensional reference characteristic image group carries a time characteristic and a space characteristic, at this time, the three-dimensional reference characteristic image group is used as an input of the space-time separation residual error network in this embodiment, each time space convolution layer of the space-time separation residual error network separates time features and space features, for example, 3 x 3 three-dimensional space convolution kernels are split into 1 x 3 space convolution kernels and 3 x 1 time convolution kernels, space detail information of each middle convolution feature image in the three-dimensional reference feature image group is mapped, and then relation mapping of each middle convolution feature image in time is learned, so that redundant calculation of feature extraction is reduced, mutual interference between the time features and the space features is avoided, about one third of parameter quantity in the space-time separation residual error network can be reduced, and the extraction speed of the three-dimensional space-time separation features is further improved.
In addition, the space-time separation residual error network in this embodiment designs a projection residual error structure shared by cascade connection and parallel connection of a space convolution kernel and a time convolution kernel at the entrance of each space-time convolution layer, learns the space-time characteristics of the three-dimensional reference characteristic image group on the main residual error branch of each space-time convolution layer through the cascade connection space convolution kernel and the time convolution kernel, strengthens the local expression capacity of the three-dimensional reference characteristic image group on the time characteristics and the space characteristics through the parallel connection space convolution kernel and the time convolution kernel on the bypass residual error branch, at this time, the space convolution kernel and the time convolution kernel on the main residual error branch and the bypass residual error branch can downsample the three-dimensional reference characteristic image group, ensures that the characteristics can be continuously summarized, and finally adopts the GAP operation of each channel to carry out characteristic global summary on the characteristics under each channel, thereby extracting the three-dimensional space-time separation characteristics of the three-dimensional reference characteristic image group.
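A condensed sketch of one such space-time convolution layer is shown below, assuming PyTorch and placeholder channel counts and strides: the main residual branch cascades a 1*3*3 spatial kernel and a 3*1*1 temporal kernel, while the bypass residual branch applies a spatial kernel and a temporal kernel in parallel.

```python
import torch
import torch.nn as nn

class SpaceTimeSeparationBlock(nn.Module):
    """Residual block with cascaded spatial + temporal convs on the main branch
    and parallel spatial / temporal convs on the bypass branch."""
    def __init__(self, in_ch, out_ch, stride=(1, 2, 2)):
        super().__init__()
        self.main = nn.Sequential(
            # 1x3x3 spatial convolution: maps the spatial detail of each feature map
            nn.Conv3d(in_ch, out_ch, (1, 3, 3), stride=stride, padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            # 3x1x1 temporal convolution: learns the relation of the feature maps over time
            nn.Conv3d(out_ch, out_ch, (3, 1, 1), padding=(1, 0, 0), bias=False),
            nn.BatchNorm3d(out_ch),
        )
        # bypass branch: spatial and temporal kernels applied in parallel, then summed
        self.bypass_spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), stride=stride,
                                        padding=(0, 1, 1), bias=False)
        self.bypass_temporal = nn.Conv3d(in_ch, out_ch, (3, 1, 1), stride=stride,
                                         padding=(1, 0, 0), bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: (B, C, T, H, W) three-dimensional reference feature maps
        bypass = self.bypass_spatial(x) + self.bypass_temporal(x)
        return self.relu(self.main(x) + bypass)

# A video-level descriptor can then be read out with channel-wise global average pooling,
# e.g. nn.AdaptiveAvgPool3d(1) over (T, H, W).
```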
And S240, splicing the two-dimensional spatial features and the three-dimensional space-time separation features of the video to be selected to obtain video-level features of the video to be selected.
Optionally, after the two-dimensional spatial feature and the three-dimensional space-time separation feature of the video to be selected are obtained, as shown in fig. 2B, the two-dimensional spatial feature and the three-dimensional space-time separation feature of each video to be selected are spliced to obtain the video-level feature of the video to be selected, so that the two-dimensional spatial feature and the three-dimensional space-time separation feature are mixed in the video-level feature, and the accuracy of the video-level feature of the video to be selected is greatly improved.
It should be noted that, considering that the emphasis points of the video-level feature extraction and the frame-level feature extraction are different, different types of training labels and loss functions are adopted for the video-level feature extraction and the frame-level feature extraction; depth separation residual networks and spatio-temporal separation residual networks for video level feature extraction may guide training using cluster-level tags of video, which refer to: each video in the training data set is subjected to multiple data enhancement, all videos (including the original video) obtained by enhancing the same video form a cluster, and the videos in the cluster all use the same digital label; the video obtained by data enhancement is also used for training; at this time, considering that the cluster-level label is used to guide training, the number of categories will be quite huge, so the embodiment uses ArcFace loss combined with cross entropy to improve the bearing capacity of the huge number of categories by properly adding a penalty boundary near the decision plane; when the depth separation residual error network and the space-time separation residual error network under the video-level feature extraction are used, the corresponding loss function calculation layer is removed, and an independent feature extraction network is generated, so that the video-level feature extraction is more accurate.
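The ArcFace-with-cross-entropy idea mentioned above can be sketched as follows; the scale and margin values are placeholders and the cluster-level labels are assumed to be integer cluster ids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Adds an angular penalty margin near the decision boundary, then applies cross entropy.
    scale and margin are placeholder hyper-parameters, not values from the patent."""
    def __init__(self, feat_dim, num_clusters, scale=32.0, margin=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.scale, self.margin = scale, margin

    def forward(self, features, cluster_labels):
        # cosine similarity between normalised features and normalised cluster centres
        cosine = F.linear(F.normalize(features), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cosine)
        target = F.one_hot(cluster_labels, cosine.size(1)).bool()
        # add the angular margin only on the ground-truth cluster logit
        logits = torch.where(target, torch.cos(theta + self.margin), cosine) * self.scale
        return F.cross_entropy(logits, cluster_labels)
```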
S250, constructing a similar video library of each video to be selected under a specified similar scale according to video level characteristics of each video to be selected in the video library to be selected.
And S260, respectively searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected, and obtaining corresponding video pair candidate libraries.
S270, based on the frame-level characteristics of the two video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight, the corresponding similar video pair is screened out from the video pair candidate library.
According to the technical scheme provided by the embodiment, the two-dimensional spatial features of the video to be selected are obtained by carrying out feature extraction and fusion under the depth separation of the sparse video frames extracted from the video to be selected, the extraction efficiency of the two-dimensional spatial features is improved, meanwhile, the three-dimensional spatial and temporal separation features of the video to be selected are obtained by carrying out feature extraction under the space-time separation of the depth separation features of each sparse video frame, the extraction efficiency of the three-dimensional spatial and temporal separation features is improved, and then the two-dimensional spatial features and the three-dimensional spatial and temporal separation features of the video to be selected are spliced to obtain the video-level features of the video to be selected, so that the two-dimensional spatial features and the three-dimensional spatial and temporal separation features are mixed in the video-level features, and the accuracy of the video-level features of the video to be selected is greatly improved on the basis of guaranteeing the feature extraction efficiency.
Example III
Fig. 3A is a flowchart of a method for screening similar videos according to the third embodiment of the present invention, and fig. 3B is a schematic diagram of a frame-level feature extraction process in the method according to the third embodiment of the present invention. This embodiment is optimized based on the above embodiment. Specifically, as shown in fig. 3A, the embodiment mainly explains the specific extraction process of the frame-level features of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight in detail.
Optionally, as shown in fig. 3A, the present embodiment may include the following steps:
s310, constructing a similar video library of each video to be selected under a specified similar scale according to video level characteristics of each video to be selected in the video library to be selected.
S320, respectively searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected, and obtaining corresponding video pair candidate libraries.
S330, aiming at any one video in each candidate video pair in the video pair candidate library, determining the similarity weight of the video according to the similarity of the video in the video pair candidate library and other videos.
Optionally, since each candidate video pair in the video pair candidate library is a similar video pair determined by video level features of each candidate video, the similarity is inaccurate; moreover, the video pair candidate library is obtained by double screening of similar videos in the video library to be selected, and the number of the included candidate video pairs is small, so that the frame-level characteristics of the two video sides of each candidate video pair are required to be adopted to further judge the similarity between the two video sides of the candidate video pair; at this time, considering that the similarity of the video to different candidate video pairs in the candidate library is different, if any one of the candidate video pairs exists in a plurality of other candidate video pairs, the video will have the similarity with a plurality of other videos in the candidate library, at this time, if the similarity between the video and the other videos is higher, only a small number of video frames need to be extracted from the video to extract the corresponding frame-level features to judge the similarity between the video and the other videos, but if the similarity between the video and the other videos is lower, only a large number of video frames need to be extracted from the video to extract the corresponding frame-level features, at this time, the frame-level features need to show detailed detail pictures of the video, so as to carefully judge the similarity between the video and the other videos; therefore, before extracting the frame-level features of the two video sides of each candidate video pair in the candidate library, the similarity weight of the two video sides of each candidate video pair needs to be determined first.
S340, according to the similarity weight of the two video sides of each candidate video pair in the video pair candidate library, determining the frame sampling rate of the two video sides, and respectively extracting target video frames of the two video sides under the corresponding frame sampling rate.
Optionally, after determining the similarity weights of the two video sides of each candidate video pair in the candidate library, the embodiment presets a corresponding maximum frame sampling rate and a minimum frame sampling rate, then selects the most suitable frame sampling rate between the maximum frame sampling rate and the minimum frame sampling rate through the similarity weights of the two video sides of each candidate video pair, and further adopts the frame sampling rates of the two video sides to extract the corresponding target video frames in the two video sides respectively, where the target video frames are used for extracting the corresponding frame-level features subsequently, and at this time, because the frame sampling rates of the two video sides are different, the frame numbers of the target video frames extracted by the two video sides are also different, so that the suitable frame sampling rates are selected respectively according to the actual similarity conditions of the two video sides of each candidate video pair, without setting the same sampling rate for each video, thereby reducing unnecessary feature extraction calculation of the target video frames, and improving the accuracy of the frame-level features on the basis of ensuring the high efficiency of the frame-level features as much as possible.
Illustratively, the frame sampling rate of each video of a candidate video pair may be computed by interpolating between a preset minimum frame sampling rate F min and a preset maximum frame sampling rate F max according to the similarity weight of that video within the candidate video pair, where F min is the preset minimum frame sampling rate, F max is the preset maximum frame sampling rate, and the similarity weight is the weight of the video in each candidate video pair.
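The frame sampling rate computation can be sketched as follows, assuming a linear interpolation between the preset minimum and maximum frame sampling rates driven by a similarity weight in [0, 1]; the exact mapping and the default values are assumptions rather than the patent's formula.

```python
import numpy as np

def frame_sampling_rate(similarity_weight, f_min=1.0, f_max=8.0):
    """Assumed mapping: interpolate between the preset minimum (F min) and maximum
    (F max) frame sampling rates according to the similarity weight in [0, 1]."""
    w = float(np.clip(similarity_weight, 0.0, 1.0))
    return f_min + w * (f_max - f_min)

def sample_target_frames(num_total_frames, fps, rate):
    """Return indices of target video frames when sampling `rate` frames per second."""
    duration = num_total_frames / fps
    num_target = max(int(round(rate * duration)), 1)
    return np.linspace(0, num_total_frames - 1, num_target, dtype=int)
```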
S350, extracting the attention characteristics of each target video frame of the two video parties under the corresponding frame sampling rate under different spatial scales through a pre-constructed multi-scale attention residual error network, and splicing the attention characteristics of the target video frame under each spatial scale to obtain the multi-scale attention characteristics of the target video frame.
Optionally, after extracting target video frames of both video sides in each candidate video pair at the corresponding frame sampling rate, directly inputting the target video frames extracted by both video sides into a pre-constructed multi-scale attention residual error network, as shown in fig. 3B, extracting corresponding attention features of each target video frame in both video sides by using an attention mechanism under different spatial scales set in the multi-scale attention residual error network by using the multi-scale attention residual error network, and splicing the attention features of each target video frame under each spatial scale to obtain the multi-scale attention features of the target video frame.
As illustrated in fig. 3C, the multi-scale attention residual network in this embodiment is configured with an attention weight satisfying a specific spatial probability distribution at each spatial scale. In this case, extracting, through the pre-constructed multi-scale attention residual network, the attention features of each target video frame of both videos at the corresponding frame sampling rate under different spatial scales may specifically include: extracting the frame-level sub-features of each target video frame of both videos at the corresponding frame sampling rate under different spatial scales, and adjusting the frame-level sub-features of the target video frame under each spatial scale with the attention weight of that spatial scale, so as to obtain the attention features of the target video frame under the different spatial scales.
Specifically, the multi-scale attention residual network in this embodiment uses a common Res50 as its backbone, containing 50 layers and 16 bottleneck residual blocks in total. These bottleneck residual blocks are of two main types: a projection residual block with a bypass convolution is used at the entrance of each convolution stage, so that feature summation can be performed progressively on the target video frame, while residual blocks containing only main-branch convolutions are used at the other positions to reduce the number of feature extraction parameters in the network. Meanwhile, in order to reduce unnecessary feature extraction redundancy, the multi-scale attention residual network also greatly reduces the number of channels of each convolution layer by halving the number of convolution kernels, which likewise greatly reduces the feature extraction computation of the whole multi-scale attention residual network.
Further, after the target video frames of both videos of each candidate video pair are input into the multi-scale attention residual network in this embodiment, as shown in fig. 3C, each convolution layer of the multi-scale attention residual network is taken as a corresponding spatial scale and outputs a feature map group at that spatial scale to represent the meta-frame feature at that scale. A 1*1 convolution kernel is then used to compress the number of channels of the feature map group at each spatial scale, which models the deep connections within the feature map group at the current spatial scale while reducing the feature dimensions of the frame-level sub-features of the target video frame at the different spatial scales; to avoid feature loss, a linear transformation is used here in place of the conventional nonlinearly activated Relu function. Next, a channel-by-channel GAP operation performs feature summation over the spatial dimensions of the feature map group at each spatial scale, yielding the frame-level sub-features of the target video frame at the different spatial scales. For the frame-level sub-feature at each spatial scale, a fully connected layer then fits the attention weight at the current spatial scale by feature-by-feature weighted summation. Because the attention weights at different spatial scales are related to one another, their probability distributions should be unified in the same space, so a softmax function is used to unify the probability distributions of the attention weights fitted from the frame-level sub-features at the different spatial scales; this produces, at each scale, the attention weight satisfying a specific spatial probability distribution and avoids inconsistent feature distributions across scales. The attention weight of each spatial scale is then applied, by element-by-element multiplication, to the frame-level sub-feature of the target video frame at that scale, so as to adjust the frame-level sub-features and obtain the attention features of the target video frame at the different spatial scales; finally, the attention features of each target video frame at all spatial scales are spliced to obtain the multi-scale attention features of the target video frame. In this way, the accuracy of the multi-scale attention features of the target video frame is improved through multi-scale spatial information and the attention mechanism.
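The following is a condensed PyTorch sketch of this attention head only. It assumes the backbone already returns one feature-map group per spatial scale (for example, the outputs of several ResNet stages); the class name, channel sizes and compressed dimension are illustrative assumptions rather than the patent's exact configuration.

    import torch
    import torch.nn as nn

    class MultiScaleAttention(nn.Module):
        def __init__(self, in_channels=(256, 512, 1024, 2048), compressed=128):
            super().__init__()
            # 1x1 convolutions compress each scale's channel count; no ReLU is
            # applied afterwards (a linear transform), to avoid feature loss.
            self.compress = nn.ModuleList(
                nn.Conv2d(c, compressed, kernel_size=1) for c in in_channels
            )
            # One fully connected layer per scale fits a scalar attention logit
            # from that scale's frame-level sub-feature.
            self.attn_fc = nn.ModuleList(
                nn.Linear(compressed, 1) for _ in in_channels
            )

        def forward(self, feature_maps):
            # feature_maps: list of tensors [B, C_s, H_s, W_s], one per scale.
            sub_features, logits = [], []
            for fmap, conv, fc in zip(feature_maps, self.compress, self.attn_fc):
                x = conv(fmap)          # channel compression with 1x1 kernel
                x = x.mean(dim=(2, 3))  # channel-by-channel GAP -> [B, compressed]
                sub_features.append(x)
                logits.append(fc(x))    # [B, 1] attention logit for this scale
            # Softmax across scales unifies the attention weights into a single
            # probability distribution per frame.
            weights = torch.softmax(torch.cat(logits, dim=1), dim=1)  # [B, S]
            attended = [
                w.unsqueeze(1) * f      # element-by-element re-weighting
                for f, w in zip(sub_features, weights.unbind(dim=1))
            ]
            # Splice (concatenate) the attended sub-features of all scales.
            return torch.cat(attended, dim=1)  # [B, S * compressed]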
S360, fusing the multi-scale attention features of each target video frame of both videos at the corresponding frame sampling rate to obtain the frame-level features of both videos under the corresponding similarity weight.
Optionally, after the multi-scale attention features of each target video frame extracted from both videos of each candidate video pair are obtained, feature fusion is performed on the multi-scale attention features of the target video frames of each video in every feature dimension, so as to obtain the frame-level features of both videos under the corresponding similarity weight.
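A minimal sketch of this fusion step is given below, assuming mean pooling over the frame axis as the per-dimension fusion operator; the patent does not fix the operator, so this choice is illustrative.

    import torch

    def fuse_frame_features(frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: [num_frames, feature_dim] multi-scale attention
        # features of the target video frames of one video at its sampling rate.
        return frame_features.mean(dim=0)  # fused frame-level feature [feature_dim]

    video_feature = fuse_frame_features(torch.randn(12, 512))  # e.g. 12 sampled frames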
It should be noted that, since a certain video may appear in multiple candidate video pairs, when extracting the frame-level features of both videos of each candidate video pair, if the frame-level feature of one video of the current candidate video pair has already been extracted while processing a previous candidate video pair, the above feature extraction process does not need to be performed again for that video in the current candidate video pair.
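A small sketch of this reuse: frame-level features are cached by video id, so a video appearing in several candidate pairs is processed only once; extract_frame_level_feature is a hypothetical helper standing in for steps S340 to S360.

    feature_cache = {}

    def frame_level_feature(video_id, extract_frame_level_feature):
        # Only run the expensive frame extraction / network forward pass once
        # per video, no matter how many candidate pairs it participates in.
        if video_id not in feature_cache:
            feature_cache[video_id] = extract_frame_level_feature(video_id)
        return feature_cache[video_id]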
And S370, inputting the frame-level characteristics of the two video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight into a pre-constructed three-layer perceptron network to obtain the similarity score of the candidate video pair.
Optionally, after the frame-level features of both videos of each candidate video pair in the video pair candidate library under the corresponding similarity weights are extracted, the frame-level features of both videos of the candidate video pair are directly input into a pre-constructed three-layer perceptron network, as shown in fig. 3B. The three-layer perceptron network analyzes the similarity between the frame-level features of the two videos of each candidate video pair to obtain the similarity score of that candidate video pair, and whether the two videos of the candidate video pair are similar is then judged according to the similarity score.
It should be noted that, in this embodiment, the multi-scale attention residual network for frame-level feature extraction and the three-layer perceptron network are trained together in an end-to-end manner, and finally output one of two results, namely whether the candidate video pair is similar or dissimilar. The training is supervised with pair-level digital labels, each candidate video pair being marked as similar or dissimilar, and because the number of classes is small, ordinary cross entropy is used as the loss function.
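The following is a hedged PyTorch sketch of the three-layer perceptron scorer and one cross-entropy training step. The hidden sizes, the concatenation of the two frame-level feature vectors, and the optimizer are illustrative assumptions; for brevity only the perceptron is optimized here, whereas in the embodiment it is trained end-to-end together with the multi-scale attention residual network.

    import torch
    import torch.nn as nn

    class PairScorer(nn.Module):
        """Three-layer perceptron scoring a candidate pair from the
        concatenated frame-level features of its two videos."""
        def __init__(self, feature_dim, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * feature_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),  # logits for dissimilar / similar
            )

        def forward(self, feat_a, feat_b):
            return self.mlp(torch.cat([feat_a, feat_b], dim=-1))

    scorer = PairScorer(feature_dim=512)
    criterion = nn.CrossEntropyLoss()  # ordinary cross entropy, as noted above
    optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)

    # One training step on a batch of labelled candidate pairs (1 = similar).
    feat_a, feat_b = torch.randn(8, 512), torch.randn(8, 512)
    labels = torch.randint(0, 2, (8,))
    loss = criterion(scorer(feat_a, feat_b), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    # At inference time the similarity score is the probability of "similar".
    score = torch.softmax(scorer(feat_a, feat_b), dim=-1)[:, 1]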
S380, screening the corresponding similar video pairs according to the similarity score of each candidate video pair in the video pair candidate library.
Optionally, candidate video pairs whose similarity scores exceed a preset similarity threshold are selected from the video pair candidate library as the similar video pairs of this embodiment, and are then submitted to the similar video pair library for updating.
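A minimal sketch of this final screening step; the function name and threshold value are illustrative.

    def screen_similar_pairs(scored_pairs, threshold=0.85):
        # scored_pairs: iterable of (video_a, video_b, similarity_score) tuples.
        return [(a, b) for a, b, s in scored_pairs if s > threshold]

    new_similar_pairs = screen_similar_pairs(
        [("v1", "v2", 0.93), ("v1", "v3", 0.41)], threshold=0.85
    )  # -> [("v1", "v2")], then submitted to the similar video pair library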
According to the technical scheme provided by this embodiment, the multi-scale attention features of the target video frames extracted by the multi-scale attention residual network are fused to obtain the frame-level features of both videos under the corresponding similarity weight, which reduces the computation of frame-level feature extraction while ensuring its accuracy; the three-layer perceptron network is then used to judge the similarity between the two videos of each candidate video pair, so that more accurate similar video pairs are screened out from the video pair candidate library and the screening efficiency of similar videos is improved.
Example IV
Fig. 4 is a schematic structural diagram of a similar video screening apparatus according to a fourth embodiment of the present invention, and specifically, as shown in fig. 4, the apparatus may include:
the similarity library construction module 410 is configured to construct a similarity video library of each video to be selected under a specified similarity scale according to video level features of each video to be selected in the video library to be selected;
the candidate library generating module 420 is configured to find out similar videos with similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected, so as to obtain corresponding video pair candidate libraries;
The similar video filtering module 430 is configured to filter out corresponding similar video pairs from the video pair candidate library based on frame-level features of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weights.
According to the technical scheme provided by this embodiment, a similar video library of each video to be selected under a specified similar scale is first constructed from the video-level features of each video to be selected in the video library to be selected, which performs the first screening of similar videos at the specified similar scale. Then, similar videos whose similarity exceeds a preset similarity threshold are further searched out from the similar video library of each video to be selected, and each video to be selected and the similar videos found for it form the video pair candidate library, which performs the second screening of similar videos within the similar video library of each video to be selected; this double screening reduces the number of candidate video pairs in the candidate library. Further, the corresponding similar video pairs are screened out from the video pair candidate library based on the frame-level features of both videos of each candidate video pair under the corresponding similarity weight. In this way, similarity is judged with a combination of the video-level features and the frame-level features of the different videos to be selected in the video library to be selected, which takes account of both the overall similarity of the videos and the detailed similarity of their frames and improves the accuracy of the screened similar video pairs; at the same time, there is no need to extract frame-level features for every video pair in the video library to be selected, so the computation of frame-level feature extraction is reduced and the screening efficiency of similar videos is improved.
The similar video screening device provided by the embodiment can be applied to the similar video screening method provided by any embodiment, and has corresponding functions and beneficial effects.
Example five
Fig. 5 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the device includes a processor 50, a storage device 51 and a communication device 52; the number of processors 50 in the device may be one or more, and one processor 50 is taken as an example in fig. 5; the processor 50, the storage device 51 and the communication device 52 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in fig. 5.
The device provided by the embodiment can be used for executing the similar video screening method provided by any embodiment, and has corresponding functions and beneficial effects.
Example six
The sixth embodiment of the present invention also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the method for screening similar videos in any of the above embodiments.
The method specifically comprises the following steps:
according to video level characteristics of each video to be selected in the video library to be selected, constructing a similar video library of each video to be selected under a specified similar scale;
Searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video library of each video to be selected respectively to obtain a corresponding video pair candidate library;
and screening out the corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the similar video screening method provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the similar video screening apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method for screening similar videos, comprising:
according to video level characteristics of each video to be selected in the video library to be selected, constructing a similar video library of each video to be selected under a specified similar scale;
searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video library of each video to be selected respectively to obtain a corresponding video pair candidate library;
based on the frame-level characteristics of the two video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight, screening out the corresponding similar video pair from the video pair candidate library;
Wherein the similarity weight is used to indicate the number of video frames that need to be extracted when extracting frame-level features.
2. The method of claim 1, further comprising, prior to screening out corresponding similar video pairs from the video pair candidate library:
according to the similarity weight of the two video sides of each candidate video pair in the video pair candidate library, determining the frame sampling rate of the two video sides, and respectively extracting target video frames of the two video sides under the corresponding frame sampling rate;
extracting the attention characteristics of each target video frame of two video parties under the corresponding frame sampling rate under different spatial scales through a pre-constructed multi-scale attention residual error network, and splicing the attention characteristics of the target video frame under each spatial scale to obtain the multi-scale attention characteristics of the target video frame;
and fusing the multi-scale attention features of each target video frame of the two video parties under the corresponding frame sampling rate to obtain the frame-level features of the two video parties under the corresponding similarity weight.
3. The method according to claim 2, wherein the multi-scale attention residual network is configured with an attention weight at each spatial scale that satisfies a specific spatial probability distribution;
Extracting attention characteristics of each target video frame of two video parties under the corresponding frame sampling rate under different spatial scales through a pre-constructed multi-scale attention residual error network, wherein the method comprises the following steps:
extracting frame-level sub-features of each target video frame of two video parties under the corresponding frame sampling rate under different spatial scales, and adjusting the frame-level sub-features of the target video frame under the corresponding spatial scales by adopting the attention weight of each spatial scale to obtain the attention features of the target video frame under different spatial scales.
4. The method of claim 1, further comprising, prior to screening out corresponding similar video pairs from the video pair candidate library:
and determining the similarity weight of the video according to the similarity of the video with other videos in the video pair candidate library aiming at any one of the video pairs in each candidate video pair in the video pair candidate library.
5. The method of claim 1, wherein selecting a corresponding similar video pair from the video pair candidate library based on frame-level features of both video sides of each candidate video pair in the video pair candidate library under a corresponding similarity weight comprises:
Inputting frame-level features of two video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight into a pre-constructed three-layer perceptron network to obtain a similarity score of the candidate video pair;
and screening the corresponding similar video pairs according to the similarity score of each candidate video pair in the video pair candidate library.
6. The method of claim 1, further comprising, prior to constructing the library of similar videos for each of the candidate videos at the specified similar scale:
extracting a corresponding sparse video frame from each video to be selected in the video library to be selected;
according to the time sequence information of the sparse video frames in the video to be selected, fusing the depth separation characteristics of each sparse video frame in the video to be selected to obtain the two-dimensional spatial characteristics of the video to be selected;
determining three-dimensional space-time separation characteristics of the video to be selected according to time sequence information of sparse video frames in the video to be selected and depth separation characteristics of each sparse video frame;
and splicing the two-dimensional spatial features and the three-dimensional space-time separation features of the video to be selected to obtain video-level features of the video to be selected.
7. The method of claim 6, wherein fusing the depth separation feature of each sparse video frame in the candidate video according to the timing information of the sparse video frame in the candidate video to obtain the two-dimensional spatial feature of the candidate video comprises:
Extracting the depth separation characteristic of each sparse video frame in the video to be selected by adopting a pre-constructed depth separation residual error network, and fusing the depth separation characteristic of each sparse video frame under different characteristic dimensions by adopting a preset pseudo-time sequence convolution kernel to obtain the two-dimensional spatial characteristic of the video to be selected, wherein each spatial convolution layer in the depth separation residual error network consists of a single-channel convolution kernel and a full-channel convolution kernel which are cascaded.
8. The method of claim 6, wherein determining three-dimensional spatiotemporal separation characteristics of the video to be selected based on timing information of sparse video frames in the video to be selected and depth separation characteristics of each sparse video frame comprises:
determining an intermediate convolution feature map of each sparse video frame in the video to be selected according to the depth separation feature of each sparse video frame in the video to be selected, and combining the intermediate convolution feature maps of the sparse video frames into a three-dimensional reference feature map set of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected;
and extracting three-dimensional space-time separation characteristics of the three-dimensional reference characteristic image group through a pre-constructed space-time separation residual error network, wherein a main residual error branch of each space-time convolution layer in the space-time separation residual error network consists of a cascade space convolution kernel and a time convolution kernel, and a bypass residual error branch of the space-time convolution layer consists of a space convolution kernel and a time convolution kernel which are connected in parallel.
9. The method of any one of claims 1-8, further comprising, after constructing a library of similar videos for each of the candidate videos at a specified similar scale:
video pairs formed by each video to be selected and similar videos with similarity exceeding a preset submitting threshold value in a similar video library of the video to be selected are submitted to a pre-established similar video pair library;
accordingly, after the corresponding similar video pair is screened from the video pair candidate library, the method further comprises:
and updating the similar video pair library according to the screened similar video pairs.
10. The method of claim 9, further comprising, after updating the library of similar video pairs based on the screened similar video pairs:
and according to the similar video pairs in the similar video pair library, eliminating the duplicate of the to-be-selected video library, and pushing the duplicate-eliminated to-be-selected video library to the corresponding auditing server.
11. A screening apparatus for similar video, comprising:
the similarity library construction module is used for constructing a similarity video library of each video to be selected under a specified similarity scale according to the video level characteristics of each video to be selected in the video library to be selected;
The candidate library generation module is used for respectively searching out similar videos with the similarity exceeding a preset similarity threshold value from the similar video libraries of each video to be selected to obtain corresponding video pair candidate libraries;
the similar video screening module is used for screening corresponding similar video pairs from the video pair candidate library based on frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight; wherein the similarity weight is used to indicate the number of video frames that need to be extracted when extracting frame-level features.
12. A screening apparatus for similar video, the apparatus comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of screening similar videos of any one of claims 1-10.
13. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of screening similar videos according to any one of claims 1-10.
CN202010478656.XA 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium Active CN111639230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478656.XA CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478656.XA CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111639230A CN111639230A (en) 2020-09-08
CN111639230B true CN111639230B (en) 2023-05-30

Family

ID=72331316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478656.XA Active CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111639230B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559784B (en) * 2020-11-02 2023-07-04 浙江智慧视频安防创新中心有限公司 Image classification method and system based on incremental learning
CN112364850B (en) * 2021-01-13 2021-04-06 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
CN115331154B (en) * 2022-10-12 2023-01-24 成都西交智汇大数据科技有限公司 Method, device and equipment for scoring experimental steps and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796088B (en) * 2019-10-30 2023-07-04 行吟信息科技(上海)有限公司 Video similarity judging method and device
CN111046227B (en) * 2019-11-29 2023-04-07 腾讯科技(深圳)有限公司 Video duplicate checking method and device
CN110996123B (en) * 2019-12-18 2022-01-11 广州市百果园信息技术有限公司 Video processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN111639230A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111639230B (en) Similar video screening method, device, equipment and storage medium
CA3066029A1 (en) Image feature acquisition
CN112711705B (en) Public opinion data processing method, equipment and storage medium
CN110489574B (en) Multimedia information recommendation method and device and related equipment
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112001401A (en) Training model and training method of example segmentation network, and example segmentation network
US20240193402A1 (en) Method and apparatus for determining representation information, device, and storage medium
CN114330499A (en) Method, device, equipment, storage medium and program product for training classification model
CN111062431A (en) Image clustering method, image clustering device, electronic device, and storage medium
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN114783021A (en) Intelligent detection method, device, equipment and medium for wearing of mask
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN111079930A (en) Method and device for determining quality parameters of data set and electronic equipment
CN114492755A (en) Target detection model compression method based on knowledge distillation
Jayageetha et al. Medical image quality assessment using CSO based deep neural network
CN111967476B (en) Light field image saliency feature extraction, information fusion and prediction loss evaluation method
CN117688390A (en) Content matching method, apparatus, computer device, storage medium, and program product
CN112613533B (en) Image segmentation quality evaluation network system and method based on ordering constraint
CN111737371B (en) Data flow detection classification method and device capable of dynamically predicting
CN115114462A (en) Model training method and device, multimedia recommendation method and device and storage medium
CN111428741B (en) Network community discovery method and device, electronic equipment and readable storage medium
CN110879952B (en) Video frame sequence processing method and device
CN116049660B (en) Data processing method, apparatus, device, storage medium, and program product
CN117807237B (en) Paper classification method, device, equipment and medium based on multivariate data fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231008

Address after: 31a, 15th floor, building 30, maple commercial city, bangrang Road, Brazil

Patentee after: Baiguoyuan Technology (Singapore) Co.,Ltd.

Address before: 5-13 / F, West Tower, building C, 274 Xingtai Road, Shiqiao street, Panyu District, Guangzhou, Guangdong 510000

Patentee before: GUANGZHOU BAIGUOYUAN INFORMATION TECHNOLOGY Co.,Ltd.