CN111651636B - Video similar segment searching method and device - Google Patents

Video similar segment searching method and device

Info

Publication number
CN111651636B
CN111651636B (application CN202010245319.6A)
Authority
CN
China
Prior art keywords
key frame
video
pair
group
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010245319.6A
Other languages
Chinese (zh)
Other versions
CN111651636A (en)
Inventor
蒋文
田泽康
邓卉
危明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ysten Technology Co ltd
Original Assignee
Ysten Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ysten Technology Co ltd filed Critical Ysten Technology Co ltd
Priority to CN202010245319.6A
Publication of CN111651636A
Application granted
Publication of CN111651636B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video similar segment searching method that addresses the large error and heavy computation caused by considering only the most similar key frame when searching for similar video segments. The method comprises the following steps: extracting key frames to obtain a query video key frame set and a reference video key frame set; extracting features from the key frame images in the query video key frame set and the reference video key frame set to obtain a query video key frame feature set and a reference video key frame feature set; obtaining similar candidate sets; generating a key frame pair set, and screening the key frame pair set by position offset to generate a matched key frame pair set; and sorting the key frame pairs by position offset and merging those whose position offsets differ by less than a threshold to obtain the start key frame pair and the end key frame pair of each similar video segment, from which the video similar segment pairs are obtained from the query video and the reference video. A corresponding apparatus, device and medium are also provided.

Description

Video similar segment searching method and device
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video similar segment searching method, a searching device, a computer readable medium and electronic equipment.
Background
With the rapid development of the internet and the rise of self-media, video production and distribution have become simpler and easier, which has also flooded the network with video. Accordingly, the difficulty of managing video resources for live video platforms and content providers keeps increasing. On the one hand, a large amount of pirated and plagiarized video causes great losses to copyright holders, who need to search for similar videos in order to stop infringement of their rights. On the other hand, after a video has been distributed and stored many times, different versions of the same video may be produced, and the content service provider needs to consolidate these versions to improve management efficiency. In the past, similar videos were searched mainly by hand, but this is extremely inefficient, and as the number of videos has increased sharply in recent years, methods based on video processing have received more and more attention.
Similar video search technology based on video processing obtains a description of video features by processing the video content, and then uses those features to search for similar videos. Most current methods use video features to compute video similarity and complete the search, but they do not consider the specific positions of the similar segments within the videos, while in practice the position information of similar segments is often needed for other video processing applications. Meanwhile, some prior art first searches for similar videos using feature vector distances and then locates the positions of the similar segments with a sliding window over window feature vectors, so the window size and window position must be iterated many times during localization, causing a large amount of repeated computation.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention discloses a video similar segment searching method which uses deep learning features of the image content and a novel key frame matching method to locate similar segments accurately, while also taking search and localization speed into account, making it well suited to engineering practice. Specifically, a first aspect of the embodiment of the present invention provides a video similar segment searching method, comprising the following steps:
S110, acquiring a query video and a reference video, and extracting key frames of the query video and the reference video to acquire a query video key frame set and a reference video key frame set;
S120, extracting features of the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network to obtain a query video key frame feature set and a reference video key frame feature set;
S130, obtaining a similar candidate set for each query video key frame in the query video key frame set from the reference video key frame set by utilizing the similarity or distance values between key frame features in the query video key frame feature set and the reference video key frame feature set;
S140, generating a key frame pair set from the relation between the query video key frames and the similar candidate sets, and screening the key frame pair set by utilizing the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set;
S150, sorting the key frame pairs in the matched key frame pair set based on the position offsets of the key frame pairs, and merging the key frame pairs whose position offsets differ by less than a threshold to obtain the start key frame pair and the end key frame pair of each similar video segment;
S160, acquiring the video similar segment pairs from the query video and the reference video according to the time positions of the start key frame pairs and the end key frame pairs of the video similar segments.
Further, in the step S130, the step of obtaining the similar candidate set of each query video key frame in the query video key frame set from the reference video key frame set includes:
when the similarity is greater than a threshold, selecting a corresponding key frame in the reference video key frame set into the similar candidate set;
or,
establishing a first K-dimensional tree by using N query video key frame features with K dimensions in a query video key frame feature set as data points;
establishing a second K-dimensional tree by taking M reference video key frame features with K dimensions in the reference video key frame feature set as data points;
and searching, using the Euclidean distance, the key frames corresponding to a specified number of nearest neighbor data points of each point of the first K-dimensional tree in the second K-dimensional tree, and selecting these key frames into the similar candidate set, wherein the dimension K is the dimension of the key frame features.
Further, the step S140 includes:
S141, forming key frame pairs from the key frames in the query video key frame set and the key frames in the corresponding similar candidate sets, and acquiring, for each key frame pair, the position offset between the key frame from the query video key frame set and the key frame from the similar candidate set as the position offset of the key frame pair, wherein the key frame pairs form the key frame pair set;
S142, grouping by the position offset of the key frame pairs, placing the key frame pairs in the key frame pair set with the same position offset into the same group to form key frame pair groups, wherein the key frame pair groups form a key frame pair group set;
S143, sorting the key frame pair group set according to the number of key frame pairs in each key frame pair group to form a key frame pair group sorting set;
S144, traversing the key frame pair group sorting set and filtering it so that, for each key frame of the query video key frame set, only one key frame pair containing it is retained, obtaining a key frame pair group filter set, wherein the key frame pair group filter set serves as the matched key frame pair set.
Further, the step of obtaining the key frame pair group filter set includes:
S1441, establishing a query video key frame usage record table Tab_q and a key frame pair group filter set G′, both initialized to be empty;
S1442, traversing all key frame pair groups in the key frame pair group sorting set: for any group g_k, traversing all key frame pairs p_ij contained therein; for any key frame pair p_ij, if the corresponding query video key frame is not contained in the usage record table Tab_q, adding it to the table, otherwise deleting the key frame pair from g_k;
S1443, after the traversal of a group g_k ends, if g_k is not empty, adding it to G′;
S1444, after all groups have been traversed, obtaining the key frame pair group filter set G′ = {g′_1, ..., g′_m}, wherein m represents the number of remaining groups.
Further, the step S143 includes:
and sorting the key frame pair groups in descending order of the number of key frame pairs they contain to form the key frame pair group sorting set.
Further, the step S150 includes:
S151, sorting the key frame pair group filter set G′ from small to large by offset to obtain a key frame pair group filter-sorting set G* = {g*_1, ..., g*_S}, wherein the offset of each group is denoted o*_k;
S152, establishing a key frame group list Ĝ, initialized to be empty;
S153, for the first group g*_1, adding it to the list Ĝ and taking it as the current group of the list Ĝ, while setting its offset o*_1 as the current offset o_c;
S154, taking any group g*_k remaining in the key frame pair group filter-sorting set, and comparing the difference between the group offset o*_k and the current offset o_c: if it is smaller than the preset threshold o_th, merging the group into the current group of list Ĝ; otherwise, adding it to list Ĝ and taking it as the current group of the list Ĝ; then deleting the group g*_k from G*;
S155, setting the offset o*_k as the current offset o_c, and repeatedly executing step S154 until G* is empty;
S156, outputting G⁺ = {(g_1start, g_1end)_1, ..., (g_nstart, g_nend)_n, ..., (g_Tstart, g_Tend)_T}, wherein T is the number of similar segments, g_nstart represents the start key frame pair of the n-th similar segment, and g_nend represents the end key frame pair of the n-th similar segment.
Further, the deep learning network includes any one of VGG, DenseNet, UNet, ResNet, and MobileFaceNet.
In a second aspect of the present invention, there is provided a video similar segment searching apparatus, including:
the frame extraction module is used for acquiring a query video and a reference video, extracting key frames of the query video and the reference video, and acquiring a query video key frame set and a reference video key frame set;
The deep learning network module is used for extracting features of the key frame images in the query video key frame set and the reference video key frame set by utilizing a deep learning network to obtain the query video key frame feature set and the reference video key frame feature set;
the similarity candidate set generation module is used for acquiring a similarity candidate set of each query video key frame in the query video key frame set from the reference video key frame set by utilizing a similarity or distance value between the key frame features in the query video key frame feature set and the reference video key frame feature set;
the matched key frame pair set generation module is used for generating a key frame pair set from the relation between the query video key frames and the similar candidate sets, and screening the key frame pair set by utilizing the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set;
the similar segment start key frame pair acquisition module is used for sorting the key frame pairs in the matched key frame pair set based on the position offsets of the key frame pairs, and then merging the key frame pairs whose position offsets differ by less than a threshold to acquire the start key frame pair and the end key frame pair of each similar video segment;
And the similar segment pair acquisition module is used for acquiring the video similar segment pair from the query video and the reference video according to the time positions of the start key frame pair and the end key frame pair of the video similar segment.
In a third aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods of any of the above.
In a fourth aspect of the invention, a computer readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements any of the methods described above.
According to the embodiment of the invention, the similarity set is screened through deep learning features of the images, one-to-one key frame pairs are then obtained by sorting and filtering, and the key frame pairs whose position offsets differ by less than the threshold are merged to obtain the similar segments; the computational complexity is low, and the similar segments can be obtained quickly. When searching for key frame pairs that match between the query video and the reference video, the key frame matching method adopted in the embodiment of the invention does not simply look for the key frame pair with the highest similarity. Instead, it first finds the top-k nearest neighbors of each query video key frame according to similarity, and then determines the nearest neighbor of each key frame by considering the video content as a whole. In this way the image continuity within the video is fully utilized, and the key frame matching accuracy, and therefore the similar segment localization accuracy, can be significantly improved.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is a schematic diagram of a system architecture in which the video similar segment searching method and searching apparatus according to some examples of the present invention operate;
FIG. 2 is a flow chart of a method for searching for similar video clips in some examples of the invention;
FIG. 3 is a schematic diagram of an algorithm module of a video similar segment search method according to some embodiments of the invention;
FIG. 4 is a schematic diagram of a class boundary for softmax and arcface loss in a deep learning network for a video similar segment search method in some embodiments of the invention;
FIG. 5 is a schematic diagram of results of forming a similarity set in a video similarity segment search method according to some embodiments of the present invention;
FIG. 6 is a schematic diagram of results of matching key frame pairs formed in a video similar segment search method according to some embodiments of the present invention;
FIG. 7 is a diagram illustrating the results of forming similar segment pairs in a video similar segment searching method according to some embodiments of the present invention;
FIG. 8 is a flowchart illustrating a method for searching video similar segments according to some embodiments of the present invention;
FIG. 9 is a flowchart illustrating a method for forming a similarity set in video similarity clip search according to some embodiments of the present invention;
FIG. 10 is a flowchart illustrating a method for filtering a similarity set in a video similarity clip search according to some embodiments of the present application;
FIG. 11 is a flow chart of a method for merging keyframes in video similar segment search according to some embodiments of the application;
FIG. 12 is a system diagram of a video similar segment searching apparatus implemented based on the video similar segment searching method in the above-mentioned figures according to some embodiments of the present application;
FIG. 13 is a schematic diagram of a computer system in which a method for searching or extracting similar segments of video according to some embodiments of the present application operates.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below.
FIG. 1 illustrates an exemplary system architecture 100 of an embodiment of a video similar segment search method or video similar segment search apparatus to which embodiments of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit data (e.g., video) or the like. Various communication client applications, such as video playing software, video processing class applications, web browser applications, shopping class applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting data transmission, including but not limited to smartphones, tablet computers, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present application is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for videos displayed on the terminal devices 101, 102, 103. The background server may analyze and process the received data, such as an image processing request, and feed back a processing result (for example, a video clip or other data obtained by dividing a video) to an electronic device (for example, a terminal device) communicatively connected to the background server.
It should be noted that, the video similar segment searching method provided by the embodiment of the present application may be executed by the server 105, and accordingly, the video similar segment searching apparatus may be disposed in the server 105. In addition, the video similar segment searching method provided by the embodiment of the application can also be executed by the terminal equipment 101, 102 and 103, and correspondingly, the video similar segment searching device can also be arranged in the terminal equipment 101, 102 and 103.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present application is not particularly limited herein.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. When the electronic device on which the video similar segment searching method runs does not need to exchange data with other electronic devices, the system architecture may include only the electronic device (e.g., the terminal device 101, 102, 103 or the server 105) on which the video similar segment searching method runs.
The technical scheme of the embodiment of the invention is as follows: first, key frames are extracted from the query video and from the videos in the video library (hereinafter, reference videos); image features are extracted from the key frames to form feature sets; then, for each feature in the query video feature set, a similar feature candidate set is searched in the reference video feature set; the similar feature candidate sets are then filtered to generate the matched key frame pairs of the query video and the reference video; finally, the key frame pairs are sorted and merged according to preset rules, thereby locating the similar segments.
Fig. 2 shows a general flow of a video similar segment search algorithm according to an embodiment of the present invention, and fig. 3 shows major algorithm modules included in the system, which specifically includes the following steps:
S1, extracting key frames
Video can be decomposed into many sequential images. For most video processing applications this volume of data is too large, and there is substantial redundancy between the images. Therefore, to improve video processing efficiency, a subset of the images, called key frames, is generally extracted from the video as its representative.
In the invention, key frames are extracted from the video by uniform sampling, i.e. one key frame is extracted every preset number of frames N_step. Denoting the length of the query video (i.e. the total number of frames it contains) as N_total, the number of extracted key frames is N_key = ⌊N_total / N_step⌋, where ⌊·⌋ denotes rounding down. The query video and the reference video are processed in the same way to obtain the query video key frame set and the reference video key frame set respectively. Key frame extraction may also use random sampling or variable-rate sampling; the sampling parameters are not limited to the current settings.
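As an illustration of the uniform sampling step, the following minimal Python sketch extracts one key frame every N_step frames; the use of OpenCV and the function and parameter names are assumptions made here for illustration, not part of the disclosure.

```python
import cv2  # assumption: OpenCV for video decoding; the patent does not name a library


def extract_keyframes(video_path, n_step=25):
    """Uniformly sample one key frame every n_step frames.

    Sampling at frame indices n_step-1, 2*n_step-1, ... yields exactly
    floor(N_total / n_step) key frames, matching N_key = ⌊N_total / N_step⌋.
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if (index + 1) % n_step == 0:
            keyframes.append(frame)
        index += 1
    cap.release()
    return keyframes
```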
S2, extracting key frame characteristics
If the key frames were used directly for image/video similarity calculation, pixel-level features would be adopted, which is not only computationally expensive but also overly sensitive to changes in image content. Considering the expressiveness, robustness and extensibility of the features, a deep learning network is adopted to extract features from the key frame images. Deep features are generally highly robust to interference such as changes in image color, illumination, deformation and watermarks, and they can be continuously learned and upgraded to adapt to new types of images.
The present invention is illustrated with the MobileFaceNet network model ("MobileFaceNets: Efficient CNNs for Accurate Real-time Face Verification on Mobile Devices", CVPR, Apr. 2018). MobileFaceNet replaces the average pooling layer with a global depthwise convolution (GDConv) layer to produce the embedding, greatly improving recognition accuracy. It also uses the arcface loss in place of the commonly used softmax to achieve intra-class compactness. Fig. 4 is a schematic diagram of the classification boundaries of the two losses; it can be seen that the arcface loss increases the inter-class separation.
Both during training and during use, the image input to the model must be preprocessed into a standard image. The specific process is as follows:
1) The key frame image is scaled and cropped to a 112×112 image with three RGB channels.
2) All cropped images are normalized, i.e. the mean of each color channel is subtracted, and the result is divided by the standard deviation of each channel.
During training, training data is first prepared; it can be video images collected for the application or a public data set, and the category to which each image belongs must be annotated. The images are then preprocessed as described above, and a training program is run on the processed data to obtain a model that meets the performance criteria (average recognition accuracy and recall).
When this method is used to extract features, the image undergoes the same preprocessing and is then fed into the pre-trained model, but only the embedding-layer output of the model is taken, so that one 128-dimensional vector is output for each image. Finally, the feature vector is normalized to a unit vector of length 1.
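A sketch of the preprocessing and feature extraction described above is given below, assuming a model callable that returns the embedding-layer output; the per-channel mean and standard deviation values are placeholders to be replaced by the statistics actually used at training time.

```python
import numpy as np
import cv2  # assumption: OpenCV for resizing and color conversion

# placeholder per-channel statistics; replace with the values used at training time
MEAN = np.array([127.5, 127.5, 127.5], dtype=np.float32)
STD = np.array([128.0, 128.0, 128.0], dtype=np.float32)


def preprocess(image_bgr):
    """Scale/crop the key frame to a 112x112 RGB image and normalize per channel."""
    img = cv2.resize(image_bgr, (112, 112))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)
    return (img - MEAN) / STD


def extract_feature(model, image_bgr):
    """Take only the embedding-layer output (one 128-dim vector per image) and
    normalize it to a unit vector of length 1."""
    emb = model(preprocess(image_bgr))  # assumption: model returns the embedding
    emb = np.asarray(emb, dtype=np.float32).reshape(-1)
    return emb / np.linalg.norm(emb)
```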
S3, searching similar feature set of query video key frame
For each key frame Q_i of the query video, its feature is used to search for the L most similar candidates among the key frames of the reference video, and the result is stored as a list of two-tuples {(r_i1, d_i1), (r_i2, d_i2), ..., (r_ij, d_ij), ..., (r_iL, d_iL)}, where r_ij and d_ij respectively denote the index of the j-th reference video key frame similar to Q_i and the corresponding similarity measure. In practice, to improve the efficiency of subsequent processing, the search usually returns only those key frames whose similarity is greater than a preset threshold. The number of candidate key frames found may therefore be less than L, or even zero, so the length of the two-tuple list is at most L. When fewer than L candidates are found, the actual number is processed and no padding is needed.
When the number of reference videos in the video library is extremely large, a massive number of reference video features is produced, and searching these features for the candidate sets of the query video features demands enormous computing power. To reduce resource consumption and improve search efficiency, an efficient search algorithm is typically used.
The embodiment of the invention takes a KDTree (hereinafter referred to as a K-dimensional tree) nearest neighbor search algorithm as an example:
1) If the number of reference video key frames is M, the extracted features are regarded as M data points in a K-dimensional space, and the K-dimensional tree Tree_r is constructed from these data points. K is the dimensionality of the key frame features, and the data points are the key frame features.
2) If the number of query video key frames is N, the extracted features are regarded as N data points in the K-dimensional space, and the K-dimensional tree Tree_q is constructed from them.
3) Using the Euclidean distance, the L nearest neighbors in Tree_r of each point of Tree_q are searched, where the Euclidean distance between two data points s and n in the K-dimensional space is d(s, n) = √(Σ_{k=1}^{K} (s_k − n_k)²).
The process of finding the similarity set may also employ an approximate nearest neighbor method.
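As a sketch of the K-dimensional tree search, the snippet below uses scipy's cKDTree; building only the reference-side tree and querying it with the query features is equivalent in effect to searching for Tree_q's points in Tree_r. The library choice, the parameter names and the threshold d_max are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree  # assumption: scipy's KD-tree as the K-dimensional tree


def search_candidates(query_feats, ref_feats, l_neighbors=5, d_max=0.8):
    """For each query key frame feature, return up to L nearest reference key
    frames as (r_ij, d_ij) two-tuples, keeping only candidates whose Euclidean
    distance is below the preset threshold d_max (the list may be shorter than L)."""
    tree_r = cKDTree(np.asarray(ref_feats))  # Tree_r built from the M reference features
    dists, idxs = tree_r.query(np.asarray(query_feats), k=l_neighbors)
    candidates = []
    for drow, irow in zip(np.atleast_2d(dists), np.atleast_2d(idxs)):
        candidates.append([(int(r), float(d)) for r, d in zip(irow, drow) if d < d_max])
    return candidates
```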
Assuming that there are two different similar segments in the query video and the reference video, the search results will form a one-to-many mapping as described in FIG. 5, where gray and black represent key frames in the two different similar segments and white represents key frames of the non-similar segments. The solid arrows represent the correct mapping, while the dashed lines are the wrong mappings that need to be filtered out.
S4, generating a matched key frame pair of the query video and the reference video
Because of the continuity of images in a video, neighboring key frames tend to be highly similar, which means that neighboring key frames of the query video may be similar to the same reference frame of the reference video, so that the nearest neighbor key frames of several query video key frames coincide. Therefore, the most similar candidate cannot be selected directly as the match; the overall content of the video must be considered. The specific method for generating the matched key frame pairs is as follows:
1) Each pairing of a query video key frame with a reference video key frame from its similar feature candidate set is called a key frame pair p_ij. From the reference video key frame index r_ij stored in the similar feature candidate set, the position offset (hereinafter, offset) between the query video key frame and the reference video key frame is calculated as o_ij = i − r_ij.
2) The key frame pairs are grouped by offset, i.e. pairs with the same offset belong to the same group, denoted G = {g_1, g_2, ..., g_k, ..., g_K}, where K represents the number of groups. An offset histogram is then computed, and G is sorted from high to low by histogram height. A histogram counts how many values of some quantity (pixels in an image; here, the position offsets of key frame pairs) fall into each value range. The offset histogram is computed in the same way as an image histogram, except that what is counted is the number of key frame pairs having each offset value.
3) All the sorted groups are traversed, and for each key frame of the query video the optimal reference video key frame is selected according to the heat of the offset (in other words, the histogram height of the group containing that offset). The histogram value of an offset can be viewed as its heat: the higher the histogram bar of an offset, the more key frame pairs have that offset. We consider that the correct key frame mappings have identical or similar offsets, and that these offsets are much hotter than the others (i.e. than the offsets of the incorrect mappings). Selecting the optimal reference video key frame therefore amounts to selecting the candidate with the greatest heat. The specific process is as follows:
a) Create a query video key frame usage record table Tab_q and a key frame pair group list G′, both initialized to be empty.
b) Traverse all the sorted groups; for any group g_k, traverse all the key frame pairs p_ij it contains.
c) For any key frame pair p_ij, if the corresponding query video key frame is not contained in the usage record table Tab_q, add it to the table; otherwise delete the key frame pair from g_k.
d) After the traversal of a group g_k ends, if g_k is not empty, add it to G′.
e) After all groups have been traversed, G′ = {g′_1, ..., g′_m}, where m represents the number of remaining groups. G′ contains only key frame pairs in one-to-one correspondence between the query video and the reference video, i.e. the matched key frame pairs.
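A compact sketch of steps 1)–3) follows, under the assumption that candidates[i] holds the (r_ij, d_ij) list of query key frame i (as in the search sketch above); the container choices and function name are illustrative only.

```python
from collections import defaultdict


def filter_matches(candidates):
    """Group key frame pairs by offset o_ij = i - r_ij, sort the groups by
    size (offset heat), then keep at most one pair per query key frame."""
    groups = defaultdict(list)
    for i, cand in enumerate(candidates):      # candidates[i]: [(r_ij, d_ij), ...]
        for r, _d in cand:
            groups[i - r].append((i, r))       # pairs with equal offset share a group
    ordered = sorted(groups.values(), key=len, reverse=True)  # histogram height, high to low
    tab_q, g_filtered = set(), []              # usage record table Tab_q, filter set G'
    for group in ordered:
        kept = [(i, r) for i, r in group if i not in tab_q]
        tab_q.update(i for i, _r in kept)
        if kept:                               # drop groups emptied by the filtering
            g_filtered.append(kept)
    return g_filtered
```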
The matched key frame pairs can be regarded as the correct mappings of the key frames within similar segments, but they are affected by the key frame sampling step N_step: some key frames inside similar segments may not be mapped to anything. For example, the mapping between the 3rd frame of the query video and the 4th frame of the reference video in Fig. 6 is not detected.
S5, generating similar video clips
The matched key frame pairs are unordered; they must be sorted and merged according to certain rules in order to locate the continuous similar segments on the time axis. The invention groups them according to the continuity of the key frame indices. This is implemented as follows:
1) G′ is sorted from small to large by offset to obtain G* = {g*_1, g*_2, ..., g*_S}, where the offset of each group is denoted o*_k.
2) A key frame group list Ĝ is established, initialized to be empty, and all groups of G* are then traversed.
3) The first group g*_1 is added to the list Ĝ and taken as the current group of Ĝ, while its offset o*_1 is set as the current offset o_c.
4) For any other group g*_k, the difference between the group offset o*_k and the current offset o_c is compared. If it is smaller than the preset threshold o_th, the group is merged into the current group of list Ĝ; otherwise it is added to list Ĝ and taken as the current group. In both cases the offset o*_k is set as the current offset o_c.
That is, g*_1 serves as the initial current group, and the remaining groups are processed in a loop. For any group, the difference between its offset and the current offset is first compared against a threshold (in effect, testing whether the group's offset is sufficiently close to that of the current group). If the difference is smaller than the threshold (i.e. sufficiently close), the group is merged into the current group and its offset becomes the current offset. Otherwise the group is added to Ĝ as the new current group, again with its offset becoming the current offset. In general, the current group and the current offset are the basis of comparison: every group is compared with the current group to decide how it is handled, and the current group and current offset are updated after each comparison.
Merging the key frame pairs contained in each group of Ĝ, i.e. keeping only the start and end frames, each group yields a continuous sequence of key frame pairs, and the similar segments in the videos can be located from the temporal positions of these key frame pairs in the query video and the reference video. There may be several similar segments; the result is shown in Fig. 7.
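The merging of steps 1)–4) can be sketched as follows, taking the g_filtered groups from the previous sketch; the threshold value o_th = 3 is an arbitrary illustrative choice, not a value prescribed by the patent.

```python
def merge_into_segments(g_filtered, o_th=3):
    """Sort the surviving groups by offset, merge groups whose offset differs
    from the current offset by less than o_th, and keep each merged group's
    start and end key frame pairs."""
    g_sorted = sorted(g_filtered, key=lambda grp: grp[0][0] - grp[0][1])  # ascending offset
    merged, o_c = [], None
    for grp in g_sorted:
        offset = grp[0][0] - grp[0][1]         # all pairs in a group share one offset
        if merged and abs(offset - o_c) < o_th:
            merged[-1].extend(grp)             # merge into the current group
        else:
            merged.append(list(grp))           # start a new current group
        o_c = offset                           # in both cases update the current offset
    segments = []
    for grp in merged:
        grp.sort()                             # order pairs by query key frame index
        segments.append((grp[0], grp[-1]))     # (start key frame pair, end key frame pair)
    return segments
```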
In the embodiment of the invention, the continuity of images in a video is taken into account: neighboring key frames are highly similar, which means that neighboring key frames of the query video may be similar to the same reference frame of the reference video, so that the nearest neighbor key frames of several query video key frames coincide. Therefore one cannot simply pick the key frame with the maximum similarity; the overall content of the video must be considered, and the matching performed over more key frame candidates, which yields higher accuracy. When searching for key frame pairs that match between the query video and the reference video, the method does not simply take the highest similarity. Instead, it first finds the top-k nearest neighbors of each query video key frame according to similarity, and then determines the nearest neighbor of each key frame by considering the video content as a whole. In this way the image continuity within the video is fully utilized, and the key frame matching accuracy, and hence the similar segment localization accuracy, can be significantly improved. The video similar segment searching method of the embodiment of the invention can be applied to video search and piracy detection.
Still further embodiments of the present invention, as shown in fig. 8, provide a method for searching for similar video clips, including the following steps:
S110, acquiring a query video and a reference video, and extracting key frames of the query video and the reference video to acquire a query video key frame set and a reference video key frame set;
S120, extracting features of the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network to obtain a query video key frame feature set and a reference video key frame feature set;
S130, obtaining a similar candidate set for each query video key frame in the query video key frame set from the reference video key frame set by utilizing the similarity or distance values between key frame features in the query video key frame feature set and the reference video key frame feature set;
S140, generating a key frame pair set from the relation between the query video key frames and the similar candidate sets, and screening the key frame pair set by utilizing the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set;
S150, sorting the key frame pairs in the matched key frame pair set based on the position offsets of the key frame pairs, and merging the key frame pairs whose position offsets differ by less than a threshold to obtain the start key frame pair and the end key frame pair of each similar video segment;
S160, acquiring the video similar segment pairs from the query video and the reference video according to the time positions of the start key frame pairs and the end key frame pairs of the video similar segments.
According to the invention, the screening is carried out over several similar key frames and combined with the overall content of the video, so the error rate is lower; at the same time, the position offset, which reflects the global consistency, is used as the filtering condition, so erroneous key frame pairs are filtered out quickly, giving high processing speed and high efficiency.
Further, in the step S130, the step of obtaining the similar candidate set of each query video key frame in the query video key frame set from the reference video key frame set includes:
when the similarity is greater than a threshold, selecting a corresponding key frame in the reference video key frame set into the similar candidate set;
or,
establishing a first K-dimensional tree by using N query video key frame features with K dimensions in a query video key frame feature set as data points;
establishing a second K-dimensional tree by taking M reference video key frame features with K dimensions in a reference video key frame feature set as data points;
and searching, using the Euclidean distance, the key frames corresponding to a specified number of nearest neighbor data points of each point of the first K-dimensional tree in the second K-dimensional tree, and selecting these key frames into the similar candidate set, wherein the dimension K is the dimension of the key frame features.
By matching the key frames of the query video against the key frames of the reference video, all key frames whose similarity exceeds a threshold are obtained; this enriches the results of the similarity set, makes the consideration more comprehensive, and reduces the error rate.
Further, as shown in fig. 9, the step S140 includes:
S141, forming key frame pairs from the key frames in the query video key frame set and the key frames in the corresponding similar candidate sets, and acquiring, for each key frame pair, the position offset between the key frame from the query video key frame set and the key frame from the similar candidate set as the position offset of the key frame pair, wherein the key frame pairs form the key frame pair set;
S142, grouping by the position offset of the key frame pairs, placing the key frame pairs in the key frame pair set with the same position offset into the same group to form key frame pair groups, wherein the key frame pair groups form a key frame pair group set;
S143, sorting the key frame pair group set according to the number of key frame pairs in each key frame pair group to form a key frame pair group sorting set;
S144, traversing the key frame pair group sorting set and filtering it so that, for each key frame of the query video key frame set, only one key frame pair containing it is retained, obtaining a key frame pair group filter set, wherein the key frame pair group filter set serves as the matched key frame pair set.
Grouping by position offset and then sorting by the size of each group exploits the fact that erroneous matches are few in number; sorting and filtering by group size is computationally efficient and eliminates the one-to-many cases.
Further, as shown in fig. 10, the step of obtaining the key frame pair group filter set includes:
S1441, establishing a query video key frame usage record table Tab_q and a key frame pair group filter set G′, both initialized to be empty;
S1442, traversing all key frame pair groups in the key frame pair group sorting set: for any group g_k, traversing all key frame pairs p_ij contained therein; for any key frame pair p_ij, if the corresponding query video key frame is not contained in the usage record table Tab_q, adding it to the table, otherwise deleting the key frame pair from g_k;
S1443, after the traversal of a group g_k ends, if g_k is not empty, adding it to G′;
S1444, after all groups have been traversed, obtaining the key frame pair group filter set G′ = {g′_1, ..., g′_m}, wherein m represents the number of remaining groups.
Eliminating the one-to-many mappings by traversal means only a single pass over the groups is needed, so the algorithm is efficient.
Further, the step S143 includes:
and sorting the key frame pair groups in descending order of the number of key frame pairs they contain to form the key frame pair group sorting set.
Further, as shown in fig. 11, the step S150 includes:
S151, sorting the key frame pair group filter set G′ from small to large by offset to obtain a key frame pair group filter-sorting set G* = {g*_1, ..., g*_S}, wherein the offset of each group is denoted o*_k;
S152, establishing a key frame group list Ĝ, initialized to be empty;
S153, for the first group g*_1, adding it to the list Ĝ and taking it as the current group of the list Ĝ, while setting its offset o*_1 as the current offset o_c;
S154, taking any group g*_k remaining in the key frame pair group filter-sorting set, and comparing the difference between the group offset o*_k and the current offset o_c: if it is smaller than the preset threshold o_th, merging the group into the current group of list Ĝ; otherwise, adding it to list Ĝ and taking it as the current group of the list Ĝ; then deleting the group g*_k from G*;
S155, setting the offset o*_k as the current offset o_c, and repeatedly executing step S154 until G* is empty;
S156, outputting G⁺ = {(g_1start, g_1end)_1, ..., (g_nstart, g_nend)_n, ..., (g_Tstart, g_Tend)_T}, wherein T is the number of similar segments, g_nstart represents the start key frame pair of the n-th similar segment, and g_nend represents the end key frame pair of the n-th similar segment.
Merging the groups whose position offsets differ by less than a threshold takes the integrity of the video into account. For example, when the query video and the reference video are identical, the position offsets between the key frames all lie within the threshold range, and the entire video can be treated as one similar segment. As another example, a reference video may contain similar segments from several parts of the query video that are separated by small intervals in the query video; to evade detection, the intervals between the similar segments in the reference video are sometimes deliberately enlarged, and when merging according to the embodiment of the invention these then form two separate similar segments, because the position offsets differ too much to be merged.
Further, the deep learning network includes any one of VGG, DenseNet, UNet, ResNet, and MobileFaceNet.
In the embodiment of the invention, the continuity of images in a video is taken into account: neighboring key frames are highly similar, which means that neighboring key frames of the query video may be similar to the same reference frame of the reference video, so that the nearest neighbor key frames of several query video key frames coincide. Therefore one cannot simply pick the key frame with the maximum similarity; the overall content of the video must be considered, and the matching performed over more key frame candidates, which yields higher accuracy. When searching for key frame pairs that match between the query video and the reference video, the method does not simply take the highest similarity. Instead, it first finds the top-k nearest neighbors of each query video key frame according to similarity, and then determines the nearest neighbor of each key frame by considering the video content as a whole. In this way the image continuity within the video is fully utilized, and the key frame matching accuracy, and hence the similar segment localization accuracy, can be significantly improved. The video similar segment searching method of the embodiment of the invention can be applied to video search and piracy detection.
Based on the above video similar segment searching method, another embodiment of the present invention, as shown in fig. 12, provides a video similar segment searching apparatus 109, which includes:
the frame extraction module 110 is configured to obtain a query video and a reference video, extract key frames of the query video and the reference video, and obtain a query video key frame set and a reference video key frame set;
the deep learning network module 120 is configured to perform feature extraction on the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network, so as to obtain a query video key frame feature set and a reference video key frame feature set;
a similarity candidate set generating module 130, configured to obtain, from a reference video key frame set, a similarity candidate set of each query video key frame in the query video key frame set by using a similarity or a distance value between key frame features in the query video key frame feature set and the reference video key frame feature set;
the matching key frame pair set generating module 140 is configured to generate a key frame pair set from the relation between the query video key frames and the similar candidate sets, and to screen the key frame pair set by using the position offsets of the key frame pairs in the key frame pair set to generate a matching key frame pair set;
The similar segment start key frame pair obtaining module 150 is configured to sort key frame pairs in the matching key frame pair set based on the position offsets of the key frame pairs, and then combine the key frame pairs with the difference of the position offsets within a threshold value to obtain a start key frame pair and a stop key frame pair of the similar segment of the video;
the similar segment pair obtaining module 160 is configured to obtain a video similar segment pair from the query video and the reference video according to the time positions of the start key frame pair and the end key frame pair of the video similar segment.
The specific execution steps of the above modules are described in detail in the corresponding steps in the video similar segment searching method, and are not described in detail herein.
Referring now to FIG. 13, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing the control device of an embodiment of the present application. The control device shown in fig. 13 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present application.
As shown in fig. 13, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. The above-described functions defined in the method of the present application are performed when the computer program is executed by the Central Processing Unit (CPU) 801.
The computer readable medium according to the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Python, Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, for example described as: a processor comprising an acquisition unit, a segmentation unit, a determination unit, and a selection unit. In some cases, the names of these units do not limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires a drawing image to be processed".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a query video and a reference video, and extract key frames of the query video and the reference video to obtain a query video key frame set and a reference video key frame set; extract features of the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network to obtain a query video key frame feature set and a reference video key frame feature set; obtain a similar candidate set for each query video key frame in the query video key frame set from the reference video key frame set by using the similarity or distance values between the key frame features in the query video key frame feature set and the reference video key frame feature set; generate a key frame pair set from the relation between the query video key frames and their similar candidate sets, and screen the key frame pair set by using the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set; sort the key frame pairs in the matched key frame pair set based on their position offsets, and then merge key frame pairs whose position offsets differ by less than a threshold to obtain the start and end key frame pairs of the similar video segments; and acquire the video similar segment pairs from the query video and the reference video according to the time positions of the start and end key frame pairs of the video similar segments.
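For orientation, a minimal sketch of this flow follows, assuming the key frame features have already been extracted as NumPy arrays; every name and threshold in it (candidate_sets, merge_by_offset, sim_threshold, offset_threshold) is illustrative rather than taken from the patent, and the duplicate-query-key-frame filtering of step S144 in the claims below is deliberately omitted for brevity.

```python
from collections import defaultdict

import numpy as np


def candidate_sets(query_feats, ref_feats, sim_threshold=0.85):
    """One reading of the candidate-set step: for each query key frame
    feature, collect indices of reference key frames whose cosine
    similarity exceeds a threshold."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sims = q @ r.T                            # (N_query, N_ref) similarities
    return [np.flatnonzero(row > sim_threshold) for row in sims]


def merge_by_offset(pairs, offset_threshold=2):
    """Group (query_idx, ref_idx) key frame pairs by position offset, merge
    groups whose offsets differ by less than the threshold, and return the
    start and end key frame pair of each merged group."""
    groups = defaultdict(list)
    for qi, ri in pairs:
        groups[ri - qi].append((qi, ri))
    segments, current, current_offset = [], None, None
    for offset in sorted(groups):             # ascending offset order
        if current is not None and offset - current_offset < offset_threshold:
            current.extend(groups[offset])    # merge into the current group
        else:
            current = list(groups[offset])    # this group becomes current
            segments.append(current)
        current_offset = offset
    # Start/end key frame pairs of each similar segment, by query position.
    return [(min(g), max(g)) for g in segments]
```

A caller would then map each start and end key frame pair back to time positions in the query and reference videos to cut out the similar segment pair.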
The above description is merely illustrative of preferred embodiments of the present application and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (6)

1. A video similar segment searching method, characterized by comprising the following steps:
S110, acquiring a query video and a reference video, and extracting key frames of the query video and the reference video to obtain a query video key frame set and a reference video key frame set;
S120, extracting features of the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network to obtain a query video key frame feature set and a reference video key frame feature set; the deep learning network comprises any one of VGG, DenseNet, UNet, ResNet, or MobileFaceNet;
S130, obtaining similar candidate sets of all query video key frames in the query video key frame set from the reference video key frame set by utilizing similarity or distance values between key frame features in the query video key frame feature set and the reference video key frame feature set;
S140, generating a key frame pair set from the relation between the query video key frames and their similar candidate sets, and screening the key frame pair set by using the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set;
S140 specifically includes:
S141, forming key frame pairs from the key frames in the query video key frame set and the key frames in the corresponding similar candidate sets, and obtaining, for each key frame pair, the position offset between the query video key frame and the similar candidate key frame as the position offset of that key frame pair, the key frame pairs forming the key frame pair set;
S142, grouping the key frame pairs according to their position offsets, putting key frame pairs with the same position offset in the key frame pair set into the same group to form key frame pair groups, the key frame pair groups forming a key frame pair group set;
S143, sorting the key frame pair group set according to the number of key frame pairs in each key frame pair group to form a key frame pair group sorted set;
S144, traversing the key frame pair group sorted set and filtering it so that, among key frame pairs sharing the same query video key frame, only one is retained, to obtain a key frame pair group filter set, which serves as the matched key frame pair set;
the step of obtaining the key frame pair group filter set in S144 includes:
S1441, establishing a query video key frame usage record table Tab_q and a key frame pair group filter set G′, both initialized to be empty;
S1442, traversing all key frame pair groups in the key frame pair group sorted set; for any key frame pair group G_i, traversing all key frame pairs p_ij contained therein; for any key frame pair p_ij, if the corresponding query video key frame is not contained in the usage record table Tab_q, adding that key frame to the table; otherwise, deleting the key frame pair from G_i;
S1443, after traversing a key frame pair group G_i, if G_i is not empty, adding it to G′;
S1444, after traversing all key frame pair groups, obtaining the key frame pair group filter set G′ = {G′_1, ..., G′_M}, where M represents the number of retained key frame pair groups;
S150, sorting the key frame pairs in the matched key frame pair set based on their position offsets, and then merging key frame pairs whose position offsets differ by less than a threshold to obtain the start key frame pair and end key frame pair of each similar video segment;
S150 specifically includes:
S151, sorting the key frame pair group filter set G′ by offset from small to large to obtain a key frame pair group filter-sorted set G* = {G*_1, ..., G*_M}, where the offset of each group G*_n is denoted o*_n;
S152, establishing a key frame group list Ĝ, initialized to be empty;
S153, for the first group G*_1, adding it to the list Ĝ and taking it as the current group of the list Ĝ, while setting its offset o*_1 as the current offset o_c;
S154, taking any group G*_n remaining in the key frame pair group filter-sorted set, and computing the difference between the group offset o*_n and the current offset o_c; if the difference is smaller than a preset threshold o_th, merging the group into the current group of the list Ĝ; otherwise, adding it to the list Ĝ and taking it as the current group of the list Ĝ; deleting the group G*_n from G*;
S155, setting the offset o*_n as the current offset o_c, and repeating step S154 until G* is empty;
S156, outputting Ĝ = {(g_1start, g_1end), ..., (g_nstart, g_nend), ..., (g_Tstart, g_Tend)}, where T is the number of similar segments, g_nstart represents the start key frame pair of the n-th similar segment, and g_nend represents the end key frame pair of the n-th similar segment;
S160, acquiring the video similar segment pairs from the query video and the reference video according to the time positions of the start key frame pairs and the end key frame pairs of the video similar segments.
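As a reading aid for steps S141 through S144 above, the following is a minimal sketch of the offset grouping and the Tab_q usage-record filtering; representing Tab_q as a Python set and the group set as a dictionary are assumptions made for illustration, not the patent's implementation.

```python
from collections import defaultdict


def filter_pair_groups(pairs):
    """Group key frame pairs by position offset (S142), sort the groups by
    descending size (S143), then keep at most one pair per query key frame
    using a usage record table, Tab_q in the claim (S144)."""
    groups = defaultdict(list)                 # S142: offset -> pairs
    for qi, ri in pairs:
        groups[ri - qi].append((qi, ri))
    ordered = sorted(groups.values(), key=len, reverse=True)   # S143
    tab_q = set()                              # S1441: usage record table
    filtered = []                              # S1441: filter set G'
    for group in ordered:                      # S1442: traverse the groups
        kept = []
        for qi, ri in group:
            if qi not in tab_q:                # query key frame unused so far
                tab_q.add(qi)
                kept.append((qi, ri))
        if kept:                               # S1443: keep non-empty groups
            filtered.append(kept)
    return filtered                            # S1444: G' = {G'_1, ..., G'_M}
```

For example, filter_pair_groups([(0, 5), (1, 6), (1, 9)]) keeps the offset-5 group {(0, 5), (1, 6)} intact and drops (1, 9), because query key frame 1 has already been recorded in tab_q by the larger group.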
2. The method according to claim 1, wherein in step S130, the step of obtaining the similar candidate set of each query video key frame in the set of query video key frames from the set of reference video key frames comprises:
when the similarity is greater than a threshold, selecting a corresponding key frame in the reference video key frame set into the similar candidate set;
or,
establishing a first K-dimensional tree by using N query video key frame features with K dimensions in a query video key frame feature set as data points;
establishing a second K-dimensional tree by taking M reference video key frame features with K dimensions in a reference video key frame feature set as data points;
and searching, in the second K-dimensional tree, for the key frames corresponding to a specified number of nearest-neighbor data points of each data point of the first K-dimensional tree under the Euclidean distance, and selecting these key frames into the similar candidate set, wherein the dimension K is the dimension of the key frame features.
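A compact sketch of the K-dimensional tree variant recited above; SciPy's cKDTree is an assumption for illustration (the claim names no library), and since the nearest-neighbour lookup itself only requires a tree over the reference features, the sketch builds only that one.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_candidates(query_feats, ref_feats, k=5):
    """For each K-dimensional query key frame feature, return the indices of
    its k nearest reference key frame features under Euclidean distance."""
    tree = cKDTree(np.asarray(ref_feats))      # K-d tree over reference features
    _, idx = tree.query(np.asarray(query_feats), k=k)
    return idx                                 # shape (N_query, k) for k > 1
```

The key frames whose indices appear in a query's row then form that query key frame's similar candidate set.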
3. The video similar segment searching method according to claim 1, wherein said step S143 comprises:
sorting the key frame pair groups in descending order of the number of key frame pairs they contain, to form the key frame pair group sorted set.
4. A video similar segment searching apparatus, comprising:
the frame extraction module is used for acquiring a query video and a reference video, extracting key frames of the query video and the reference video, and acquiring a query video key frame set and a reference video key frame set;
the deep learning network module is used for extracting features of the key frame images in the query video key frame set and the reference video key frame set by using a deep learning network to obtain the query video key frame feature set and the reference video key frame feature set; the deep learning network comprises any one of VGG, DenseNet, UNet, ResNet, or MobileFaceNet;
the similarity candidate set generation module is used for acquiring a similarity candidate set of each query video key frame in the query video key frame set from the reference video key frame set by utilizing a similarity or distance value between the key frame features in the query video key frame feature set and the reference video key frame feature set;
the matched key frame pair set generation module is used for generating a key frame pair set from the relation between the query video key frames and their similar candidate sets, and screening the key frame pair set by using the position offsets of the key frame pairs in the key frame pair set to generate a matched key frame pair set, which specifically includes:
forming key frame pairs from the key frames in the query video key frame set and the key frames in the corresponding similar candidate sets, and obtaining, for each key frame pair, the position offset between the query video key frame and the similar candidate key frame as the position offset of that key frame pair, the key frame pairs forming the key frame pair set;
grouping the key frame pairs according to their position offsets, putting key frame pairs with the same position offset in the key frame pair set into the same group to form key frame pair groups, the key frame pair groups forming a key frame pair group set;
sorting the key frame pair group set according to the number of key frame pairs in each key frame pair group to form a key frame pair group sorted set;
traversing the key frame pair group sorted set and filtering it so that, among key frame pairs sharing the same query video key frame, only one is retained, to obtain a key frame pair group filter set, which serves as the matched key frame pair set;
the step of obtaining the key frame pair group filter set includes:
establishing a query video key frame usage record table Tab_q and a key frame pair group filter set G′, both initialized to be empty;
traversing all key frame pair groups in the key frame pair group sorted set; for any key frame pair group G_i, traversing all key frame pairs p_ij contained therein; for any key frame pair p_ij, if the corresponding query video key frame is not contained in the usage record table Tab_q, adding that key frame to the table; otherwise, deleting the key frame pair from G_i;
after traversing a key frame pair group G_i, if G_i is not empty, adding it to G′;
after traversing all key frame pair groups, obtaining the key frame pair group filter set G′ = {G′_1, ..., G′_M}, where M represents the number of retained key frame pair groups;
the similar segment start/end key frame pair acquisition module is used for sorting the key frame pairs in the matched key frame pair set based on their position offsets, and then merging key frame pairs whose position offsets differ by less than a threshold to obtain the start key frame pair and end key frame pair of each similar video segment, which specifically includes:
sorting the key frame pair group filter set G′ by offset from small to large to obtain a key frame pair group filter-sorted set G* = {G*_1, ..., G*_M}, where the offset of each group G*_n is denoted o*_n;
establishing a key frame group list Ĝ, initialized to be empty;
for the first group G*_1, adding it to the list Ĝ and taking it as the current group of the list Ĝ, while setting its offset o*_1 as the current offset o_c;
taking any group G*_n remaining in the key frame pair group filter-sorted set, and computing the difference between the group offset o*_n and the current offset o_c; if the difference is smaller than a preset threshold o_th, merging the group into the current group of the list Ĝ; otherwise, adding it to the list Ĝ and taking it as the current group of the list Ĝ; deleting the group G*_n from G*;
setting the offset o*_n as the current offset o_c, and repeating the preceding step until G* is empty;
outputting Ĝ = {(g_1start, g_1end), ..., (g_nstart, g_nend), ..., (g_Tstart, g_Tend)}, where T is the number of similar segments, g_nstart represents the start key frame pair of the n-th similar segment, and g_nend represents the end key frame pair of the n-th similar segment;
and the similar segment pair acquisition module is used for acquiring the video similar segment pair from the query video and the reference video according to the time positions of the start key frame pair and the end key frame pair of the video similar segment.
5. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
6. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-3.
CN202010245319.6A 2020-03-31 2020-03-31 Video similar segment searching method and device Active CN111651636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010245319.6A CN111651636B (en) 2020-03-31 2020-03-31 Video similar segment searching method and device

Publications (2)

Publication Number Publication Date
CN111651636A CN111651636A (en) 2020-09-11
CN111651636B (en) 2023-11-24

Family

ID=72346415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010245319.6A Active CN111651636B (en) 2020-03-31 2020-03-31 Video similar segment searching method and device

Country Status (1)

Country Link
CN (1) CN111651636B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112437340B (en) * 2020-11-13 2023-02-21 广东省广播电视局 Method and system for determining whether variant long advertisements exist in audio and video
CN113139093A (en) * 2021-05-06 2021-07-20 北京百度网讯科技有限公司 Video search method and apparatus, computer device, and medium
CN113254703A (en) * 2021-05-12 2021-08-13 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113255484B (en) * 2021-05-12 2023-10-03 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113435328B (en) * 2021-06-25 2024-05-31 上海众源网络有限公司 Video clip processing method and device, electronic equipment and readable storage medium
CN113255625B (en) * 2021-07-14 2021-11-05 腾讯科技(深圳)有限公司 Video detection method and device, electronic equipment and storage medium
CN114286174B (en) * 2021-12-16 2023-06-20 天翼爱音乐文化科技有限公司 Video editing method, system, equipment and medium based on target matching
CN114650435B (en) * 2022-02-23 2023-09-05 京东科技信息技术有限公司 Method and device for searching repeated segments in video and related equipment
CN115103223B (en) * 2022-06-02 2023-11-10 咪咕视讯科技有限公司 Video content detection method, device, equipment and storage medium
CN114782879B (en) * 2022-06-20 2022-08-23 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137892A1 (en) * 2016-11-16 2018-05-17 Adobe Systems Incorporated Robust tracking of objects in videos
CN108427925A (en) * 2018-03-12 2018-08-21 中国人民解放军国防科技大学 Copy video detection method based on continuous copy frame sequence
CN109165574A (en) * 2018-08-03 2019-01-08 百度在线网络技术(北京)有限公司 video detecting method and device

Similar Documents

Publication Publication Date Title
CN111651636B (en) Video similar segment searching method and device
CN111327945B (en) Method and apparatus for segmenting video
AU2016250276B2 (en) Systems and methods for reducing data density in large datasets
CN108353208B (en) Optimizing media fingerprint retention to improve system resource utilization
US9576221B2 (en) Systems, methods, and devices for image matching and object recognition in images using template image classifiers
US20230376527A1 (en) Generating congruous metadata for multimedia
US20220172476A1 (en) Video similarity detection method, apparatus, and device
KR20180031024A (en) Future audience prediction of video segments to optimize system resource utilization
CN109583389B (en) Drawing recognition method and device
CN111209431A (en) Video searching method, device, equipment and medium
CN110688524A (en) Video retrieval method and device, electronic equipment and storage medium
WO2020125100A1 (en) Image search method, apparatus, and device
CN115443490A (en) Image auditing method and device, equipment and storage medium
US20130343618A1 (en) Searching for Events by Attendants
US20230222762A1 (en) Adversarially robust visual fingerprinting and image provenance models
CN109697240A (en) A kind of image search method and device based on feature
CN109919220B (en) Method and apparatus for generating feature vectors of video
Fernandez et al. Active image indexing
CN111444364B (en) Image detection method and device
CN112487943B (en) Key frame de-duplication method and device and electronic equipment
Battiato et al. In-depth DCT coefficient distribution analysis for first quantization Estimation
CN113742485A (en) Method and device for processing text
CN111639198A (en) Media file identification method and device, readable medium and electronic equipment
CN116049660B (en) Data processing method, apparatus, device, storage medium, and program product
US20240073478A1 (en) Determining video provenance utilizing deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant