CN113486788A - Video similarity determination method and device, electronic equipment and storage medium - Google Patents

Video similarity determination method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113486788A
CN113486788A (application CN202110757741.4A)
Authority
CN
China
Prior art keywords
video
shot
video frame
similarity
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110757741.4A
Other languages
Chinese (zh)
Inventor
尹芳
肖劲
刘霄晨
张晓刚
罗永贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianren Healthcare Big Data Technology Co Ltd
Original Assignee
Lianren Healthcare Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianren Healthcare Big Data Technology Co Ltd filed Critical Lianren Healthcare Big Data Technology Co Ltd
Priority to CN202110757741.4A priority Critical patent/CN113486788A/en
Publication of CN113486788A publication Critical patent/CN113486788A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/22 — Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Studio Devices (AREA)

Abstract

The embodiment of the invention discloses a video similarity determination method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a first video and a second video whose similarity is to be determined; parsing the first video and the second video, determining a first parsed video frame corresponding to each shot in the first video, and determining a second parsed video frame corresponding to each shot in the second video; determining the shot similarity between each shot in the first video and each shot in the second video based on the first parsed video frames corresponding to the shots in the first video and the second parsed video frames corresponding to the shots in the second video; and determining the video similarity of the first video and the second video based on the shot similarity. The technical solution of the embodiment of the invention solves the technical problem of low video similarity calculation efficiency in prior-art methods for determining video similarity, improves the calculation efficiency of the video similarity, and achieves the technical effect of determining the video similarity more quickly.

Description

Video similarity determination method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of video identification, in particular to a method and a device for determining video similarity, electronic equipment and a storage medium.
Background
Video, as a major media type, has increasingly become an indispensable information carrier in human life, education, entertainment, and the like. In the video production process, individual shots are filmed first, and the video clips captured by multiple shots are then linked together to form a complete video. However, because the amount of similar video data is huge, storage space is wasted, and the retrieval, management, and protection of videos become quite difficult.
In the prior art, the similarity of videos is generally determined through video fingerprints corresponding to the videos. A video fingerprint is obtained by extracting feature information from the video content and then deriving a digital sequence corresponding to the video from that feature information. Since similar videos should have similar video fingerprints, the similarity of two videos can be determined by comparing their video fingerprints.

However, when the similarity of videos is determined through the corresponding video fingerprints, a large number of feature vectors are generated in the process of determining the fingerprints, so a large number of feature vectors must be computed during the whole fingerprint matching process, which leads to low video similarity calculation efficiency.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining video similarity, electronic equipment and a storage medium, which are used for improving the calculation efficiency of the video similarity so as to achieve the technical effect of more quickly determining the video similarity.
In a first aspect, an embodiment of the present invention provides a method for determining video similarity, where the method includes:
acquiring a first video and a second video with similarity to be determined;
analyzing the first video and the second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video;
determining shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video;
and determining the video similarity of the first video and the second video based on the shot similarity.
In a second aspect, an embodiment of the present invention further provides a video similarity determining apparatus, where the apparatus includes:
the video acquisition module is used for acquiring a first video and a second video with similarity to be determined;
the video parsing module is used for parsing the first video and the second video, determining a first parsed video frame corresponding to each shot in the first video, and determining a second parsed video frame corresponding to each shot in the second video;
a shot similarity determination module, configured to determine a shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video;
and the video similarity determining module is used for determining the video similarity of the first video and the second video based on the shot similarity.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to perform the video similarity determination method as provided by any embodiment of the present invention.
In a fourth aspect, embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements video similarity determination as provided in any of the embodiments of the present invention.
According to the technical scheme of the embodiment, the first video and the second video with the similarity to be determined are obtained. The method comprises the steps of analyzing a first video and a second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video. And determining shot similarity between each shot in the first video and each shot in the second video based on the first analysis video frame corresponding to each shot in the first video and the second analysis video frame corresponding to each shot in the second video. The shot similarity between each shot in the first video and each shot in the second video is calculated for the videos by taking the shot as a unit, so that the shot similarity between each shot in the first video and each shot in the second video can be determined more quickly. The video similarity of the first video and the second video is determined based on the shot similarity, the technical problem that the video similarity calculation efficiency is low in a method for determining the video similarity in the prior art is solved, the calculation efficiency of the video similarity is improved, and the technical effect of determining the video similarity more quickly is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, a brief description is given below of the drawings used in describing the embodiments. It should be clear that the described figures are only views of some of the embodiments of the invention to be described, not all, and that for a person skilled in the art, other figures can be derived from these figures without inventive effort.
Fig. 1 is a schematic flowchart of a video similarity determining method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a video similarity determination method according to a second embodiment of the present invention;
fig. 3 is a schematic diagram of a video similarity determination apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a video similarity determining method according to an embodiment of the present invention, which is applicable to a case where a video similarity is obtained by calculating a shot similarity, where the method may be executed by a video similarity determining apparatus, the video similarity determining apparatus may be implemented by software and/or hardware, and the video similarity determining apparatus may be integrated in an electronic device such as a computer or a server.
As shown in fig. 1, the method of the present embodiment includes:
and S110, acquiring a first video and a second video with similarity to be determined.
The first video and the second video can be understood as the two videos whose video similarity is to be determined at the current moment.
Specifically, a first video and a second video with similarity to be determined are obtained, wherein there are many ways to obtain the first video and the second video, and the specific obtaining way is not limited herein. For example, a video input by a user may be received as the first video or the second video, or a video recorded by a camera may be received as the first video or the second video, or the first video or the second video may be acquired by web crawler technology. It should be noted that the video sources of the first video and the second video may be the same or different.
It should be noted that "first" and "second" in the first video and the second video are only used to distinguish the two videos, and are not limited to the order of the videos.
S120, analyzing the first video and the second video, determining a first analysis video frame corresponding to each shot in the first video, and determining a second analysis video frame corresponding to each shot in the second video.
The first analysis video frame may be understood as a video frame corresponding to each shot obtained by analyzing the first video. The number of first parsed video frames may be one or more frames. The second parsed video frame may be understood as a video frame corresponding to each shot obtained by parsing the second video. The number of second parsed video frames may be one or more frames.
Specifically, the first video and the second video may be parsed based on a preset video parsing mode. When parsing of the first video is completed, a first parsed video frame corresponding to each shot in the first video may be determined. When parsing of the second video is completed, a second parsed video frame corresponding to each shot in the second video may be determined. The preset video parsing technology may be any video parsing method that takes the shot as its unit, for example, a shot-based video scene detection method, or a machine-learning method that cuts the video shot by shot.
It should be noted that "first" and "second" in the first parsed video frame and the second parsed video frame are only for distinguishing the parsed video frames, and are not limited to the order of parsing the video frames.
Optionally, the first video may be parsed according to the following steps, and a first parsed video frame corresponding to each shot in the first video is determined:
step one, analyzing a first video, and determining a first analyzed video frame included in the first video.
Specifically, the first video is decomposed, and when the first video is decomposed, all first analysis video frames included in the first video can be obtained.
And step two, calculating a histogram difference value of two adjacent first parsed video frames, and determining a first parsed video frame corresponding to each shot in the first video based on the histogram difference value.
Wherein, the histogram difference value can be used to reflect the difference between two adjacent first analysis video frame images.
In particular, two adjacent first parsed video frames may be determined from the sequence of frames of each first parsed video frame in the first video. And respectively calculating histograms corresponding to the two adjacent first analysis video frames according to the two adjacent first analysis video frames. Further, a histogram difference value of two adjacent first parsed video frames may be calculated from histograms of the two adjacent first parsed video frames. After determining the histogram difference, a first parsed video frame corresponding to each shot in the first video may be determined based on the histogram difference.
Optionally, to better reflect the difference between two adjacent first parsed video frames, the HSV space (Hue: H, Saturation: S, and Value: V) may be introduced, and the histogram difference of two adjacent first parsed video frames is calculated according to the following formula:
χ² = Σ_{i=1}^{k} (h_m(i) − h_n(i))² / max(h_m(i), h_n(i))

wherein χ² represents the histogram difference of two adjacent first parsed video frames, h_m(i) and h_n(i) represent the histograms of the H components of the m-th frame and the n-th frame, respectively, and k represents the number of color quantization levels of the first parsed video frame. Note that Hue represents the hue space, and the H component can be understood as the component of the hue space.
Specifically, for each color quantization level of the first parsed video frame, the histograms of two adjacent first parsed video frames may be subtracted, following the frame sequence of the first parsed video frames, to obtain an inter-frame histogram difference value. To reduce the computational complexity, the inter-frame histogram difference value may be squared to obtain an inter-frame histogram squared value. In order to improve the accuracy of the histogram difference of two adjacent first parsed video frames, this squared value may be divided by the maximum of the histograms of the two adjacent first parsed video frames, which gives the histogram difference of the two adjacent first parsed video frames at that color quantization level. Summing the contributions of all color quantization levels gives the histogram difference value of the two adjacent first parsed video frames.
Optionally, the first parsed video frame corresponding to each shot in the first video is determined based on the histogram difference by:
The histogram difference is compared with a preset video frame segmentation threshold. If the histogram difference is greater than the video frame segmentation threshold, it is determined that a shot change occurs at that position, and the first parsed video frame whose histogram difference exceeds the threshold is taken as a shot change position of the first video. The first parsed video frames corresponding to each shot in the first video can then be determined from the shot change positions.
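For illustration only, the shot-boundary detection described above could be sketched in Python with OpenCV and NumPy roughly as follows; the quantization level k, the segmentation threshold, and the function names are assumptions made for this sketch, not values prescribed by the embodiment.

```python
import cv2
import numpy as np

def h_histogram(frame_bgr, k=32):
    """Histogram of the H (hue) component in HSV space with k quantization levels."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    return cv2.calcHist([hsv], [0], None, [k], [0, 180]).ravel()

def histogram_difference(hist_m, hist_n):
    """Chi-square-like difference between the H histograms of two adjacent frames."""
    denom = np.maximum(np.maximum(hist_m, hist_n), 1e-6)  # avoid division by zero
    return float(np.sum((hist_m - hist_n) ** 2 / denom))

def detect_shot_boundaries(video_path, threshold=5000.0, k=32):
    """Return frame indices where the histogram difference exceeds the segmentation threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = h_histogram(frame, k)
        if prev_hist is not None and histogram_difference(prev_hist, hist) > threshold:
            boundaries.append(idx)  # assume a shot change occurs at this frame
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```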
S130, determining shot similarity between each shot in the first video and each shot in the second video based on a first analysis video frame corresponding to each shot in the first video and a second analysis video frame corresponding to each shot in the second video.
The shot similarity may be a similarity between one shot in the first video and one shot in the second video, and may be used to reflect a difference between the one shot in the first video and the one shot in the second video.
Specifically, feature extraction may be performed on a first analysis video frame corresponding to each shot in the first video and a second analysis video frame corresponding to each shot in the second video by using a video frame feature extraction method, so as to determine a feature value corresponding to each shot in the first video and a feature value corresponding to each shot in the second video. Thus, the shot similarity between each shot in the first video and each shot in the second video may be determined based on the feature value corresponding to each shot in the first video and the feature value corresponding to each shot in the second video.
Optionally, the shot similarity between each shot in the first video and each shot in the second video may be an average of the similarity between each two of the first analysis video frame corresponding to each shot in the first video and the second analysis video frame corresponding to each shot in the second video.
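As a rough sketch of this optional pairwise-average interpretation (the frame-level feature extractor and helper name are placeholders, not the embodiment's own implementation):

```python
import numpy as np

def pairwise_average_similarity(frames_a, frames_b, extract_feature):
    """Average cosine similarity over all pairs of parsed frames from one shot of each video."""
    feats_a = [extract_feature(f) for f in frames_a]
    feats_b = [extract_feature(f) for f in frames_b]
    sims = [
        float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))
        for fa in feats_a
        for fb in feats_b
    ]
    return float(np.mean(sims)) if sims else 0.0
```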
And S140, determining the video similarity between the first video and the second video based on the shot similarity.
The video similarity can be understood as the similarity of the first video and the second video, and can be used for reflecting the difference between the first video and the second video.
Specifically, after determining the shot similarity between each shot in the first video and each shot in the second video, the average value of the shot similarities between each shot in the first video and each shot in the second video may be used as the video similarity between the first video and the second video.
Optionally, in order to improve the accuracy of the video similarity between the first video and the second video, the video similarity between the first video and the second video may be determined based on the shot similarity in the following manner:
and presetting a shot similarity threshold, and if the shot similarity between each shot in the first video and each shot in the second video is greater than the shot similarity threshold, keeping the shot similarity greater than the shot similarity threshold. And averaging and summing the reserved shot similarities to obtain a shot similarity mean value. After the shot similarity mean value is obtained, the shot similarity mean value can be used as the video similarity of the first video and the second video.
According to the technical scheme of the embodiment, the first video and the second video with the similarity to be determined are obtained. The method comprises the steps of analyzing a first video and a second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video. And determining shot similarity between each shot in the first video and each shot in the second video based on the first analysis video frame corresponding to each shot in the first video and the second analysis video frame corresponding to each shot in the second video. The shot similarity between each shot in the first video and each shot in the second video is calculated for the videos by taking the shot as a unit, so that the shot similarity between each shot in the first video and each shot in the second video can be determined more quickly. The video similarity of the first video and the second video is determined based on the shot similarity, the technical problem that the video similarity calculation efficiency is low in a method for determining the video similarity in the prior art is solved, the calculation efficiency of the video similarity is improved, and the technical effect of determining the video similarity more quickly is achieved.
Example two
Fig. 2 is a schematic flow chart of a method for determining video similarity according to a second embodiment of the present invention. On the basis of the foregoing embodiments, optionally, the determining the shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video includes: filtering the first parsed video frames corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video; filtering the second parsed video frames corresponding to each shot in the second video to obtain a second target video frame corresponding to each shot in the second video; and determining the shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame.

Optionally, the filtering the first parsed video frames corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video includes: for each first parsed video frame corresponding to each shot in the first video, calculating the brightness value of the current first parsed video frame based on the preset weight values defined for the three color channels; and if the brightness value of the current first parsed video frame reaches a preset frame brightness threshold, taking the current video frame as the first target video frame.

Optionally, the filtering the first parsed video frames corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video further includes: for each first parsed video frame corresponding to each shot in the first video, performing graying processing on the current first parsed video frame, and calculating a blur value of the grayed current first parsed video frame based on preset horizontal and vertical gradient values; and if the blur value of the current first parsed video frame reaches a preset frame blur threshold, taking the current first parsed video frame as the first target video frame.

Optionally, the filtering the first parsed video frames corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video further includes: for each first parsed video frame corresponding to each shot in the first video, performing histogram equalization processing on the current first parsed video frame and determining the reference gray level corresponding to each pixel point in the current first parsed video frame; determining at least one target gray level corresponding to the current first parsed video frame based on each reference gray level corresponding to the current first parsed video frame and a preset proportion range; calculating the sum of the target gray levels to obtain an image equalization value corresponding to the current first parsed video frame; and if the image equalization value corresponding to the current first parsed video frame reaches a preset frame equalization degree threshold, taking the current first parsed video frame as the first target video frame.
The technical terms that are the same as or corresponding to the above embodiments are not repeated herein.
As shown in fig. 2, the method of the embodiment may specifically include:
s210, acquiring a first video and a second video with similarity to be determined.
S220, analyzing the first video and the second video, determining a first analysis video frame corresponding to each shot in the first video, and determining a second analysis video frame corresponding to each shot in the second video.
S230, filtering the first analysis video frame corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video.
The first target video frame may be a video frame to be processed corresponding to each shot in the first video, and may be used to calculate a shot similarity between each shot in the first video and each shot in the second video.
Specifically, low-quality video frames, such as video frames with poor definition, blurred video frames, or video frames with uneven gray levels, may exist among the first parsed video frames corresponding to each shot in the first video. In order to reduce resource usage and the complexity of the subsequent shot similarity calculation, the first parsed video frames corresponding to each shot in the first video may be filtered, so that a first target video frame corresponding to each shot in the first video can be obtained.
It should be noted that "first" and "second" in the first target video frame and the second target video frame are only for distinguishing the parsed video frames, and are not for limiting the order of the target video frames.
Illustratively, the first parsing video frame corresponding to a shot in the first video includes: video frame 1, video frame 2, and video frame 3. The video frame 2 is a video frame with a large degree of blur. Then, the first target video frame corresponding to one shot in the first video includes: video frame 1 and video frame 3.
Taking the following three manners as examples, an optional manner of how to filter the first parsed video frame corresponding to each shot in the first video to obtain the first target video frame corresponding to each shot in the first video is described:
1. Filtering first parsed video frames by brightness value (removing overly dark frames)
Step one, for each first parsed video frame corresponding to each shot in the first video, calculating the brightness value of the current first parsed video frame based on the preset weight values defined for the three color channels.
Color channels are understood to be RGB (red, green, blue) channels. The current first parsed video frame may be understood as the first parsed video frame at the current time.
Specifically, the weight values of the three color channels are preset. For each first parsed video frame corresponding to each shot in the first video, the current first parsed video frame may be determined according to the current time. After the current first parsed video frame is determined, it may be subjected to channel splitting, for example by decomposing it with OpenCV or a similar tool. The brightness value of the current first parsed video frame is then determined from the decomposed channel maps and the weight values of the three color channels.
Optionally, the shading value of the current first parsed video frame may be calculated according to the following formula:
y₁ = (1 / (255 · m · n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} [ ω_r · I_r(i, j) + ω_g · I_g(i, j) + ω_b · I_b(i, j) ]

wherein y₁ represents the brightness value of the current first parsed video frame, i represents the i-th row and j the j-th column of the current first parsed video frame, m represents its height and n its width, I_r represents the red channel map of the current first parsed video frame and ω_r the weight corresponding to the red channel map, I_g represents the green channel map and ω_g its weight, and I_b represents the blue channel map and ω_b its weight.
Specifically, weighted summation is performed on three color channels of each pixel point of the current first analytic video frame, and the sum values of all the pixel points are accumulated to obtain the light and shade sum value of the current first analytic video frame. Furthermore, in order to facilitate comparison of the brightness parameters of different initial video frames, the sum of brightness and darkness is divided by 255 to obtain a normalized value, and then divided by m × n to obtain an average value, and the average value is used as the brightness value of the current first analysis video frame.
The range of the brightness value is [0, 1]. The larger the brightness value, the brighter the current first parsed video frame; the smaller the value, the darker the frame. The weight values of the three color channels may be set according to actual requirements, for example: ω_r = 0.2, ω_g = 0.7, ω_b = 0.1, and so on.
And step two, if the brightness value of the current first analysis video frame reaches the brightness threshold value of the preset frame, taking the current video frame as a first target video frame.
Wherein the shading threshold may be a threshold set according to a sharpness requirement for the first target video frame.
Specifically, a brightness threshold of the frame is set in advance. If the shading value of the current first parsed video frame reaches the frame shading threshold, the current video frame may be taken as the first target video frame.
Illustratively, the light and shade threshold of the frame is set to 0.8 in advance. The light and dark value of the current first parsed video frame is 0.84, then the current first parsed video frame is the first target video frame.
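By way of illustration, the brightness-based filtering could be sketched as below; the channel weights and the threshold of 0.8 follow the example values mentioned above, and the function names and BGR channel order are assumptions of the sketch.

```python
import numpy as np

def brightness_value(frame_bgr, w_r=0.2, w_g=0.7, w_b=0.1):
    """Normalized brightness value y1 in [0, 1]; larger means a brighter frame."""
    b, g, r = [frame_bgr[..., c].astype(np.float64) for c in (0, 1, 2)]
    m, n = frame_bgr.shape[:2]
    weighted = w_r * r + w_g * g + w_b * b            # per-pixel weighted channel sum
    return float(weighted.sum() / 255.0 / (m * n))    # normalize by 255 and frame size

def keep_bright_frames(parsed_frames, brightness_threshold=0.8):
    """Keep only frames whose brightness value reaches the preset frame threshold."""
    return [f for f in parsed_frames if brightness_value(f) >= brightness_threshold]
```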
2. Filtering first parsed video frames by blur value (keeping frames with a higher blur value, i.e., sharper frames)
Step one, for each first parsed video frame corresponding to each shot in the first video, performing graying processing on the current first parsed video frame, and calculating a blur value of the grayed current first parsed video frame based on preset horizontal and vertical gradient values.

The horizontal gradient value is the gradient of each pixel point in the first parsed video frame in the horizontal direction, and the vertical gradient value is the gradient of each pixel point in the first parsed video frame in the vertical direction.

Specifically, the horizontal and vertical gradient values are set in advance. For each first parsed video frame corresponding to each shot in the first video, graying processing may be performed on the current first parsed video frame to obtain a grayed current first parsed video frame. The blur value of the current first parsed video frame can then be obtained from the grayed current first parsed video frame together with the horizontal and vertical gradient values.
Optionally, the blur value of the current first parsed video frame may be determined according to the following formula:
y₂ = (1 / (m · n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} sqrt( (Δx_ij · I_gray(i, j))² + (Δy_ij · I_gray(i, j))² )

or,

y₂ = (1 / (m · n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} ( |Δx_ij · I_gray(i, j)| + |Δy_ij · I_gray(i, j)| )

wherein y₂ represents the blur value of the current first parsed video frame, i represents the i-th row and j the j-th column of the current first parsed video frame, m represents its height and n its width, I_gray represents the gray-scale map of the current first parsed video frame, Δx_ij represents the horizontal gradient of the pixel point in the i-th row and j-th column of the current first parsed video frame, and Δy_ij represents the vertical gradient of that pixel point.
Specifically, the gray value of each pixel point of the current first parsed video frame may be multiplied by the horizontal gradient and the vertical gradient of that pixel point, respectively, to obtain a horizontal gradient gray value Δx_ij · I_gray(i, j) and a vertical gradient gray value Δy_ij · I_gray(i, j). The horizontal and vertical gradient gray values may then be squared, summed, and square-rooted to determine the contrast value of the current pixel point; alternatively, their absolute values may be taken and summed to determine the contrast value. Both methods serve to make the contrast value non-negative. After the contrast value corresponding to each pixel point is determined, the sum of all contrast values may be divided by the size (height × width) of the current first parsed video frame to obtain an average value, which is the blur value of the current first parsed video frame.
It should be noted that the larger the blur value, the clearer the current first parsed video frame; the smaller the value, the more blurred the frame.
Step two, if the blur value of the current first parsed video frame reaches the preset frame blur threshold, taking the current first parsed video frame as the first target video frame.
Wherein the blur threshold may be a threshold set according to a blur level requirement for the first target video frame.
Specifically, a frame blur threshold is set in advance. If the blur value of the current first parsed video frame reaches the blur threshold, the current first parsed video frame is taken as the first target video frame.
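A rough sketch of the blur-value filtering might look as follows; using Sobel operators for the horizontal and vertical gradients is an assumption (the embodiment only speaks of preset gradient values), and the threshold is an arbitrary example.

```python
import cv2
import numpy as np

def blur_value(frame_bgr):
    """Gradient-based blur value y2; larger values correspond to sharper frames."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    # Sobel operators stand in for the "preset horizontal and vertical gradient values".
    dx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    dy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    m, n = gray.shape
    contrast = np.sqrt((dx * gray) ** 2 + (dy * gray) ** 2)   # per-pixel contrast value
    return float(contrast.sum() / (m * n))

def keep_sharp_frames(parsed_frames, blur_threshold=1e4):
    """Keep only frames whose blur value reaches the preset frame blur threshold (example value)."""
    return [f for f in parsed_frames if blur_value(f) >= blur_threshold]
```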
3. Filtering first parsed video frames by image equalization value (uneven gray levels)
Step one, for each first parsed video frame corresponding to each shot in the first video, performing histogram equalization processing on the current first parsed video frame and determining the reference gray level corresponding to each pixel point in the current first parsed video frame.
The reference gray level may be a gray level of each pixel point after histogram equalization processing.
Specifically, for each first parsed video frame corresponding to each shot in the first video, the current first parsed video frame may be determined according to the current time. After the current first parsed video frame is determined, histogram equalization processing is performed on it so that its gray-level distribution becomes approximately uniform, and the reference gray level corresponding to each pixel point in the current first parsed video frame can then be obtained.
And secondly, determining at least one target gray level corresponding to the current first analysis video frame based on each reference gray level corresponding to the current first analysis video frame and a preset proportion range.
Wherein the preset proportion range may be a preset percentage. The target gray level may be a reference gray level that meets a preset scale range.
Specifically, a proportion range (e.g., 5%) is preset. For each reference gray level corresponding to the current first parsed video frame, if the current reference gray level falls within the proportion range, it is taken as a target gray level. The current reference gray level may be the reference gray level corresponding to the pixel point at the current time.
And step three, calculating the sum of the gray levels of all the targets to obtain an image balance value corresponding to the current first analysis video frame.
Wherein the image equalization value may be a frequency value corresponding to each gray level.
Specifically, after each target gray level is determined, the target gray levels may be summed, and then, an image equalization value corresponding to the current first analysis video frame may be obtained.
Optionally, the image equalization value corresponding to the current first parsing video frame is determined according to the following formula:
y₃ = top_per(norm_hist(I_gray))

wherein y₃ represents the image equalization value corresponding to the current first parsed video frame, I_gray represents the gray-scale map of the current first parsed video frame, norm_hist(I_gray) represents the reference gray levels corresponding to the pixel points in the current first parsed video frame after histogram equalization, and top_per represents the preset proportion range.
Specifically, the reference gray levels corresponding to the current first parsed video frame are sorted from large to small, and the gray levels falling within the preset proportion range at the front of the sequence are summed to obtain the image equalization value. A smaller image equalization value indicates a more uniform current first parsed video frame, while a larger value indicates a less uniform frame.
It should be noted that the preset proportion range may be set according to requirements, for example top_per = 5%, and the like. The larger the sum of the gray levels within the preset proportion range, the more concentrated the gray levels of the current first parsed video frame are, and the more uneven the frame is.
Step four, if the image equalization value corresponding to the current first parsed video frame reaches the preset frame equalization degree threshold, taking the current first parsed video frame as the first target video frame.
Wherein, the threshold value of the degree of balance can be a threshold value set according to the uniform requirement of the gray scale of the first target video frame.
Specifically, an equalization degree threshold is preset. If the image equalization value of the current first parsed video frame reaches the equalization degree threshold, the current first parsed video frame is taken as the first target video frame.
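As an illustrative sketch only, the image-equalization filtering could be implemented along the following lines; the reading of "sum of the gray levels within the top proportion range" and the threshold value are assumptions of this sketch.

```python
import cv2
import numpy as np

def image_equalization_value(frame_bgr, top_per=0.05):
    """Sum of the largest top_per fraction of equalized gray levels (one reading of y3)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    equalized = cv2.equalizeHist(gray)                 # histogram equalization of the frame
    levels = np.sort(equalized.ravel())[::-1]          # reference gray levels, descending
    top_count = max(1, int(levels.size * top_per))     # gray levels within the proportion range
    return float(levels[:top_count].sum())             # sum of the target gray levels

def keep_equalized_frames(parsed_frames, balance_threshold=1e6):
    """Keep only frames whose image equalization value reaches the preset threshold (example value)."""
    return [f for f in parsed_frames if image_equalization_value(f) >= balance_threshold]
```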
S240, filtering the second parsed video frames corresponding to each shot in the second video to obtain a second target video frame corresponding to each shot in the second video.
The second target video frame may be a to-be-processed video frame corresponding to each shot in the second video, and may be used to calculate a shot similarity between each shot in the first video and each shot in the second video.
Specifically, the second parsed video frames corresponding to each shot in the second video are filtered. And obtaining a second target video frame corresponding to each shot in the second video.
It should be noted that the method for filtering the second parsed video frame corresponding to each shot in the second video may be the same as the method for filtering the first parsed video frame corresponding to each shot in the first video.
And S250, determining the shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame.
Specifically, according to the first target video frame and the second target video frame, the shot similarity between each shot in the first video and each shot in the second video can be calculated, and further, the shot similarity between each shot in the first video and each shot in the second video is determined.
Optionally, the following steps are described to determine the shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame:
the method comprises the steps of firstly, determining a first shot space-time feature vector corresponding to each shot in a first video based on a first target video frame corresponding to each shot in the first video, and determining a second shot space-time feature vector corresponding to each shot in a second video based on a second target video frame corresponding to each shot in the second video.
Wherein the first shot spatiotemporal feature vector may be a feature vector for each shot in the first video. The second shot spatiotemporal feature vector may be a feature vector for each shot in the second video.
Specifically, based on a video frame feature extraction method, feature extraction operation may be performed on a first target video frame corresponding to each shot in a first video, and feature extraction operation may be performed on a second target video frame corresponding to each shot in a second video. Further, a first shot space-time feature vector corresponding to each shot in the first video can be obtained, and a second shot space-time feature vector corresponding to each shot in the second video can be determined.
Optionally, the following may be introduced to determine a first shot space-time feature vector corresponding to each shot in the first video based on a first target video frame corresponding to each shot in the first video, and determine a second shot space-time feature vector corresponding to each shot in the second video based on a second target video frame corresponding to each shot in the second video:
inputting a first target video frame corresponding to each shot in a first video into a pre-trained feature determination model, and determining a first shot space-time feature vector corresponding to each shot in the first video; and inputting a second target video frame corresponding to each shot in the second video into the feature determination model, and determining a second shot space-time feature vector corresponding to each shot in the second video.
The feature determination model can be used for determining a shot space-time feature vector corresponding to each shot in the video.
Specifically, a first target video frame corresponding to each shot in the first video is input to a pre-trained feature determination model, so that the feature determination model performs feature extraction on each shot in the first video, and further, a first shot space-time feature vector corresponding to each shot in the first video can be obtained. And inputting the second target video frame corresponding to each shot in the second video into the feature determination model so that the feature determination model performs feature extraction on each shot in the second video, and further, a second shot space-time feature vector corresponding to each shot in the second video can be obtained.
Alternatively, the feature determination model may be obtained by:
and inputting the video frames corresponding to the shot as a unit into a pre-constructed deep learning network model to obtain the actual output space-time characteristic vector corresponding to the shot. And taking the standard space-time feature vector corresponding to the shot as an expected output space-time feature vector. And adjusting the initial values of the grid parameters of the deep learning network model according to the actual output space-time characteristic vector and the expected output space-time characteristic vector of the deep learning network model. Further, a feature determination model can be obtained. The pre-constructed deep learning network model is resnet 50. Wherein the second last feature layer is a category layer.
It should be noted that, the video frames in the lens may be extracted based on a preset number of frame intervals (e.g., extracting 1 frame every 5 frames), and the extracted video frames are used as sample data of the training feature determination model. In this example, the number of frame intervals is not limited, such as 1 frame, 2 frames, or 3 frames.
Optionally, the lens space-time feature vector corresponding to the lens is obtained according to the following formula:
embed = (1 / m) · Σ_{i=1}^{m} CNN(f_i)

wherein embed represents the shot space-time feature vector corresponding to the shot, CNN(·) is the feature extraction model, f_i is the i-th video frame in the shot, and m represents the number of video frames in the shot.
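For illustration, the shot space-time feature vector could be computed roughly as follows with a ResNet-50 backbone whose classification layer is replaced by an identity mapping; the preprocessing, the sampling step of 5 frames, and the use of untrained weights are assumptions of this sketch rather than details fixed by the embodiment.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumed setup: ResNet-50 with the final classification layer removed, so the
# penultimate 2048-d features serve as per-frame features.
backbone = models.resnet50()
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),          # frames assumed to be H x W x 3 uint8 arrays (RGB)
    T.Resize((224, 224)),
    T.ToTensor(),
])

@torch.no_grad()
def shot_spacetime_vector(shot_frames, sample_step=5):
    """embed = mean of CNN(frame) over frames sampled every sample_step frames from the shot."""
    sampled = shot_frames[::sample_step]
    feats = [backbone(preprocess(f).unsqueeze(0)).squeeze(0) for f in sampled]
    return torch.stack(feats).mean(dim=0)
```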
And secondly, determining the shot similarity between each shot in the first video and each shot in the second video based on the first shot space-time feature vector corresponding to each shot in the first video and the second shot space-time feature vector corresponding to each shot in the second video.
Specifically, the shot similarity between each shot in the first video and each shot in the second video is calculated according to a first shot space-time feature vector corresponding to each shot in the first video and a second shot space-time feature vector corresponding to each shot in the second video. Further, shot similarity between each shot in the first video and each shot in the second video is determined.
Optionally, determining a shot similarity between each shot in the first video and each shot in the second video based on a first shot space-time feature vector corresponding to each shot in the first video and a second shot space-time feature vector corresponding to each shot in the second video may be described by:
step one, calculating the cosine distance between a first lens space-time feature vector corresponding to each lens in a first video and a second lens space-time feature vector corresponding to each lens in a second video.
The cosine distance between the first shot space-time feature vector corresponding to each shot in the first video and the second shot space-time feature vector corresponding to each shot in the second video can be calculated according to the following formula:
sim(X, Y) = (X · Y) / (‖X‖ · ‖Y‖) = Σ_{i=1}^{d} x_i · y_i / ( sqrt(Σ_{i=1}^{d} x_i²) · sqrt(Σ_{i=1}^{d} y_i²) )

wherein sim(X, Y) represents the cosine distance between the first shot space-time feature vector corresponding to a shot in the first video and the second shot space-time feature vector corresponding to a shot in the second video, X denotes the first shot space-time feature vector corresponding to a shot in the first video, and Y denotes the second shot space-time feature vector corresponding to a shot in the second video.
And step two, determining the shot similarity between each shot in the first video and each shot in the second video according to the cosine distance.
Specifically, after the cosine distance between the first shot space-time feature vector corresponding to each shot in the first video and the second shot space-time feature vector corresponding to each shot in the second video is calculated, the calculated cosine distance can be used as the shot similarity between each shot in the first video and each shot in the second video.
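A minimal sketch of the cosine-distance computation between shot vectors (NumPy, with assumed helper names):

```python
import numpy as np

def cosine_shot_similarity(vec_x, vec_y):
    """Cosine distance between a first-video shot vector X and a second-video shot vector Y."""
    denom = np.linalg.norm(vec_x) * np.linalg.norm(vec_y) + 1e-12
    return float(np.dot(vec_x, vec_y) / denom)

def shot_similarity_matrix(first_video_vectors, second_video_vectors):
    """Shot similarity between every shot in the first video and every shot in the second video."""
    return np.array([[cosine_shot_similarity(x, y) for y in second_video_vectors]
                     for x in first_video_vectors])
```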
And S260, determining the video similarity between the first video and the second video based on the shot similarity.
According to the technical scheme of the embodiment, the first analysis video frame corresponding to each shot in the first video is filtered to obtain the first target video frame corresponding to each shot in the first video. And filtering a second analysis video frame corresponding to each lens in the second video to obtain a second target video frame corresponding to each lens in the second video. Based on the first target video frame and the second target video frame, the shot similarity between each shot in the first video and each shot in the second video is determined, and the technical effect of reducing the complexity of calculating the shot similarity is achieved.
EXAMPLE III
Fig. 3 is a schematic block diagram of a video similarity determination apparatus according to a third embodiment of the present invention, where the present invention provides a video similarity determination apparatus, including: a video acquisition module 310, a video parsing module 320, a shot similarity determination module 330, and a video similarity determination module 340.
The video obtaining module 310 is configured to obtain a first video and a second video with similarity to be determined; a video parsing module 320, configured to parse the first video and the second video, determine a first parsed video frame corresponding to each shot in the first video, and determine a second parsed video frame corresponding to each shot in the second video; a shot similarity determination module 330, configured to determine a shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video; the video similarity determination module 340 is configured to determine the video similarity between the first video and the second video based on the shot similarity.
According to the technical scheme of the embodiment, the first video and the second video with the similarity to be determined are obtained through the video obtaining module. The first video and the second video are analyzed through the video analysis module, a first analysis video frame corresponding to each lens in the first video is determined, and a second analysis video frame corresponding to each lens in the second video is determined. And determining the shot similarity between each shot in the first video and each shot in the second video based on the first analysis video frame corresponding to each shot in the first video and the second analysis video frame corresponding to each shot in the second video through a shot similarity determination module. The shot similarity between each shot in the first video and each shot in the second video is calculated for the videos by taking the shot as a unit, so that the shot similarity between each shot in the first video and each shot in the second video can be determined more quickly. The video similarity determining module determines the video similarity between the first video and the second video based on the shot similarity, so that the technical problem that the video similarity calculating efficiency is low in the method for determining the video similarity in the prior art is solved, the calculating efficiency of the video similarity is improved, and the technical effect of determining the video similarity more quickly is achieved.
Optionally, the video parsing module 320 is configured to parse the first video and determine the first parsed video frames included in the first video; and to calculate the histogram difference value of two adjacent first parsed video frames and determine, based on the histogram difference value, the first parsed video frame corresponding to each shot in the first video.
Optionally, the shot similarity determination module 330 includes: a first target video frame obtaining unit, a second target video frame obtaining unit, and a shot similarity determination unit. The first target video frame obtaining unit is configured to filter the first parsed video frames corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video; the second target video frame obtaining unit is configured to filter the second parsed video frames corresponding to each shot in the second video to obtain a second target video frame corresponding to each shot in the second video; and the shot similarity determination unit is configured to determine the shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame.

Optionally, the shot similarity determination unit includes: a shot space-time feature vector determining subunit and a shot similarity determining subunit. The shot space-time feature vector determining subunit is configured to determine a first shot space-time feature vector corresponding to each shot in the first video based on the first target video frame corresponding to each shot in the first video, and to determine a second shot space-time feature vector corresponding to each shot in the second video based on the second target video frame corresponding to each shot in the second video; and the shot similarity determining subunit is configured to determine the shot similarity between each shot in the first video and each shot in the second video based on the first shot space-time feature vector corresponding to each shot in the first video and the second shot space-time feature vector corresponding to each shot in the second video.
Optionally, the shot space-time feature vector determining subunit is configured to input a first target video frame corresponding to each shot in the first video to a pre-trained feature determination model, and determine a first shot space-time feature vector corresponding to each shot in the first video; and inputting a second target video frame corresponding to each shot in the second video into the feature determination model, and determining a second shot space-time feature vector corresponding to each shot in the second video.
Optionally, the shot similarity determining subunit is configured to calculate a cosine distance between a first shot space-time feature vector corresponding to each shot in the first video and a second shot space-time feature vector corresponding to each shot in the second video; and determining the shot similarity between each shot in the first video and each shot in the second video according to the cosine distance.
Optionally, the first target video frame obtaining unit is configured to calculate, for each first parsed video frame corresponding to each shot in the first video, the brightness value of the current first parsed video frame based on the preset weight values defined for the three color channels; and if the brightness value of the current first parsed video frame reaches the preset frame brightness threshold, to take the current video frame as the first target video frame.

Optionally, the first target video frame obtaining unit is further configured to perform, for each first parsed video frame corresponding to each shot in the first video, graying processing on the current first parsed video frame and calculate a blur value of the grayed current first parsed video frame based on preset horizontal and vertical gradient values; and if the blur value of the current first parsed video frame reaches the preset frame blur threshold, to take the current first parsed video frame as the first target video frame.

Optionally, the first target video frame obtaining unit is further configured to perform, for each first parsed video frame corresponding to each shot in the first video, histogram equalization processing on the current first parsed video frame and determine the reference gray level corresponding to each pixel point in the current first parsed video frame; to determine at least one target gray level corresponding to the current first parsed video frame based on each reference gray level corresponding to the current first parsed video frame and a preset proportion range; to calculate the sum of the target gray levels to obtain an image equalization value corresponding to the current first parsed video frame; and if the image equalization value corresponding to the current first parsed video frame reaches the preset frame equalization degree threshold, to take the current first parsed video frame as the first target video frame.
The apparatus can execute the video similarity determination method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the execution of that method.
It should be noted that the units and modules included in the video similarity determination apparatus are divided only according to functional logic and are not limited to the above division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for the convenience of distinguishing them from one another and are not intended to limit the protection scope of the embodiments of the present invention.
Example four
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Fig. 4 illustrates a block diagram of an exemplary electronic device 12 suitable for implementing any of the embodiments of the present invention. The electronic device 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention. The device 12 is typically an electronic device that undertakes configuration information processing.
As shown in FIG. 4, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 that couples the various components (including the memory 28 and the processing unit 16).
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer-readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in Fig. 4 and commonly referred to as a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product 40 having a set of program modules 42 configured to carry out the functions of embodiments of the invention. Program product 40 may be stored, for example, in memory 28; such program modules 42 include, but are not limited to, one or more application programs, other program modules, and program data, each of which examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a mouse, a camera, a display, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, electronic device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of electronic device 12 via the bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, disk array (RAID) systems, tape drives, and data backup storage systems.
The processor 16 executes various functional applications and data processing by running a program stored in the memory 28, for example, implementing the video similarity determination method provided by the above-described embodiment of the present invention, the method including:
acquiring a first video and a second video with similarity to be determined; analyzing the first video and the second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video; determining shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video; and determining the video similarity of the first video and the second video based on the shot similarity.
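How the shot similarities are folded into a single video similarity is not pinned down in this passage. The sketch below shows one common aggregation over a shot similarity matrix (bidirectional best-match averaging), included only as an assumption for illustration.

```python
import numpy as np

def aggregate_video_similarity(shot_similarity_matrix) -> float:
    """shot_similarity_matrix[i, j]: similarity between shot i of the first video
    and shot j of the second video. Bidirectional best-match averaging is an
    assumed aggregation, not the one fixed by the patent."""
    m = np.asarray(shot_similarity_matrix, dtype=float)
    forward = m.max(axis=1).mean()    # each first-video shot vs. its best match
    backward = m.max(axis=0).mean()   # each second-video shot vs. its best match
    return float((forward + backward) / 2.0)
```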
Of course, those skilled in the art will appreciate that the processor may also implement the video similarity determination method provided in any embodiment of the present invention.
Example five
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements, for example, the video similarity determination method provided in the foregoing embodiments of the present invention, the method including:
acquiring a first video and a second video with similarity to be determined;
analyzing the first video and the second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video;
determining shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video;
and determining the video similarity of the first video and the second video based on the shot similarity.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A video similarity determination method is characterized by comprising the following steps:
acquiring a first video and a second video with similarity to be determined;
analyzing the first video and the second video, determining a first analyzed video frame corresponding to each shot in the first video, and determining a second analyzed video frame corresponding to each shot in the second video;
determining shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video;
and determining the video similarity of the first video and the second video based on the shot similarity.
2. The method of claim 1, wherein parsing the first video and determining a first parsed video frame corresponding to each shot in the first video comprises:
analyzing a first video, and determining a first analyzed video frame included in the first video;
calculating histogram difference values of two adjacent first analysis video frames, and determining a first analysis video frame corresponding to each shot in the first video based on the histogram difference values.
3. The method of claim 1, wherein determining a shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video comprises:
filtering a first analysis video frame corresponding to each shot in the first video to obtain a first target video frame corresponding to each shot in the first video;
filtering a second analysis video frame corresponding to each shot in the second video to obtain a second target video frame corresponding to each shot in the second video;
determining shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame.
4. The method of claim 3, wherein the determining a shot similarity between each shot in the first video and each shot in the second video based on the first target video frame and the second target video frame comprises:
determining a first shot space-time feature vector corresponding to each shot in the first video based on a first target video frame corresponding to each shot in the first video, and determining a second shot space-time feature vector corresponding to each shot in the second video based on a second target video frame corresponding to each shot in the second video;
determining shot similarity between each shot in the first video and each shot in the second video based on a first shot spatio-temporal feature vector corresponding to each shot in the first video and a second shot spatio-temporal feature vector corresponding to each shot in the second video.
5. The method of claim 4, wherein determining a first shot spatio-temporal feature vector for each shot in the first video based on a first target video frame for each shot in the first video, and determining a second shot spatio-temporal feature vector for each shot in the second video based on a second target video frame for each shot in the second video comprises:
inputting a first target video frame corresponding to each shot in the first video to a pre-trained feature determination model, and determining a first shot space-time feature vector corresponding to each shot in the first video;
and inputting a second target video frame corresponding to each shot in the second video into the feature determination model, and determining a second shot space-time feature vector corresponding to each shot in the second video.
6. The method of claim 4, wherein determining a shot similarity between each shot in the first video and each shot in the second video based on a first shot spatio-temporal feature vector corresponding to each shot in the first video and a second shot spatio-temporal feature vector corresponding to each shot in the second video comprises:
calculating a cosine distance between a first shot space-time feature vector corresponding to each shot in the first video and a second shot space-time feature vector corresponding to each shot in the second video;
and determining the shot similarity between each shot in the first video and each shot in the second video according to the cosine distance.
7. The method according to claim 3, wherein the filtering the first parsed video frame corresponding to each shot in the first video to obtain the first target video frame corresponding to each shot in the first video comprises:
calculating, for each first analysis video frame corresponding to each shot in the first video, a brightness value of the current first analysis video frame based on preset weight values of the three color channels;
and if the brightness value of the current first analysis video frame reaches a preset frame brightness threshold, taking the current first analysis video frame as the first target video frame.
8. The method according to claim 3, wherein the filtering the first parsed video frame corresponding to each shot in the first video to obtain the first target video frame corresponding to each shot in the first video, further comprises:
for each first analysis video frame corresponding to each shot in the first video, performing graying processing on the current first analysis video frame, and calculating a blur value of the grayed current first analysis video frame based on a preset horizontal gradient value and a preset vertical gradient value;
and if the blur value of the current first analysis video frame reaches a preset frame blur threshold, taking the current first analysis video frame as the first target video frame.
9. The method according to claim 3, wherein the filtering the first parsed video frame corresponding to each shot in the first video to obtain the first target video frame corresponding to each shot in the first video, further comprises:
for each first analysis video frame corresponding to each shot in the first video, performing histogram equalization processing on the current first analysis video frame, and determining a reference gray level corresponding to each pixel point in the current first analysis video frame;
determining at least one target gray level corresponding to the current first analysis video frame based on each reference gray level corresponding to the current first analysis video frame and a preset proportion range;
calculating the sum of the target gray levels to obtain an image balance value corresponding to the current first analysis video frame;
and if the image balance value corresponding to the current first analysis video frame reaches a preset frame balance degree threshold, taking the current first analysis video frame as the first target video frame.
10. A video similarity determination apparatus, comprising:
the video acquisition module is used for acquiring a first video and a second video with similarity to be determined;
the video analysis module is used for analyzing the first video and the second video, determining a first analysis video frame corresponding to each shot in the first video, and determining a second analysis video frame corresponding to each shot in the second video;
a shot similarity determination module, configured to determine a shot similarity between each shot in the first video and each shot in the second video based on a first parsed video frame corresponding to each shot in the first video and a second parsed video frame corresponding to each shot in the second video;
and the video similarity determining module is used for determining the video similarity of the first video and the second video based on the shot similarity.
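Claim 2 above determines the video frames belonging to each shot from histogram differences between adjacent parsed frames. The sketch below is a minimal illustration of that idea using OpenCV grayscale histograms; the histogram type, bin count, and boundary threshold are not specified in the claim and are placeholders here.

```python
import cv2
import numpy as np

SHOT_BOUNDARY_THRESHOLD = 0.5   # hypothetical preset histogram-difference threshold

def _normalized_histogram(frame_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).ravel()
    return hist / (hist.sum() + 1e-12)

def shot_boundaries(frames) -> list:
    """Indices at which a new shot is assumed to start, based on the
    histogram difference of two adjacent parsed video frames (claim 2)."""
    boundaries = [0]
    prev_hist = _normalized_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = _normalized_histogram(frames[i])
        if np.abs(hist - prev_hist).sum() > SHOT_BOUNDARY_THRESHOLD:
            boundaries.append(i)
        prev_hist = hist
    return boundaries
```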
CN202110757741.4A 2021-07-05 2021-07-05 Video similarity determination method and device, electronic equipment and storage medium Pending CN113486788A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110757741.4A CN113486788A (en) 2021-07-05 2021-07-05 Video similarity determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113486788A true CN113486788A (en) 2021-10-08

Family

ID=77940156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110757741.4A Pending CN113486788A (en) 2021-07-05 2021-07-05 Video similarity determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113486788A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080091411A (en) * 2007-04-08 2008-10-13 (주)코인미디어 랩 Method and system for video similarity detection with a video dna
CN103177099A (en) * 2013-03-20 2013-06-26 深圳先进技术研究院 Video comparison method and video comparison system
CN107590419A (en) * 2016-07-07 2018-01-16 北京新岸线网络技术有限公司 Camera lens extraction method of key frame and device in video analysis
KR20180099126A (en) * 2017-02-28 2018-09-05 한림대학교 산학협력단 System and method for searching similarity of multimedia files
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium
CN110175591A (en) * 2019-05-31 2019-08-27 中科软科技股份有限公司 A kind of method and system obtaining video similarity
CN110232357A (en) * 2019-06-17 2019-09-13 深圳航天科技创新研究院 A kind of video lens dividing method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WU Shaobo et al.: "数字音视频技术及应用 [Digital Audio and Video Technology and Applications]", vol. 2, 31 March 2016, Harbin Institute of Technology Press, pages 209-210 *
ZHANG Shanwen et al.: "图像模式识别 [Image Pattern Recognition]", vol. 1, 31 May 2020, Xidian University Press, pages 30-32 *
WANG Wen et al.: "现代图书馆建设 [Modern Library Construction]", vol. 1, 31 October 2012, Shenyang Publishing House, pages 145-148 *
CHENG Deqiang: "流媒体覆盖网络及其关键技术研究 [Research on Streaming Media Overlay Networks and Their Key Technologies]", vol. 1, 31 October 2008, China University of Mining and Technology Press, pages 124-126 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241223A (en) * 2021-12-17 2022-03-25 北京达佳互联信息技术有限公司 Video similarity determination method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
US10963993B2 (en) Image noise intensity estimation method, image noise intensity estimation device, and image recognition device
CN111797653B (en) Image labeling method and device based on high-dimensional image
CN107633485B (en) Face brightness adjusting method, device, equipment and storage medium
CN109685045B (en) Moving target video tracking method and system
CN110913243B (en) Video auditing method, device and equipment
Hu et al. A multi-stage underwater image aesthetic enhancement algorithm based on a generative adversarial network
CN110111347B (en) Image sign extraction method, device and storage medium
CN115131714A (en) Intelligent detection and analysis method and system for video image
CN114049499A (en) Target object detection method, apparatus and storage medium for continuous contour
CN111784658B (en) Quality analysis method and system for face image
CN109348287B (en) Video abstract generation method and device, storage medium and electronic equipment
CN111192241A (en) Quality evaluation method and device of face image and computer storage medium
CN112686247A (en) Identification card number detection method and device, readable storage medium and terminal
CN113255549B (en) Intelligent recognition method and system for behavior state of wolf-swarm hunting
CN114399487A (en) Video definition and brightness detection method based on data annotation learning mechanism
CN113486788A (en) Video similarity determination method and device, electronic equipment and storage medium
CN113743378A (en) Fire monitoring method and device based on video
CN111708907B (en) Target person query method, device, equipment and storage medium
CN112926676B (en) False target identification method and device and computer equipment
CN112532938B (en) Video monitoring system based on big data technology
CN112085683B (en) Depth map credibility detection method in saliency detection
CN115311680A (en) Human body image quality detection method and device, electronic equipment and storage medium
CN113269205A (en) Video key frame extraction method and device, electronic equipment and storage medium
CN113762014B (en) Method and device for determining similar videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination