CN117112831A - Massive video comparison method, device and equipment based on video event information network

Info

Publication number
CN117112831A
Authority
CN
China
Prior art keywords
video event
video
event
content
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310692997.0A
Other languages
Chinese (zh)
Inventor
汪昭辰 (Wang Zhaochen)
刘世章 (Liu Shizhang)
王全宁 (Wang Quanning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Chenyuan Technology Information Co ltd
Original Assignee
Qingdao Chenyuan Technology Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Chenyuan Technology Information Co ltd
Priority to CN202310692997.0A
Publication of CN117112831A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval of video data
    • G06F 16/73 Querying
    • G06F 16/732 Query formulation
    • G06F 16/74 Browsing; Visualisation therefor
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a massive video comparison method, device and equipment based on a video event information network, and relates to the field of video processing. The method performs fast similarity comparison between a specified video event and the resource video events of a massive video repository, finds the set of resource video events similar to the specified video event, and improves both the accuracy and the efficiency of video event similarity comparison.

Description

Massive video comparison method, device and equipment based on video event information network
Technical Field
The present invention relates to the field of video processing, and in particular, to a method, apparatus, and device for comparing massive videos based on a video event information network.
Background
With the rise of self-media, forms of video infringement have multiplied: copyrighted images or videos are reposted outright; long videos are cut into many short clips for distribution; the opening and closing credits of an original work are removed and its core footage is clipped or spliced into a new video for distribution; original videos are reworked into derivative creations; and pictures are mosaicked, scaled, stretched to a different aspect ratio, or re-encoded at a different resolution. These practices seriously harm the legitimate rights of copyright holders and hinder the development of the cultural industry.
In the prior art, similar-video comparison mostly relies on digital image watermarking, alone or combined with machine learning, and copyright information extracted by a watermark extraction algorithm serves as the primary evidence of image ownership. However, watermarks are vulnerable to representation, robustness and interpretation attacks that destroy part or even all of the embedded watermark information, making extraction difficult and undermining copyright protection. Approaches that combine machine learning with digital watermarking depend on training over a sample library, are costly and energy-intensive, and struggle to keep up with complex and ever-changing video content.
A new video content comparison method is therefore needed.
Disclosure of Invention
In view of the above, the invention aims to provide a massive video comparison method, device and equipment based on a video event information network, which directly address the low comparison speed and low accuracy of existing massive video comparison.
Based on the above object, in a first aspect, the invention provides a massive video comparison method based on a video event information network. A video event is the set of all content frames in a shot; a content frame is a frame that represents the shot's content and comprises a first frame, a last frame and N intermediate frames, where N is a natural number; an intermediate frame is selected by computing, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame, and keeping the frame when the difference rate exceeds a preset threshold. The video event information network is a forest structure built over the video event information space from a set of multi-level trees; the video event information space is the multi-dimensional vector space in which the video event feature vectors lie, and a video event feature vector is computed from the feature matrices extracted from a set of content frames under the same coordinate system. The method comprises: acquiring a target video, and preprocessing and granulating it to obtain the corresponding video event sequence, which contains at least one target video event; traversing the root nodes of the video event information network according to the target video event's content frame count and video event feature vector, judging whether the currently traversed root node is a candidate root node, and if so, calculating the similarity rate between the target video event and the candidate video event corresponding to that candidate root node; judging from the similarity rate whether the target video event is similar to the candidate video event, and if so, adding the candidate video event to a similar video event set; and after the traversal of the root nodes is finished, outputting the similar video event set, which contains all video events in the video event information network that are similar to the target video event.
Optionally, before acquiring the target video, the method comprises: acquiring an original video from a video resource library; preprocessing and granulating the original video to obtain its video event sequence; and using the video events of the original video as root nodes or child nodes to construct the video event information network. Preprocessing comprises normalizing the original video to obtain a normalized video, then de-framing the normalized video to obtain a normalized video frame sequence. Granulation comprises performing shot segmentation and content frame extraction on the normalized video frame sequence to obtain a shot sequence and a content frame sequence, and deriving the video event sequence of the original video from the shot sequence and the content frame sequence.
Optionally, traversing the root nodes of the video event information network according to the target video event's content frame count and video event feature vector, and judging whether the currently traversed root node is a candidate root node, comprises: taking as the first judgment condition that the absolute difference between the content frame count of the target video event and that of the video event corresponding to the currently traversed root node is at most a first preset threshold; calculating the feature vector difference rate between the target video event and the video event corresponding to the currently traversed root node from their content frame feature vectors; taking as the second judgment condition that this feature vector difference rate is at most a second preset threshold; and determining the currently traversed root node to be a candidate root node when it satisfies both judgment conditions.
Optionally, when the currently traversed root node is determined to be a candidate root node, the method further comprises: comparing the video content of the target video event with that of the candidate video event, judging whether every content frame of the target video event has a corresponding matching frame in the candidate video event and whether the content frame difference rate between each content frame of the target video event and its matching frame is at most a third preset threshold; if so, calculating the similarity rate between the target video event and the candidate video event; if not, determining that the target video event and the candidate video event are dissimilar.
Optionally, judging whether every content frame of the target video event has a corresponding matching frame in the candidate video event comprises: calculating the content frame difference rate between the first content frame of the target video event and each content frame of the candidate video event, and judging whether any of these difference rates is at most the third preset threshold; if so, taking the first such content frame of the candidate video event as the first matching frame; starting from the first matching frame, sequentially taking as many content frames of the candidate video event as the target video event has, and using them in order as the matching frames of the target video event's content frames; if the number of content frames from the start frame onward in the candidate video event is smaller than the target video event's content frame count, determining that some content frame of the target video event has no matching frame; if it is greater than or equal to the target video event's content frame count, determining that every content frame of the target video event has a corresponding matching frame in the candidate video event.
Optionally, the feature data further includes the content frame feature matrix of a video event, and judging from the similarity rate whether the target video event is similar to the candidate video event comprises: judging whether the similarity rate between the target video event and the candidate video event is at least a fourth preset threshold; if so, determining that they are similar, and if not, that they are dissimilar. The similarity rate is obtained from the content frame feature matrix of the target video event and that of the candidate video event per the similarity rate formula, where q denotes the target video event and fcnt_q its content frame count, p denotes the candidate video event and fcnt_p its content frame count, SimEV(p, q) denotes the similarity rate between q and p, i indexes the content frames of the target video event, and Dis(i) is the difference rate between the i-th content frame of q and its corresponding matching frame, which belongs to the candidate video event.
Optionally, after the traversal of the root nodes is finished, the method further comprises: obtaining all child nodes associated with any root node similar to the target video event as similar child nodes, calculating the similarity rate between the target video event and the video event corresponding to each similar child node, and adding those video events and their similarity rates to the similar video event set.
Optionally, the method further comprises: when the total number of similar video events in the similar video event set is greater than 1, sorting the video events in the set by the similarity rate between the target video event and each similar video event.
Optionally, the method further comprises: for any target video event, acquiring the position, within its original video, of each video event in the similar video event set; and outputting the target video together with the positions, in their original videos, of the video events similar to the target video event.
In a second aspect, a massive video comparison device based on a video event information network is provided, the device comprising: a video processing module for acquiring a target video and preprocessing and granulating it to obtain the corresponding video event sequence, which contains at least one target video event; a search module for traversing the root nodes of the video event information network according to the target video event's content frame count and video event feature vector, judging whether the currently traversed root node is a candidate root node, and if so, calculating the similarity rate between the target video event and the candidate video event corresponding to that candidate root node; a comparison module for judging from the similarity rate whether the target video event is similar to the candidate video event, and if so, adding the candidate video event to a similar video event set; and a result output module for outputting the similar video event set, which contains all video events in the video event information network that are similar to the target video event. A video event is the set of all content frames in a shot; a content frame is a frame that represents the shot's content and comprises a first frame, a last frame and N intermediate frames, where N is a natural number; an intermediate frame is selected by computing, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame, and keeping the frame when the difference rate exceeds a preset threshold. The video event information network is a forest structure built over the video event information space from a set of multi-level trees; the video event information space is the multi-dimensional vector space in which the video event feature vectors lie, and a video event feature vector is computed from the feature matrices extracted from a set of content frames under the same coordinate system.
In a third aspect, an electronic device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor runs the computer program to implement the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which is stored a computer program to be executed by a processor to perform the method of the first aspect.
In general, the present invention has at least the following benefits:
According to the massive video comparison method based on the video event information network, the target video is preprocessed and granulated to obtain its corresponding video event sequence; the root nodes of the video event information network are traversed according to each target video event's content frame count and video event feature vector; whether the currently traversed root node is a candidate root node is judged, and if so, the similarity rate between the target video event and the candidate video event corresponding to that candidate root node is calculated; whether the target video event is similar to the candidate video event is judged from the similarity rate, and if so, the candidate video event is added to a similar video event set; and after the traversal of the root nodes is finished, the similar video event set is output. The method performs fast similarity comparison between a specified video event and the resource video events of massive video resources, finds the set of resource video events similar to the specified video event, and improves both the accuracy and the efficiency of video event similarity comparison.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. These drawings depict only some embodiments according to the disclosure and are therefore not to be considered limiting of its scope. The exemplary embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of an application environment of a massive video comparison method based on a video event information network according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an application environment of a massive video comparison method based on a video event information network according to another embodiment of the invention;
Fig. 3 is a schematic diagram of a granulation structure according to an embodiment of the invention;
Fig. 4 is a schematic diagram of content frame extraction according to an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a video event information space according to an embodiment of the invention;
Fig. 6 shows a tree structure creation process according to an embodiment of the invention;
Fig. 7 is a flowchart of the steps of a massive video comparison method based on a video event information network according to an embodiment of the invention;
Fig. 8 is a schematic structural diagram of a massive video comparison device based on a video event information network according to an embodiment of the invention;
Fig. 9 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the invention, the technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings. The described embodiments are plainly only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
In one aspect, an embodiment of the invention provides a massive video comparison method based on a video event information network. As an optional implementation, the method may be applied in, but is not limited to, the application environment shown in Fig. 1. The application environment comprises a terminal device 102 that interacts with the user, a network 104 and a server 106. The user 108 can interact with the terminal device 102, in which a video content comparison application based on a video event information network runs. The terminal device 102 comprises a human-machine interaction screen 1022, a first processor 1024 and a first memory 1026. The human-machine interaction screen 1022 displays images; the first processor 1024 acquires the target video and performs the massive video comparison method based on the video event information network; the first memory 1026 stores video.
In addition, the server 106 includes a database 1062 and a processing engine 1064, and the database 1062 is used to store images. The processing engine 1064 is configured to perform a massive video comparison method based on a video event information network.
In one or more embodiments, the massive video comparison method based on the video event information network of the invention can be applied in the application environment shown in Fig. 2. As shown in Fig. 2, the user 108 can interact with the user device 204, which comprises a second memory 206 and a second processor 208. In this embodiment the user device 204 may, but is not limited to, perform the massive video comparison method based on the video event information network by carrying out the operations otherwise performed by the terminal device 102.
Optionally, the terminal device 102 and the user device 204 include, but are not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, a vehicle-mounted electronic device, a wearable device, and the like. The network 104 may include, but is not limited to, a wireless network or a wired network, where the wireless network comprises WIFI and other networks enabling wireless communication, and the wired network may include, but is not limited to, a wide area network, a metropolitan area network or a local area network. The server 106 may be any hardware device capable of computation: a single server, a server cluster composed of multiple servers, or a cloud server. The above is merely an example and is not limited in any way in this embodiment.
In the related art, similar-video comparison mostly relies on digital image watermarking, alone or combined with machine learning, and copyright information extracted by a watermark extraction algorithm serves as the primary evidence of image ownership. However, watermarks are vulnerable to representation, robustness and interpretation attacks that destroy part or even all of the embedded watermark information, making extraction difficult and undermining copyright protection. Approaches that combine machine learning with digital watermarking depend on training over a sample library, are costly and energy-intensive, and struggle to keep up with complex and ever-changing video content.
To solve the above technical problems, as an optional implementation, an embodiment of the invention provides a massive video comparison method, device and equipment based on a video event information network.
In this embodiment, a video event is the set of all content frames in a shot; a content frame is a frame that represents the shot's content and comprises a first frame, a last frame and N intermediate frames, where N is a natural number; an intermediate frame is selected by computing, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame, and keeping the frame when the difference rate exceeds a preset threshold. The video event information network is a forest structure built over the video event information space from a set of multi-level trees; the video event information space is the multi-dimensional vector space in which the video event feature vectors lie, and a video event feature vector is computed from the feature matrices extracted from a set of content frames under the same coordinate system.
The content frame, the video event information space, and the video event information network of the present embodiment are described below.
In this embodiment, content frames are obtained by granulating the video. Fig. 3 shows a schematic diagram of the granulation structure. Referring to Fig. 3, the granulation structure of a video comprises the video, its frame sequence, its shots and its content frames: the frame sequence is all the frames carrying the video content; a shot is the continuous frame segment captured by a camera between one start and stop, and is the basic unit of video composition; and a content frame is a frame representing a shot's content.
In this embodiment, granulation means segmenting a video into shots to obtain its granulation structure. The principle is as follows: video content consists of a continuous frame sequence, which can be divided into groups according to the continuity of the content; shot detection over the frame sequence makes each group of continuous frames a shot, and the shot sequence contains at least one shot. By analyzing content differences within a shot, a small number of frames are selected to represent the shot's content; these are the content frames. That is, content frame extraction is performed on the video frame sequence of each shot in the shot sequence to obtain each shot's content frame sequence, and the video event sequence is then obtained from the shot sequence and the content frame sequence. The content frames of a shot include at least its first and last frames, so a shot's content frame count is at least 2.
In this embodiment, the video event sequence contains at least one video event, where a video event is the set of all content frames in one shot. Fig. 4 is a schematic diagram of content frame extraction according to an embodiment of the invention. As shown in Fig. 4, the first frame is the first content frame; the difference rates of the 2nd, 3rd and subsequent frames against it are computed until a frame's difference rate exceeds the preset threshold, and in the illustrated example the 4th frame becomes the second content frame. The difference rates of the 5th and later frames against the 4th frame are then computed in the same way: if the difference rates of the 5th, 6th and 7th frames stay below the preset threshold and the 8th frame's exceeds it, the 8th frame is the third content frame. Proceeding in this manner, the content frames among all frames between the first frame and the last frame are computed. The last frame is selected directly as the final content frame, without computing its difference rate against the previous content frame. The difference rate here is the computed difference rate between two frame images.
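The selection loop above can be summarized in a short sketch (an illustration only; frame_diff_rate and threshold are placeholders for the patent's two-frame difference rate computation and preset threshold):

```python
def extract_content_frames(shot_frames, threshold, frame_diff_rate):
    """Sketch of the content frame selection described above: keep the first
    frame, keep each middle frame whose difference rate from the previous
    content frame exceeds the preset threshold, and always keep the last
    frame. frame_diff_rate(a, b) is a placeholder for the patent's
    two-frame difference rate computation."""
    if len(shot_frames) < 2:
        return list(shot_frames)
    content = [shot_frames[0]]        # the first frame is always a content frame
    for frame in shot_frames[1:-1]:   # middle frames, in order
        if frame_diff_rate(frame, content[-1]) > threshold:
            content.append(frame)     # a new content frame
    content.append(shot_frames[-1])   # the last frame is always a content frame
    return content
```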
Consider, for example, a surveillance video. At night there are few people and vehicles, the picture changes little, and few content frames result, perhaps only a handful extracted over 10 hours. In the daytime, people and vehicles are numerous, the picture changes frequently, and far more content frames are computed than at night. Unlike key frames, which may lose part of a shot's content, content frames are thus guaranteed not to lose any of the content information of the shot's video. And compared with schemes that compute over every frame of the video, selecting content frames uses only part of the video's image frames, which greatly reduces the amount of image computation without losing content.
Fig. 5 shows a schematic structural diagram of a video event information space. Every video event has coordinates in this space, from which distances between video events can be calculated: identical video events have identical coordinates, similar video events are a small distance apart, and different video events are far apart. By calculating distances between video events, the space can be divided into several regions; the video event at the center of each region represents the main content of the whole region. The circular regions in the space can stand in three relationships: separation, tangency and intersection, where separation means the regions share no common area, tangency means they share exactly one common point (the tangent point), and intersection means they share a common area.
As shown in Fig. 5, the four points A, B, C and D are the centers of their respective circular regions; a circle's radius is the maximum distance from its center in the video event information space, and the video events A, B, C and D represent the main content of their regions. C1 and C2 are video events whose content is similar to video event C; B1 and B2 are similar to video event B; D1, D2 and D3 are similar to video event D; and the distances from C1, C2, B1, B2, D1, D2 and D3 to the centers of their respective regions do not exceed the radius.
Based on the video event information space shown in Fig. 5, the whole space can be partitioned into regions by selecting center points and specified radii, and a tree structure can then be built from this regional characteristic to record the relationships among the regions, forming a set of multi-level trees. As shown above, the video event information space is the multi-dimensional vector space in which the video event feature vectors lie, and it has this regional characteristic, so a forest structure can be built from the set of multi-level trees to form the video event information network; that is, the video event information network is a forest structure built over the video event information space from a set of multi-level trees.
Fig. 6 illustrates the tree structure creation process. According to the relationships between regions in the video event information space, the tree structure can be divided into two levels: the first level is the root nodes, corresponding to the centers of the spatial regions, and the second level is the child nodes, corresponding to the non-center points of each region. If a spatial region is subdivided into multiple sub-regions, the tree structure likewise gains corresponding further levels of child nodes; the number of levels of the tree structure matches the number of levels of the spatial regions in the information space. This embodiment is described using a 2-level tree structure as an example.
As shown in Fig. 6, several multi-level tree structures are obtained from the video event information space. Each multi-level tree comprises a root node and child nodes; the forest structure built from the set of these multi-level trees is the video event information network. Every child node in the network belongs to at least one root node, and a root node may also exist with no child nodes under it.
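The two-level forest can be pictured with a minimal data structure sketch; the field names below are illustrative and not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class EventNode:
    """One video event in the information network (illustrative fields)."""
    event_id: str
    fcnt: int                      # content frame count of the event
    feature_vector: list[float]    # the video event feature vector EV
    children: list["EventNode"] = field(default_factory=list)

# The network is a forest: root nodes are mutually dissimilar video events,
# and every child node's event is similar to its root node's event.
video_event_network: list[EventNode] = []
```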
Given the video event information space and the video event information network above, the network of this embodiment is built from video events, and the association between its tree structures is known: each child node's video event is similar to its root node's video event, while the video events of different root nodes are dissimilar. The video events in the network are thus linked by similarity, so comparison of specified video content can be performed over the network. When the network is used for similarity comparison of a specified video, the specified video need only be compared with the network's root nodes; since a root node's child nodes are similar to that root node, all video events in the network similar to the specified video can be found, yielding the set of videos similar to the specified video.
Fig. 7 shows a flowchart of the steps of a massive video comparison method based on a video event information network according to an embodiment of the invention. As shown in Fig. 7, the method includes the following steps S701 to S704:
S701: acquire a target video, and preprocess and granulate it to obtain the corresponding video event sequence.
In this embodiment, the video event sequence of the target video includes at least one target video event.
In one example, before the target video is acquired, the method of this embodiment includes: acquiring an original video from a video resource library; preprocessing and granulating the original video to obtain its video event sequence; and using the video events of the original video as root nodes or child nodes to construct the video event information network.
There may be one or more video resource libraries, storing massive numbers of videos, i.e. the original videos. Understandably, the original videos differ in image resolution, color space and so on, so each original video is preprocessed and granulated to obtain its video event sequence, and its video events serve as root nodes or child nodes in constructing the video event information network. The construction of the network itself is described above and is not repeated here.
Preprocessing comprises normalizing the original video to obtain a normalized video and de-framing it to obtain a normalized video frame sequence. Granulation comprises performing shot segmentation and content frame extraction on the normalized video frame sequence to obtain a shot sequence and a content frame sequence, then deriving the video event sequence of the original video from the shot sequence and the content frame sequence. The granulation structure is as shown in Fig. 3.
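The preprocessing and granulation pipeline can be sketched as follows; every callable here is a placeholder for the concrete algorithms described above, not an API defined by the patent:

```python
def build_event_sequence(video, normalize, deframe, split_shots,
                         extract_content_frames):
    """Sketch of preprocessing plus granulation: normalize, de-frame,
    segment into shots, then extract each shot's content frames. One video
    event is produced per shot: the set of that shot's content frames."""
    normalized = normalize(video)    # unify resolution, color space, etc.
    frames = deframe(normalized)     # normalized video frame sequence
    shots = split_shots(frames)      # shot sequence
    return [extract_content_frames(shot) for shot in shots]
```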
After the video event information network is constructed, the target video can be compared with the video events in the network. Understandably, target videos may likewise differ in video format, image resolution, color space and so on, so the target video is preprocessed and granulated in the same way.
S702: traverse the root nodes of the video event information network according to the target video event's content frame count and video event feature vector; judge whether the currently traversed root node is a candidate root node, and if so, calculate the similarity rate between the target video event and the candidate video event corresponding to that candidate root node.
Understandably, granulating the video as in Fig. 3 yields the number of content frames contained in each video event, so in this embodiment the feature data of the target video event includes its content frame count, and likewise the feature data of every video event in the video event information network includes its content frame count.
In this embodiment, traversing the root nodes of the video event information network according to the target video event's content frame count and video event feature vector, and judging whether the currently traversed root node is a candidate root node, comprises: taking as the first judgment condition that the absolute difference between the content frame count of the target video event and that of the video event corresponding to the currently traversed root node is at most a first preset threshold; calculating the feature vector difference rate between the target video event and the video event corresponding to the currently traversed root node from their content frame feature vectors; taking as the second judgment condition that this feature vector difference rate is at most a second preset threshold; and determining the currently traversed root node to be a candidate root node when it satisfies both judgment conditions.
It will be appreciated that if two shots are similar, the events occurring in them should have similar content frame counts, so this embodiment first screens the root nodes by the difference in content frame count between the two video events. For example, let fcnt_p denote the content frame count of the video event corresponding to the currently traversed root node of the video event information network, fcnt_q the content frame count of the target video event, and diff_max the first preset threshold; then |fcnt_p - fcnt_q| ≤ diff_max is the first judgment condition.
However, a video event information network built from massive original videos contains an enormous number of video events, and many of them will have content frame counts close to the target video event's, so the root nodes are further filtered by video event feature vector to obtain the candidate root nodes.
In this embodiment, the feature vector of a video event is obtained from the feature vectors of its content frames. Let FV_k(i) denote the value of the k-th dimension of the feature vector of the event's i-th content frame, let EV denote the video event feature vector, and let fcnt be the number of content frames in the event. Then EV is given by Equation 1:

EV = (v_1, v_2, ..., v_k, ..., v_3481)   (Equation 1)

where v_k denotes the value of the k-th dimension of EV and is computed, per Equation 2, from the values FV_k(i) of the event's content frames.
then, the feature vector difference ratio DisEV (p, q) of the video event corresponding to the target video event and the currently traversed root node calculates the formula:
wherein DiffEV (p, q) represents a feature vector difference value of a target video event and a video event corresponding to a root node currently traversed, modEV (p) represents a modulus of a feature vector of a video event corresponding to a root node currently traversed, modEV (q) represents a modulus of a feature vector of a target video event, min (modEV (p), modEV (q)) represents taking the minimum value of modEV (p) and modEV (q), and modEV (p) is not 0 as a denominator; when the modevs (p) and the modevs (q) are all 0, diev (p, q) =0.
The modulus modEV of an event feature vector is its vector modulus:

modEV = sqrt(v_1^2 + v_2^2 + ... + v_3481^2)
it will be appreciated that the smaller the feature vector difference rate of two video events, the more similar the two video events are characterized, assuming DisEV max If the threshold value is the second preset threshold value, disEV (p, q) is less than or equal to DisEV max As a second judgment condition.
Screening the root nodes with the first and second judgment conditions eliminates the video events in the network that are dissimilar to the target video event; a currently traversed root node that satisfies both conditions at once is determined to be a candidate root node.
Once the candidate root nodes are obtained, the candidate video event corresponding to a candidate root node may or may not actually be similar to the target video event, so the similarity rate between the target video event and the candidate video event must be further calculated to judge whether they are similar.
It should be noted that, to reduce computation and avoid performing the similarity calculation against the target video event for every candidate video event, this embodiment further includes, once the currently traversed root node is determined to be a candidate root node: comparing the video content of the target video event with that of the candidate video event, judging whether every content frame of the target video event has a corresponding matching frame in the candidate video event and whether the content frame difference rate between each content frame of the target video event and its matching frame is at most a third preset threshold; if so, calculating the similarity rate between the target video event and the candidate video event, and if not, determining that they are dissimilar.
In this embodiment, judging whether every content frame of the target video event has a corresponding matching frame in the candidate video event comprises: calculating the content frame difference rate between the first content frame of the target video event and each content frame of the candidate video event, and judging whether any of these difference rates is at most the third preset threshold; if so, taking the first such content frame of the candidate video event as the first matching frame; starting from the first matching frame, sequentially taking as many content frames of the candidate video event as the target video event has, and using them in order as the matching frames of the target video event's content frames; if the number of content frames from the start frame onward in the candidate video event is smaller than the target video event's content frame count, determining that some content frame of the target video event has no matching frame; if it is greater than or equal to the target video event's content frame count, determining that every content frame of the target video event has a corresponding matching frame in the candidate video event.
For example, suppose the target video event has 5 content frames and the candidate video event has 7. The difference rates between the target's first content frame and the candidate's first, second, ..., seventh content frames are calculated in turn. If the difference rate against the candidate's second content frame is at most the third preset threshold, the candidate's second content frame becomes the first matching frame; taking it as the start frame, the candidate's third, fourth, fifth and sixth content frames become, in order, the matching frames of the target's second, third, fourth and fifth content frames, so every content frame of the target video event finds a corresponding matching frame in the candidate video event.
However, if the number of content frames from the start frame onward in the candidate video event is smaller than the target video event's content frame count, some content frame of the target video event lacks a matching frame. For example, with 5 target content frames and 7 candidate content frames, if the first matching frame is the candidate's fourth content frame, only 4 content frames remain from the fourth onward while the target has 5, so the target's fifth content frame has no matching frame and the target video event is determined to have a content frame without a matching frame. If instead the first matching frame is the candidate's first content frame, 7 content frames remain from it onward, and every content frame of the target video event is determined to have a corresponding matching frame in the candidate video event.
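The matching frame search reads as follows in sketch form; dis stands in for the content frame difference rate dis(f_pj, f_qi) defined below:

```python
def find_matching_frames(target_frames, cand_frames, dis, dis_max):
    """Sketch of the matching frame search described above: locate the first
    content frame of the candidate event whose difference rate from the
    target's first content frame is at most dis_max, then align the
    following candidate frames one-to-one with the target's frames.
    Returns the aligned candidate frames, or None when no full match exists."""
    for j, frame in enumerate(cand_frames):
        if dis(frame, target_frames[0]) <= dis_max:    # first matching frame
            if len(cand_frames) - j >= len(target_frames):
                # enough frames remain: target frame i matches candidate frame j + i
                return cand_frames[j:j + len(target_frames)]
            return None    # too few frames after the start frame: no full match
    return None            # the target's first frame has no matching frame
```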
After determining that every content frame of the target video event has a corresponding matching frame in the candidate video event, whether the content frame difference rate between each content frame and its matching frame is at most the third preset threshold is judged, to verify frame-by-frame agreement and improve comparison precision.
If the difference rate between any content frame of the target video event and its matching frame exceeds the third preset threshold, that frame pair differs too much to be similar, meaning part of the content of the two video events differs; the target video event and the candidate video event are then determined to be dissimilar.
The content frame difference rate Dis(i) between the target video event and the candidate video event is computed from the original difference rate between matched frames, adjusted for the inherent error, where f_pj is the j-th content frame of candidate video event p, j ∈ [1..fcnt_p]; f_qi is the i-th content frame of target video event q, i ∈ [1..fcnt_q]; dis(f_pj, f_qi) is the original difference rate between the j-th content frame of event p and the i-th content frame of event q; θ is the inherent error used in the error adjustment; and dis_max is the third preset threshold.
In this embodiment, dis(f_pj, f_qi) is calculated as:

dis(f_pj, f_qi) = diff(f_pj, f_qi) / min(modULBPM(f_pj), modULBPM(f_qi))

where diff(f_pj, f_qi) is the content frame difference value between the i-th content frame of target video event q and the j-th content frame of candidate video event p, modULBPM(f_pj) is the modulus of the feature matrix of the j-th content frame of candidate event p, modULBPM(f_qi) is the modulus of the feature matrix of the i-th content frame of target event q, and min(modULBPM(f_pj), modULBPM(f_qi)), the denominator, must be nonzero; when modULBPM(f_pj) and modULBPM(f_qi) are both 0, dis(f_pj, f_qi) = 0.
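In sketch form, with diff and mod_ulbpm standing in for the content frame difference value and the feature matrix modulus:

```python
def frame_dis(f_p, f_q, diff, mod_ulbpm):
    """Original difference rate dis(f_pj, f_qi) between two content frames,
    following the reconstruction above. diff and mod_ulbpm are placeholders
    for the patent's difference value and feature matrix modulus."""
    denom = min(mod_ulbpm(f_p), mod_ulbpm(f_q))
    if denom == 0:    # the patent sets dis to 0 when both moduli are 0
        return 0.0
    return diff(f_p, f_q) / denom
```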
By the above method, the content frame difference rate between each content frame of the target video event and its matching frame can be calculated, and whether each of these rates is at most the third preset threshold is then judged. If not, the target video event and the candidate video event are dissimilar. If so, each frame pair differs little under single-frame comparison; but since a video event is composed of multiple content frames in continuous order, single-frame difference rates alone cannot represent the difference of the whole event. Therefore, to reach a more accurate judgment, when every content frame of the target video event has a matching frame in the candidate video event and every such frame pair's difference rate is at most the third threshold dis_max, the similarity rate between the target video event and the candidate video event is calculated, and whether the two events are similar is judged from that similarity rate.
S703: judge from the similarity rate whether the target video event is similar to the candidate video event; if so, add the candidate video event to the similar video event set.
Judging from the similarity rate whether the target video event is similar to the candidate video event comprises: judging whether the similarity rate between them is at least a fourth preset threshold; if so, determining that they are similar, and if not, that they are dissimilar.
In one example, the feature data further includes the content frame feature matrix of a video event, and the similarity rate between the target video event and the candidate video event is obtained from the content frame feature matrix of the target video event and that of the candidate video event per the similarity rate formula SimEV(p, q), where q denotes the target video event and fcnt_q its content frame count, p denotes the candidate video event and fcnt_p its content frame count, SimEV(p, q) denotes the similarity rate between q and p, i indexes the content frames of the target video event, and Dis(i) is the difference rate between the i-th content frame of q and its corresponding matching frame, which belongs to the candidate video event.
It is noted that a corresponding form of the formula applies when fcnt_p ≤ fcnt_q. If SimEV(p, q) ≥ SimEV_min, where SimEV_min is the fourth preset threshold, the similarity rate between the target video event and the candidate video event is high within the allowable error range, and the two are determined to be similar; if SimEV(p, q) < SimEV_min, their difference is too large, and they are determined to be dissimilar.
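The decision step can be sketched as below. The exact SimEV(p, q) formula is not reproduced in this text, so averaging (1 - Dis(i)) over the target's content frames is only an assumed stand-in that combines the named quantities; the threshold comparison itself follows the description above:

```python
def decide_similar(per_frame_dis, sim_ev_min):
    """per_frame_dis holds Dis(i) for each content frame of the target
    event. The averaging below is an ASSUMED stand-in for SimEV(p, q),
    whose exact formula is not reproduced in this text; the comparison
    against the fourth preset threshold follows the patent."""
    sim_ev = sum(1.0 - d for d in per_frame_dis) / len(per_frame_dis)
    return sim_ev, sim_ev >= sim_ev_min    # similar iff SimEV >= SimEV_min
```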
As described above, a determination can be made for the video events corresponding to all root nodes in the video event information network, yielding the candidate video events similar to the target video event, which are added to the similar video event set.
S704: after the traversal of the root nodes is finished, output the similar video event set.
After the root nodes have been traversed, the video events corresponding to all root nodes similar to the target video are obtained. The video event information network, however, also contains child nodes and their video events, so after the traversal the method further comprises: obtaining all child nodes associated with any root node similar to the target video event as similar child nodes, calculating the similarity rate between the target video event and the video event corresponding to each similar child node, and adding those video events and their similarity rates to the similar video event set.
It can be understood that in the video event information network root nodes are mutually dissimilar while child nodes are similar to their root node; therefore, once the video event of some root node is found to be similar to the target video, its child nodes can be directly judged similar to the target video event and taken as similar child nodes, and their video events added to the similar video event set. The resulting similar video event set thus contains all video events in the network that are similar to the target video event.
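The whole traversal, including the direct admission of child nodes and the sorting described next, can be sketched as follows; is_candidate, events_similar and sim_rate are placeholders for the screening, decision and similarity rate computations above:

```python
def compare_against_network(target, roots, is_candidate, events_similar, sim_rate):
    """Sketch of the overall comparison: screen each root node, test the
    surviving candidates, and for every similar root also admit its child
    nodes, which are similar by construction of the network."""
    similar = []
    for root in roots:
        if not is_candidate(root, target):
            continue
        if events_similar(target, root):
            similar.append((root, sim_rate(target, root)))
            for child in root.children:    # children are similar to their root
                similar.append((child, sim_rate(target, child)))
    similar.sort(key=lambda pair: pair[1], reverse=True)    # most similar first
    return similar
```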
In addition, after the similar video event set is obtained, the method of this embodiment further comprises: when the set contains more than one similar video event, sorting its video events by the similarity rate between the target video event and each similar video event, for example placing the video event most similar to the target video event first, which makes the comparison result more intuitive and eases subsequent processing by staff.
In one example, the above massive video comparison method based on the video event information network can be applied in a copyright detection scenario: if the target video is a copyrighted video, video events similar to those of the copyrighted video can be detected. The method of this embodiment then further includes: for any target video event, acquiring the position information, in the original video, of each video event in the similar video event set; and outputting the target video event together with the position information, in the original video, of the video events similar to it.
For example, for a target video event A, if the obtained similar video event set contains 13 video events similar to A, the position information of each of these 13 video events in its original video is acquired; for instance, video event V is similar to target video event A and is located at the playing position from the 3rd minute to the 5th minute of its original video. The target video event and the position information of the similar video events in the original video are then output, so that copyright inspectors can quickly find the infringing video, realizing localization of the infringing content.
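A small sketch of this reporting step is given below; the record layout, file names, and minute-based positions are assumptions for illustration.

```python
# Report where each similar event sits in its original (resource) video,
# so an inspector can jump straight to the suspected segment.
matches = [  # (similar event id, original video, start minute, end minute)
    ("V", "resource_video_17.mp4", 3, 5),
    ("W", "resource_video_02.mp4", 41, 43),
]
for event_id, video, start, end in matches:
    print(f"target event A ~ {event_id}: {video}, minutes {start}-{end}")
```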
According to the massive video comparison method based on the video event information network, the target video is preprocessed and granulated to obtain the target video event sequence corresponding to the target video; the root nodes in the video event information network are traversed according to the content frame number and the video event feature vector of the target video event, and it is judged whether the currently traversed root node is an alternative root node; if so, the similarity rate between the target video event and the alternative video event corresponding to the alternative root node is calculated; whether the target video event is similar to the alternative video event is judged according to the similarity rate, and if so, the alternative video event is added to the similar video event set; after the root nodes have been traversed, the similar video event set is output. In this way, a specified video event can be quickly compared for similarity against the resource video event sets of massive video resources, the resource video events similar to the specified video event can be found, and the accuracy and efficiency of video event similarity comparison can be improved.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
The following is a massive video comparison device based on a video event information network, which can be used for executing the method embodiment of the invention. For details not disclosed in the embodiment of the massive video comparison device based on the video event information network, please refer to the embodiment of the method of the present invention.
Fig. 8 is a schematic structural diagram of a massive video comparison device based on a video event information network according to an exemplary embodiment of the present invention. The massive video comparison device based on the video event information network can be implemented as all or part of a terminal through software, hardware, or a combination of the two. The massive video comparison device 700 based on the video event information network comprises:
The video processing module 701 is configured to obtain a target video, perform preprocessing and granulating on the target video, and obtain a video event sequence corresponding to the target video, where the video event sequence includes at least one target video event.
The searching module 702 is configured to traverse the root nodes in the video event information network according to the content frame number and the video event feature vector of the target video event, judge whether the currently traversed root node is an alternative root node, and if so, calculate the similarity rate between the target video event and the alternative video event corresponding to the alternative root node.

The comparison module 703 is configured to judge whether the target video event is similar to the alternative video event according to the similarity rate, and if so, add the alternative video event to the similar video event set.
The result output module 704 is configured to output a similar video event set, where the similar video event set includes all video events similar to the target video event in the video event information network.
The video event is the set of all content frames in one shot, where the content frames are frames representing the shot content and comprise a first frame, a last frame and N intermediate frames, N being a natural number; the intermediate frames are obtained by calculating the difference rate between each subframe of the shot (other than the first frame and the last frame) and the previous content frame, a subframe being taken as an intermediate frame when the difference rate is greater than a preset threshold. The video event information network is a forest structure constructed, based on a video event information space, from a set of multi-level trees; the video event information space is a multi-dimensional vector space in which the video event feature vectors are located, and a video event feature vector is calculated after extracting feature matrices from a content frame set under the same coordinate system.
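As a rough illustration of how the content frames of one shot are selected, the sketch below keeps the first frame, then keeps a subframe as a new intermediate frame whenever its difference rate from the previous content frame exceeds the preset threshold, and finally keeps the last frame; the frame representation and the concrete difference-rate function are assumptions for the example.

```python
# Sketch: select the content frames of one shot (one video event).
import numpy as np

def diff_rate(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed difference rate: mean absolute pixel change, in [0, 1]."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))) / 255.0)

def content_frames(shot: list[np.ndarray], thresh: float = 0.2) -> list[np.ndarray]:
    frames = [shot[0]]                     # first frame is always kept
    for sub in shot[1:-1]:                 # subframes between first and last
        if diff_rate(sub, frames[-1]) > thresh:
            frames.append(sub)             # new intermediate frame
    if len(shot) > 1:
        frames.append(shot[-1])            # last frame is always kept
    return frames
```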
It should be noted that, when the massive video comparison device based on the video event information network provided in the above embodiment performs the massive video comparison method based on the video event information network, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment provided above and the method embodiment belong to the same concept, the specific implementation process is detailed in the method embodiment, and it is not repeated here.
An embodiment of the present invention further provides an electronic device corresponding to the massive video comparison method based on the video event information network provided by the foregoing embodiments, so as to execute that method.
Fig. 9 shows a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 9, the electronic device 800 includes: a memory 801 and a processor 802, where the memory 801 stores a computer program executable on the processor 802, and the processor 802, when running the computer program, executes the method provided by any of the foregoing embodiments of the present invention.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in this embodiment, the processor may be configured to execute the steps of the massive video comparison method based on the video event information network by using a computer program.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 9 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), or a PAD. Fig. 9 does not limit the structure of the electronic device; for example, the electronic device may include more or fewer components than shown in fig. 9 (e.g., a network interface), or have a configuration different from that shown in fig. 9.
The memory 801 may be used to store software programs and modules, such as the program instructions/modules corresponding to the massive video comparison method and device based on the video event information network in the embodiments of the present invention, and the processor 802 executes various functional applications and data processing by running the software programs and modules stored in the memory 801, that is, implements the massive video comparison method based on the video event information network. The memory 801 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 801 may further include memory located remotely from the processor 802, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 801 may be used for storing, but is not limited to, information such as the video event information network. As an example, the memory 801 may include, but is not limited to, the video processing module, the searching module, the comparison module, and the result output module in the above massive video comparison device based on the video event information network. It may also include, but is not limited to, other module units in the massive video comparison device, which are not described in detail in this example.
Optionally, the electronic device comprises transmission means 803, the transmission means 803 being adapted to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 803 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 803 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a display 804, configured to display a comparison result of the video content comparison based on the video event information network; and a connection bus 805 for connecting the respective module parts in the above-described electronic apparatus.
The embodiments of the present invention further provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above massive video comparison method based on the video event information network, wherein the computer program is arranged to perform the steps of any of the above method embodiments when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of the massive video comparison method based on the video event information network.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method of the various embodiments of the present invention.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division into units is merely a logical functional division, and there may be other ways of dividing in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that several improvements and modifications may be made by those skilled in the art without departing from the principles of the present invention; such improvements and modifications are also intended to fall within the protection scope of the present invention.

Claims (11)

1. A massive video comparison method based on a video event information network, wherein a video event refers to the set of all content frames in one shot, the content frames being frames representing the shot content and comprising a first frame, a last frame and N intermediate frames, N being a natural number, the intermediate frames being obtained by calculating the difference rate between each subframe of the shot other than the first frame and the last frame and the previous content frame, a subframe being taken as an intermediate frame when the difference rate is greater than a preset threshold; the video event information network is a forest structure constructed, based on a video event information space, from a set of multi-level trees; the video event information space is a multi-dimensional vector space in which video event feature vectors are located, and the video event feature vectors are calculated after extracting feature matrices from content frame sets under the same coordinate system; the method comprises the following steps:
Acquiring a target video, preprocessing and granulating the target video to obtain a video event sequence corresponding to the target video, wherein the video event sequence comprises at least one target video event;
traversing root nodes in the video event information network according to the content frame number and the video event feature vector of the target video event, judging whether the currently traversed root node is an alternative root node, and if so, calculating the similarity ratio of the target video event and the alternative video event corresponding to the alternative root node;
judging whether the target video event is similar to the alternative video event according to the similarity ratio, if so, adding the alternative video event into a similar video event set;
and after the traversing root node is finished, outputting the similar video event set, wherein the similar video event set comprises all video events similar to the target video event in a video event information network.
2. The method of claim 1, wherein prior to capturing the target video, the method comprises:
acquiring an original video in a video resource library;
preprocessing and granulating the original video to obtain a video event sequence of the original video;
Taking the video event of the original video as a root node or a child node of the video event information network to construct the video event information network;
wherein, the pretreatment comprises:
normalizing the original video to obtain a normalized video;
de-framing the normalized video to obtain a normalized video frame sequence;
wherein, the granulating treatment comprises:
performing shot segmentation and content frame extraction on the normalized video frame sequence to obtain a shot sequence and a content frame sequence;
and obtaining a video event sequence corresponding to the original video according to the shot sequence and the content frame sequence.
3. The method according to claim 1, wherein traversing the root node in the video event information network according to the number of content frames and the video event feature vector of the target video event, and determining whether the currently traversed root node is an alternative root node, comprises:
taking, as a first judgment condition, that the absolute value of the difference between the content frame number of the target video event and the content frame number of the video event corresponding to the currently traversed root node is smaller than or equal to a first preset threshold;
calculating the feature vector difference rate of the video event corresponding to the root node traversed currently according to the content frame feature vector of the target video event and the content frame feature vector of the video event corresponding to the root node traversed currently;
taking, as a second judgment condition, that the video event feature vector difference rate between the target video event and the video event corresponding to the currently traversed root node is smaller than or equal to a second preset threshold;
and when the currently traversed root node meets the first judging condition and the second judging condition simultaneously, determining the currently traversed root node as an alternative root node.
4. The method of claim 1, wherein in the event that the root node currently traversed to is determined to be an alternate root node, the method further comprises:
comparing the video content of the target video event with that of the alternative video event, and judging whether each content frame of the target video event has a corresponding matching frame in the alternative video event and whether the content frame difference rate between each content frame of the target video event and its matching frame is smaller than or equal to a third preset threshold; if yes, calculating the similarity rate of the target video event and the alternative video event; if not, determining that the target video event and the alternative video event are dissimilar.
5. The method of claim 4, wherein determining whether any of the content frames of the target video event have corresponding matching frames in the alternative video event comprises:
calculating, in sequence, the content frame difference rate between the first content frame of the target video event and each content frame in the alternative video event, and judging whether there is a content frame whose content frame difference rate with the first content frame of the target video event is smaller than or equal to the third preset threshold; if yes, taking the content frame in the alternative video event whose content frame difference rate with the first content frame of the target video event is smaller than or equal to the third preset threshold as a first matching frame;
taking the first matching frame as a starting frame, sequentially acquiring, in the alternative video event, content frames of the same number as the content frames of the target video event, and sequentially taking them as the matching frames of the content frames of the target video event;

if the number of content frames from the starting frame onwards in the alternative video event is smaller than the number of content frames of the target video event, determining that the target video event has content frames without matching frames;

if the number of content frames from the starting frame onwards in the alternative video event is greater than or equal to the number of content frames of the target video event, determining that each content frame of the target video event has a corresponding matching frame in the alternative video event.
6. The method of claim 1, wherein the feature data further comprises a content frame feature matrix of video events, and wherein determining whether the target video event is similar to the alternative video event based on the similarity ratio comprises:
judging whether the similarity ratio of the target video event and the alternative video event is larger than or equal to a fourth preset threshold value, if so, determining that the target video event is similar to the alternative video event, and if not, determining that the target video event is dissimilar to the alternative video event;
the similarity rate of the target video event and the alternative video event is obtained according to the content frame feature matrix of the target video event and the content frame feature matrix of the alternative video event, and the similarity rate of the target video event and the alternative video event is calculated as:

SimEV(p, q) = (1 / fcnt_q) · Σ_{i=1}^{fcnt_q} (1 − Dis(i))

where q denotes the target video event, fcnt_q is the number of content frames of the target video event, p denotes the alternative video event, fcnt_p is the number of content frames of the alternative video event, SimEV(p, q) is the similarity rate of the target video event q and the alternative video event p, i is the index of a content frame of the target video event, and Dis(i) is the difference rate between the i-th content frame of the target video event q and its corresponding matching frame, the matching frame belonging to the alternative video event.
7. The method of claim 1, wherein after traversing the root node, the method further comprises:
and obtaining all sub-nodes associated with a root node similar to the target video event, serving as similar sub-nodes, calculating the similarity rate of the video event corresponding to the target video event and the similar sub-nodes, and adding the video event corresponding to the similar sub-nodes and the similarity rate to the similar video event set.
8. The method according to claim 1, wherein the method further comprises:
and under the condition that the total number of the similar video events in the similar video event set is greater than 1, sorting the video events in the similar video event set according to the similarity rate between the target video event and each similar video event.
9. The method according to claim 2, wherein the method further comprises:
for any target video event, acquiring the position information, in the original video, of each video event in the similar video event set;

and outputting the target video event and the position information, in the original video, of the video events similar to the target video event.
10. A massive video contrast device based on a video event information network, the device comprising:
The video processing module is used for acquiring a target video, preprocessing and granulating the target video to obtain a video event sequence corresponding to the target video, wherein the video event sequence comprises at least one target video event;
the searching module is used for traversing root nodes in the video event information network according to the content frame number and the video event feature vector of the target video event, judging whether the currently traversed root node is an alternative root node or not, and if yes, calculating the similarity of the target video event and the alternative video event corresponding to the alternative root node;
the comparison module is used for judging whether the target video event is similar to the alternative video event according to the similarity ratio, and if so, adding the alternative video event into a similar video event set;
the result output module is used for outputting the similar video event set, wherein the similar video event set comprises all video events similar to the target video event in a video event information network;
the video event is the set of all content frames in one shot, where the content frames are frames representing the shot content and comprise a first frame, a last frame and N intermediate frames, N being a natural number; the intermediate frames are obtained by calculating the difference rate between each subframe of the shot (other than the first frame and the last frame) and the previous content frame, a subframe being taken as an intermediate frame when the difference rate is greater than a preset threshold; the video event information network is a forest structure constructed, based on a video event information space, from a set of multi-level trees; the video event information space is a multi-dimensional vector space in which the video event feature vectors are located, and the video event feature vectors are calculated after extracting feature matrices from a content frame set under the same coordinate system.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor runs the computer program to implement the method of any one of claims 1-9.
CN202310692997.0A 2023-06-09 2023-06-09 Massive video comparison method, device and equipment based on video event information network Pending CN117112831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310692997.0A CN117112831A (en) 2023-06-09 2023-06-09 Massive video comparison method, device and equipment based on video event information network

Publications (1)

Publication Number Publication Date
CN117112831A true CN117112831A (en) 2023-11-24

Family

ID=88799027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310692997.0A Pending CN117112831A (en) 2023-06-09 2023-06-09 Massive video comparison method, device and equipment based on video event information network

Country Status (1)

Country Link
CN (1) CN117112831A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050180603A1 (en) * 2004-01-27 2005-08-18 Imad Zoghlami Method and system for searching for events in video surveillance
CN116188821A (en) * 2023-04-25 2023-05-30 青岛尘元科技信息有限公司 Copyright detection method, system, electronic device and storage medium
CN116188805A (en) * 2023-04-26 2023-05-30 青岛尘元科技信息有限公司 Image content analysis method and device for massive images and image information network

Similar Documents

Publication Publication Date Title
CN116188821B (en) Copyright detection method, system, electronic device and storage medium
CN109658445A (en) Network training method, increment build drawing method, localization method, device and equipment
CN112927363B (en) Voxel map construction method and device, computer readable medium and electronic equipment
CN114694185B (en) Cross-modal target re-identification method, device, equipment and medium
CN111831844A (en) Image retrieval method, image retrieval device, image retrieval apparatus, and medium
CN112489099B (en) Point cloud registration method and device, storage medium and electronic equipment
CN108335290B (en) Image area copying and tampering detection method based on LIOP feature and block matching
Wang et al. Duplicate discovery on 2 billion internet images
US20200005078A1 (en) Content aware forensic detection of image manipulations
CN114627244A (en) Three-dimensional reconstruction method and device, electronic equipment and computer readable medium
Xuqin Application of network protocol improvement and image content search in mathematical calculus 3D modeling video analysis
CN116958267B (en) Pose processing method and device, electronic equipment and storage medium
CN111767839B (en) Vehicle driving track determining method, device, equipment and medium
CN111402429B (en) Scale reduction and three-dimensional reconstruction method, system, storage medium and equipment
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN116469039B (en) Hot video event determination method and system, storage medium and electronic equipment
CN117112831A (en) Massive video comparison method, device and equipment based on video event information network
CN112257666B (en) Target image content aggregation method, device, equipment and readable storage medium
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN117156200B (en) Method, system, electronic equipment and medium for removing duplication of massive videos
CN117152650B (en) Video content analysis method and video event information network for massive videos
CN117112815B (en) Personal attention video event retrieval method and system, storage medium and electronic device
CN113470113B (en) Component attitude estimation method integrating BRIEF feature matching and ICP point cloud registration
CN116662606B (en) Method and system for determining new video event, storage medium and electronic device
CN112967398B (en) Three-dimensional data reconstruction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination