CN117152650A - Video content analysis method and video event information network for massive videos - Google Patents

Video content analysis method and video event information network for massive videos

Info

Publication number
CN117152650A
CN117152650A (application CN202310441704.1A)
Authority
CN
China
Prior art keywords: video, video event, content, event, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310441704.1A
Other languages
Chinese (zh)
Inventor
汪昭辰
刘世章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Chenyuan Technology Information Co ltd
Original Assignee
Qingdao Chenyuan Technology Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Chenyuan Technology Information Co ltd filed Critical Qingdao Chenyuan Technology Information Co ltd
Priority to CN202310441704.1A priority Critical patent/CN117152650A/en
Publication of CN117152650A publication Critical patent/CN117152650A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/40 Extraction of image or video features
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/44 Event detection
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video content analysis method for massive videos and a video event information network, relating to the field of video processing. The method comprises: preprocessing an acquired video to be processed to obtain a normalized video; granulating the normalized video to obtain a video event sequence comprising at least one video event; analyzing and calculating the content frames of each video event to obtain the feature data of the video event; and constructing a video event information network in a video event information space according to the feature data. The video events in the network are associated according to their similarity, so when the network is used for similarity analysis of video event content, the analysis result is obtained quickly and the efficiency of video content analysis is improved.

Description

Video content analysis method and video event information network for massive videos
Technical Field
The application relates to the field of video processing, in particular to a video content analysis method and a video event information network for massive videos.
Background
At present, screening and analysis of massive videos mostly rely on video fingerprint comparison: image color histogram features are taken as video fingerprints, or fingerprints are derived from the two-dimensional discrete cosine transform of video images, and content analysis of massive videos and content association analysis between different videos are carried out according to these video fingerprints.

However, such approaches have poor noise immunity and low video detection accuracy, which affects the video relevance judgment. For example, after a video undergoes aspect-ratio conversion, or frame-graphic conversion such as adding a watermark, its substantive content remains the same, but its video fingerprint changes compared with the original; that is, the detection accuracy of existing video event relevance analysis methods is not high.

In addition, existing deep-learning-based video analysis and comparison methods depend heavily on a sample library: model training requires a large number of sample videos, the training cost is high, the training time is long, the noise resistance is poor, and the efficiency of video content analysis is low.
Disclosure of Invention
In view of the above, the application aims to provide a video content analysis method for massive videos and a video event information network, which address the low precision and low efficiency of content association analysis of massive videos in the prior art.
Based on the above object, in a first aspect, the present application provides a video content analysis method for massive videos, the method comprising: acquiring a video to be processed and preprocessing it to obtain a normalized video; granulating the normalized video to obtain a video event sequence, wherein the video event sequence comprises at least one video event, a video event refers to the set of all content frames in a shot, and a content frame is a frame representing the shot content; the content frames comprise a first frame, a last frame and N middle frames, N being a natural number, where a middle frame is selected by calculating, for each frame of the shot other than the first and last frames, the difference rate against the previous content frame, and keeping the frame when the difference rate is greater than a preset threshold; analyzing and calculating the content frames of the video event to obtain the feature data of the video event; and constructing a video event information network in a video event information space according to the feature data of the video event, wherein the video event information network is a forest structure built on the video event information space from a set of multi-level trees, the video event information space is the multi-dimensional vector space in which video event feature vectors lie, and a video event feature vector is calculated from the feature matrices extracted from the content frame set under the same coordinate system.
Optionally, granulating the normalized video to obtain a video event sequence comprises: performing shot detection on the frame sequence of the normalized video to obtain a shot sequence comprising at least one shot; extracting content frames from the video frame sequence of each shot in the shot sequence to obtain the content frame sequence of each shot; and obtaining the video event sequence from the shot sequence and the content frame sequences.
Optionally, analyzing and calculating the content frames of the video event to obtain the feature data of the video event comprises: obtaining the content frame count of the video event from the number of frames in its content frame set; and obtaining the feature vector of the video event from the feature matrix of each content frame in the content frame set.
Optionally, when the video event is the first video event in the video event sequence, constructing a video event information network in a video event information space according to the feature data of the video event comprises: taking the first video event as the first root node of the video event information network and constructing the network from it.
Optionally, when the video event is the second video event in the video event sequence, after taking the first video event as the first root node of the video event information network, the method further comprises: judging whether the second video event is similar to the first video event; if so, determining that the second video event is a child node of the first root node; if not, determining that the second video event is a second root node; and adding the second video event to the video event information network as a child node of the first root node or as the second root node.
Optionally, judging whether the second video event is similar to the first video event comprises: judging whether the absolute value of the difference between the content frame count of the second video event and that of the first video event is less than or equal to a first threshold; if not, determining that the second video event is dissimilar to the first video event; if so, judging whether the feature vector difference rate of the second video event and the first video event is less than or equal to a second threshold; if not, determining that the two events are dissimilar; if so, judging whether every content frame of the second video event has a target content frame in the first video event and whether the content frame difference rate between each content frame of the second video event and its corresponding target content frame is less than or equal to a third threshold, where a target content frame is found by sequentially calculating content frame difference rates between a content frame of the second video event, starting from its first content frame, and the content frames of the first video event; if not, determining that the two events are dissimilar; if so, judging whether the similarity of the second video event and the first video event is greater than or equal to a fourth threshold; if so, determining that the second video event is similar to the first video event, and if not, determining that they are dissimilar.
Optionally, when the video event is the Mth video event in the video event sequence, M being an integer greater than or equal to 3, the method further comprises: traversing the nodes of the video event information network to obtain all root nodes and the root node video event corresponding to each; judging whether the Mth video event is similar to at least one root node video event; if so, determining that the Mth video event is a new child node of that root node; if not, determining that the Mth video event is a new root node; and adding the new root node to the set of root nodes of the video event information network, or adding the new child node to the set of child nodes of the video event information network.
In a second aspect, there is also provided a video content analysis apparatus for massive videos, the apparatus comprising: a video processing module for acquiring a video to be processed and preprocessing it to obtain a normalized video; a granulating module for granulating the normalized video to obtain a video event sequence, wherein the video event sequence comprises at least one video event, a video event refers to the set of all content frames in a shot, a content frame is a frame representing the shot content, the content frames comprise a first frame, a last frame and N middle frames, N being a natural number, and a middle frame is selected when the difference rate between a frame of the shot (other than the first and last frames) and the previous content frame is greater than a preset threshold; a computing module for analyzing and calculating the content frames of the video event to obtain the feature data of the video event; and a construction module for constructing a video event information network in a video event information space according to the feature data of the video event, wherein the video event information network is a forest structure built on the video event information space from a set of multi-level trees, the video event information space is the multi-dimensional vector space in which video event feature vectors lie, and a video event feature vector is calculated from the feature matrices extracted from the content frame set under the same coordinate system.
In a third aspect, there is also provided a video event information network constructed based on the video content analysis method of massive video according to any one of the first aspects.
In a fourth aspect, there is also provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor runs the computer program to implement the method of the first aspect.
In a fifth aspect, there is also provided a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of any of the first aspects.
In general, the present application has at least the following benefits:
According to the video content analysis method for massive videos, the video to be processed is preprocessed and granulated to obtain video events, feature data are obtained from the video events, and a video event information network is constructed in a video event information space according to the feature data. Because the video events in the constructed network are associated according to their similarity, content analysis over massive video events yields a video event information network in which each root node video event carries most of the video event features of its child node video events. When the network is used for similarity comparison of a target video event, the target is first compared with the root nodes of the network; if the target is similar to a root node, it is then compared with all child nodes of that root node. In this way all video events in the network similar to the target can be found quickly, the analysis result is obtained quickly, and the efficiency of video event analysis is improved.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope. The exemplary embodiments of the present application and the descriptions thereof are for explaining the present application and do not constitute an undue limitation of the present application. In the drawings:
FIG. 1 shows a schematic diagram of an application environment of an alternative video content analysis method for massive video according to an embodiment of the application;
FIG. 2 is a schematic diagram of an application environment of another alternative method for video content analysis of massive video according to an embodiment of the present application;
FIG. 3 shows a flow chart of steps of a method for video content analysis of a mass video according to an embodiment of the application;
fig. 4 shows a schematic view of a granulating structure according to an embodiment of the application;
FIG. 5 shows a schematic diagram of content frame extraction according to an embodiment of the application;
fig. 6 illustrates a schematic structure of a video event information space according to an embodiment of the present application;
FIG. 7 shows a tree structure creation process;
FIG. 8 illustrates a method of determining whether a second video event is similar to a first video event according to the present embodiment;
FIG. 9 shows a general step schematic of a video content analysis method for a massive video event in one example;
fig. 10 is a schematic diagram showing the structure of a video content analysis apparatus for mass videos according to an exemplary embodiment of the present application;
FIG. 11 shows a schematic diagram of a video event information network;
fig. 12 shows a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In one aspect of the embodiments of the present invention, a video content analysis method for massive videos is provided. As an alternative implementation, the method may be applied to, but is not limited to, the application environment shown in fig. 1, which comprises: a terminal device 102 that interacts with a user, a network 104, and a server 106. Human-machine interaction can be performed between the user 108 and the terminal device 102, and a video content analysis application for massive videos runs in the terminal device 102. The terminal device 102 comprises a human-machine interaction screen 1022, a processor 1024 and a memory 1026. The human-machine interaction screen 1022 is used to display video; the processor 1024 is used to obtain the video to be processed and construct a video event information network based on it; the memory 1026 is used to store the constructed video event information network.
In addition, the server 106 includes a database 1062 and a processing engine 1064, where the database 1062 is used to store the video event information network. The processing engine 1064 is configured to: acquiring a video to be processed, and preprocessing the video to be processed to obtain a normalized video; granulating the normalized video to obtain a video event sequence, wherein the video event sequence comprises at least one video event, and analyzing and calculating the content frames of the video event to obtain characteristic data of the video event; and constructing a video event information network in a video event information space according to the characteristic data of the video event.
In one or more embodiments, the video content analysis of massive videos of the present application may be applied in the application environment shown in fig. 2. As shown in fig. 2, human-machine interaction may be performed between a user 202 and a user device 204. The user device 204 comprises a memory 206 and a processor 208. The user device 204 in this embodiment may, but is not limited to, construct the video event information network by performing the operations performed by the terminal device 102.
Optionally, the terminal device 102 and the user device 204 include, but are not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, a vehicle-mounted electronic device, a wearable device, and the like, and the network 104 may include, but is not limited to, a wireless network or a wired network. Wherein the wireless network comprises: WIFI and other networks that enable wireless communications. The wired network may include, but is not limited to: wide area network, metropolitan area network, local area network. The server 106 may include, but is not limited to, any hardware device that may perform calculations. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and is not limited in any way in the present embodiment.
In the related art, screening and analysis of massive videos mostly rely on video fingerprint comparison: image color histogram features are taken as video fingerprints, or fingerprints are derived from the two-dimensional discrete cosine transform of video images, and content analysis of massive videos and content association analysis between different videos are carried out according to these fingerprints. However, such approaches have poor noise immunity and low video detection accuracy, which affects the video relevance judgment. For example, after a video undergoes aspect-ratio conversion, or frame-graphic conversion such as adding a watermark, its substantive content remains the same, but its video fingerprint changes compared with the original; that is, the detection accuracy of existing video event relevance analysis methods is not high. In addition, existing deep-learning-based video analysis and comparison methods depend heavily on a sample library: model training requires a large number of sample videos, the training cost is high, the training time is long, the noise resistance is poor, and the efficiency of video content analysis is low.
In order to solve the above technical problems, as an optional implementation, an embodiment of the present invention provides a video content analysis method for massive videos.
Fig. 3 shows a flow chart of steps of a video content analysis method for a mass video according to an embodiment of the application. As shown in fig. 3, the video content analysis method of the massive videos includes the following steps S301 to S304:
s301, acquiring a video to be processed, and preprocessing the video to be processed to obtain a normalized video.
In this embodiment, the video to be processed may come from one or more resource libraries, may be specified by a user, or may come from the internet.
Preprocessing the video to be processed includes, but is not limited to, video frame decomposition, picture-in-picture image extraction and frame (border) removal, as well as normalization of the resolution, aspect ratio, color space and the like of the video images, so that the resulting normalized videos share the same dimensions, which facilitates granulation.
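A minimal sketch of the normalization step follows. It assumes OpenCV; the target resolution, the letterboxing used to normalize the aspect ratio, and the grayscale color space are illustrative choices, since the patent leaves the concrete normalization parameters open.

    import cv2
    import numpy as np

    TARGET_W, TARGET_H = 320, 240  # assumed normalization size

    def normalize_frame(frame: np.ndarray) -> np.ndarray:
        """Letterbox a frame to a fixed resolution and convert its color space."""
        h, w = frame.shape[:2]
        scale = min(TARGET_W / w, TARGET_H / h)
        resized = cv2.resize(frame, (max(1, int(w * scale)), max(1, int(h * scale))))
        canvas = np.zeros((TARGET_H, TARGET_W, 3), dtype=np.uint8)
        y0 = (TARGET_H - resized.shape[0]) // 2
        x0 = (TARGET_W - resized.shape[1]) // 2
        canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
        return cv2.cvtColor(canvas, cv2.COLOR_BGR2GRAY)

    def decompose_video(path: str) -> list:
        """Video frame decomposition: read a file into normalized frames."""
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(normalize_frame(frame))
        cap.release()
        return frames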
S302, granulating the normalized video to obtain a video event sequence.
In this embodiment, granulating the normalized video to obtain a video event sequence comprises: performing shot detection on the frame sequence of the normalized video to obtain a shot sequence comprising at least one shot; extracting content frames from the video frame sequence of each shot to obtain the content frame sequence of each shot; and obtaining the video event sequence from the shot sequence and the content frame sequences.
Fig. 4 shows a schematic diagram of the video granulating structure. Referring to fig. 4, the granulating structure comprises the video, the frame sequence, shots and content frames: the frame sequence is all frames representing the video content; a shot is the continuous picture segment captured by a camera between a start and a stop, and is the basic unit of video composition; a content frame is a frame representing the shot content.
In this embodiment, granulation refers to segmenting a video into shots to obtain its granulating structure. The principle is as follows: video content consists of a continuous frame sequence, which can be divided into groups according to the continuity of the content. Shot detection is performed on the frame sequence, each group of continuous frames forming one shot, and the shot sequence comprises at least one shot. By analyzing the differences of content within a shot, a small number of frames are selected to represent the shot content; these are the content frames. That is, content frame extraction is performed on the video frame sequence of each shot to obtain the content frame sequence of each shot, and the video event sequence is then obtained from the shot sequence and the content frame sequences. The content frames of a shot include at least its first and last frames, so the content frame count of a shot is greater than or equal to 2.
In this embodiment, the video event sequence comprises at least one video event, where a video event refers to the set of all content frames in a shot. The content frames comprise a first frame, a last frame and N middle frames, N being a natural number; a middle frame is selected by calculating, for each frame of the shot other than the first and last frames, the difference rate against the previous content frame, and keeping the frame when the difference rate is greater than a preset threshold.
Fig. 5 is a schematic diagram of content frame extraction according to an embodiment of the present invention. As shown in fig. 5, the first frame is the first content frame; the 2nd and 3rd frames are then compared with it, and so on, until a frame whose difference rate exceeds the preset threshold is found, say the 4th frame, which becomes the second content frame. The 5th, 6th and 7th frames are then compared with the 4th frame; if their difference rates are below the preset threshold but the 8th frame's difference rate exceeds it, the 8th frame is the third content frame. By analogy, content frames are found among all frames between the first frame and the last frame. The last frame is selected directly as the final content frame, without calculating its difference rate against the previous content frame. The difference rate here is the calculated rate of difference between two frame images.
For example, consider a surveillance video: at night there are few people and vehicles, the picture changes little, and few content frames result, perhaps only a handful within 10 hours. In the daytime there are many people and vehicles, the picture changes frequently, and far more content frames are extracted than at night. Unlike key frames, which may lose part of the shot content, content frames thus preserve all of the content information of the shot video. And compared with schemes that compute over every frame of the video, selecting content frames uses only part of the video image frames, which greatly reduces the amount of image computation without losing content.
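A minimal sketch of this selection loop follows; the difference-rate function here is a stand-in (mean absolute pixel difference), whereas the patent computes difference rates from per-frame feature matrices, and the threshold value is illustrative.

    import numpy as np

    def difference_rate(a: np.ndarray, b: np.ndarray) -> float:
        """Stand-in frame difference rate in [0, 1]."""
        return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))) / 255.0)

    def extract_content_frames(shot_frames: list, threshold: float = 0.15) -> list:
        """Keep the first frame, every frame whose difference rate against the
        most recent content frame exceeds the threshold, and the last frame."""
        if len(shot_frames) < 2:
            return list(shot_frames)
        content = [shot_frames[0]]            # first frame is always a content frame
        for frame in shot_frames[1:-1]:       # middle frames
            if difference_rate(frame, content[-1]) > threshold:
                content.append(frame)
        content.append(shot_frames[-1])       # last frame is always a content frame
        return content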
S303, analyzing and calculating the content frames of the video event to obtain the feature data of the video event.
In this embodiment, the feature data of a video event comprises the content frame count of the video event and the feature matrices of its content frames, one feature matrix per content frame. The feature matrix of each content frame may be obtained from the uniform LBP features of the frame, which reflect the content features of the video event well; in alternative examples, the feature matrix may be obtained from other features of the content frame, such as histogram, SIFT, HOG or Haar features, which are not listed here one by one.
In this embodiment, analyzing and calculating the content frames of a video event to obtain its feature data comprises: obtaining the content frame count of the video event from the number of frames in its content frame set, and obtaining the feature vector of the video event from the feature matrix of each content frame in that set.
For example, if the content frame set of video event A contains 5 frames, the content frame count of video event A is 5. Because a video event is the set of all content frames in a shot, the feature vector of the video event can be obtained from the feature matrices of the content frames in that set.
The feature vector of a video event is denoted EV; its dimension is 3481, and EV is given by formula (1):

$EV = (ev_1, ev_2, \ldots, ev_{3481})$   (1)

The value $ev_k$ of the vector EV in dimension k is calculated by formula (2):

$ev_k = \frac{1}{n} \sum_{i=1}^{n} f_{i,k}$   (2)

where $n$ is the number of content frames in the video event and $f_{i,k}$ is the vector value in dimension k of the i-th content frame of the video event.
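A minimal sketch of formulas (1) and (2) follows, assuming each content frame already yields a 3481-dimensional feature vector (3481 = 59 x 59, which would be consistent with a flattened 59 x 59 feature matrix, although the patent does not state this).

    import numpy as np

    K = 3481  # dimensionality of the video event information space

    def event_feature_vector(frame_features: list) -> np.ndarray:
        """Formula (2): ev_k is the mean of f_{i,k} over the n content frames."""
        stacked = np.stack([np.asarray(f, dtype=np.float64).reshape(K)
                            for f in frame_features])     # shape (n, K)
        return stacked.mean(axis=0)                       # shape (K,) = EV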
Here k indexes a dimension of the video event in the video event information space. The video event information space is the multi-dimensional vector space in which video event feature vectors lie; a video event feature vector is calculated from the feature matrices extracted from the content frame set under the same coordinate system. Specifically, the feature vector of each content frame may be obtained from its feature matrix, and the feature vector of the video event from the feature vectors of its content frames.
In the video event information space, every video event has coordinates, through which the distance between video events can be calculated: identical video events have the same coordinates, similar video events are close together, and different video events are far apart. By calculating distances between video events, the space can be divided into a number of regions, with the video event content at the center of each region representing the main content of the whole region. The relationships between circular regions in the space are of three kinds: separation, tangency and intersection, where separation means the regions have no common area, tangency means they share exactly one point (the tangent point), and intersection means they share a common area.
This yields a video event information space as shown in fig. 6, which gives its schematic structure. The four points A, B, C, D in fig. 6 are the centers of the respective circular regions; the radius of each circle represents the maximum distance from its center in the video event information space, and the video event contents A, B, C, D represent the main content of each region. C1 and C2 are video events similar in content to video event C, B1 and B2 are similar to video event B, and D1, D2, D3 are similar to video event D; the distances from C1, C2, B1, B2, D1, D2, D3 to the centers of their respective regions do not exceed the radius.
Based on the video event information space shown in fig. 6, the whole space can be partitioned by selecting center points and designated radii, and a tree structure can then be established according to this partition to record the relationships among the regions, forming a multi-level tree set.
S304, constructing a video event information network in a video event information space according to the characteristic data of the video event.
As the above shows, the video event information space is the multi-dimensional vector space in which the video event feature vectors lie, and it has this regional character, so a forest structure can be built from a multi-level tree set to form the video event information network; that is, the video event information network is a forest structure built on the video event information space from a set of multi-level trees.
Fig. 7 illustrates the tree structure creation process. According to the relationships between regions in the video event information space, the tree structure can be divided into two levels: the first level is the root nodes, corresponding to the centers of the spatial regions, and the second level is the child nodes, corresponding to the non-center points in each region. If a spatial region is subdivided into multiple sub-regions, the tree structure generates corresponding further levels of child nodes; the number of levels of the tree corresponds to the number of levels of regions in the information space. This embodiment is described with a 2-level tree structure as an example.
As shown in fig. 7, a number of multi-level tree structures are obtained from the video event information space, each tree comprising a root node and child nodes. The forest structure built from the set of these multi-level trees is the video event information network. Each child node in the network belongs to at least one root node, and a root node may have no child nodes.
Based on the video event information space and the video event information network described above, the video event information network of this embodiment is constructed from video events: after the feature data of a video event is obtained, the network is built in the video event information space according to that feature data. From the association relations between the trees of video event information it can be seen that the video events of child nodes are similar to those of their root nodes, while the video events of different root nodes are dissimilar, so the video events in the network are associated according to their similarity. Consequently, when video event content analysis is carried out over massive video events, a video event information network containing those events is obtained, in which each root node video event carries most of the video event features of its child node video events. When the network is used for similarity comparison of a target video event, the target can first be compared with the root nodes; if it is similar to a root node, it is then compared with all child nodes of that root node, and all video events in the network similar to the target are obtained quickly.
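A minimal sketch of this forest structure and of the construction rule it supports follows; the similarity predicate is pluggable and stands for the judgment of steps S601 to S604 described below, and all names are illustrative.

    from dataclasses import dataclass, field
    from typing import Any, Callable, List

    @dataclass
    class EventNode:
        event: Any                               # the video event at this node
        children: List["EventNode"] = field(default_factory=list)

    class VideoEventInformationNetwork:
        """Forest of multi-level trees; two levels, as in fig. 7."""
        def __init__(self, similar: Callable[[Any, Any], bool]):
            self.similar = similar               # event-similarity predicate
            self.roots: List[EventNode] = []     # one tree per spatial region

        def add(self, event: Any) -> None:
            """Attach the event under the first similar root, else start a new tree."""
            for root in self.roots:              # compare against root events only
                if self.similar(event, root.event):
                    root.children.append(EventNode(event))
                    return
            self.roots.append(EventNode(event))  # dissimilar to all roots: new root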
Specifically, when the video event information network is first constructed there is no video event in it yet, so the first video event can be taken as the first root node of the network to construct it.
Therefore, in this embodiment, when the video event is the first video event in the video event sequence, constructing a video event information network in the video event information space according to its feature data comprises: taking the first video event as the first root node of the network and constructing the network from it.
Further, once a root node exists in the network, the current video event may be similar or dissimilar to the first video event; that is, when the network is extended according to the feature data of a video event, the node corresponding to the current video event may become either a root node or a child node.
Thus, when the video event is the second video event in the sequence, after constructing the network with the first video event as its first root node, the method further comprises: judging whether the second video event is similar to the first video event; if so, determining that the second video event is a child node of the first root node; if not, determining that the second video event is a second root node; and adding the second video event to the network as a child node of the first root node or as the second root node.
In this embodiment, when the second video event is similar to the first video event, it is taken as a child node of the first video event; when it is dissimilar, it is taken as a separate root node distinct from the first video event. In the constructed video event information network the video events of different root nodes are therefore mutually dissimilar, that is, the difference rate between the video events of any two root nodes is greater than the preset threshold, while the difference rate between each child node and the video event of its root node is less than or equal to the preset threshold.
Fig. 8 shows the method of this embodiment for determining whether the second video event is similar to the first video event. Referring to fig. 8, the determination comprises the following steps S601 to S604:
s601, judging whether the absolute value of the difference value between the number of the content frames of the second video event and the number of the content frames of the first video event is smaller than or equal to a first threshold value, if not, determining that the second video event is dissimilar to the first video event; if yes, go to step S602.
For example, let p denote the first video event and q the second video event, let $n_p$ and $n_q$ be their content frame counts, and let $T_1$ be the first threshold.

If $|n_q - n_p| > T_1$, the two events contain very different numbers of content frames (for example, 9 content frames in the first video event against 20 in the second), and they are directly judged dissimilar.

If $|n_q - n_p| \le T_1$, the two events contain similar numbers of content frames and may or may not be similar. For example, if the first video event has 5 content frames and the second has 7, the 2 content frames the second event has beyond the first may be pictures overlapping the 5 content frames of the first event, or pictures different from all of them; therefore step S602 is executed.
S602, judging whether the difference rate of the feature vectors of the second video event and the first video event is smaller than or equal to a second threshold value, if not, determining that the second video event is dissimilar to the first video event; if yes, go to step S603.
In this embodiment, the feature vector difference rate $D_{EV}(p,q)$ of the second video event and the first video event is calculated by formula (3):

$D_{EV}(p,q) = \dfrac{d_{EV}(p,q)}{\min(modEV_p,\; modEV_q)}$   (3)

where $d_{EV}(p,q)$ is the feature vector difference value of the second video event and the first video event, $modEV_p$ is the modulus of the feature vector of the first video event, $modEV_q$ is the modulus of the feature vector of the second video event, and $\min(modEV_p, modEV_q)$, the smaller of the two moduli, is the denominator and cannot be 0.

The modulus $modEV$ of an event feature vector is calculated by formula (4):

$modEV = \sqrt{\sum_{k=1}^{3481} ev_k^2}$   (4)

where $ev_k$ is the value of the vector EV in dimension k.

The feature vector difference value $d_{EV}(p,q)$ of the second video event and the first video event is calculated by formula (5):

$d_{EV}(p,q) = \sqrt{\sum_{k=1}^{3481} \left(ev_k^p - ev_k^q\right)^2}$   (5)

where p denotes the first video event, q denotes the second video event, $ev_k^p$ is the value of dimension k of the first video event, and $ev_k^q$ is the value of dimension k of the second video event.

The feature vector difference rate of the second video event and the first video event is thus obtained from formulas (3), (4) and (5). If it is greater than the second threshold $T_2$, that is, if $D_{EV}(p,q) > T_2$, the feature vectors of the two events differ substantially, and the second video event is determined to be dissimilar to the first video event.

If the feature vector difference rate is less than or equal to the second threshold, that is, if $D_{EV}(p,q) \le T_2$, the feature vector difference of the two events is small; but $D_{EV}$ cannot by itself serve as the criterion for whether events p and q are similar, since events p and q may still be dissimilar when $D_{EV}$ is small, so step S603 is also executed.
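A minimal sketch of formulas (3) to (5) follows, under the Euclidean reading of the difference value and modulus adopted above.

    import numpy as np

    def feature_vector_difference_rate(ev_p: np.ndarray, ev_q: np.ndarray) -> float:
        """D_EV(p, q): difference of the event feature vectors, normalized by
        the smaller vector modulus (formulas (3) to (5))."""
        d = float(np.linalg.norm(ev_p - ev_q))    # formula (5): difference value
        mod_p = float(np.linalg.norm(ev_p))       # formula (4): modulus
        mod_q = float(np.linalg.norm(ev_q))
        denom = min(mod_p, mod_q)                 # formula (3): denominator
        if denom == 0.0:
            return 0.0 if d == 0.0 else float("inf")  # degenerate case (assumption)
        return d / denom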
S603, judging whether every content frame of the second video event has a target content frame in the first video event and whether the content frame difference rate between each content frame of the second video event and its corresponding target content frame is less than or equal to a third threshold; if not, determining that the first video event is dissimilar to the second video event; if so, executing step S604. A target content frame is obtained by sequentially calculating content frame difference rates between a content frame of the second video event, starting from its first content frame, and the content frames of the first video event.
It should be noted that in this embodiment the content frame count of the first video event is greater than or equal to that of the second video event. When video event A contains more content frames than video event B, B cannot be considered to contain A, because A has content frames that B lacks; A therefore cannot be judged similar to B on that basis, but whether A is similar to B can be judged by checking whether A contains all the content frames of B. That is, when two video events have the same content frame count they can be compared directly, and when the counts differ, the event with fewer content frames is compared into the event with more.
Accordingly, with the content frame count of the first video event greater than or equal to that of the second, the target content frames are obtained by sequentially calculating content frame difference rates between the content frames of the second video event, starting from its first content frame, and the content frames of the first video event.
For example, suppose the first video event has 7 content frames and the second has 5. The first content frame of the second video event is compared, in order, with the first content frame, the second content frame, and so on up to the seventh content frame of the first video event. If its difference rate against the first content frame of the first video event exceeds the preset difference rate threshold (the third threshold) but its difference rate against the second content frame is below it, the second content frame of the first video event becomes the target content frame corresponding to the first content frame of the second video event, the first target content frame. The remaining content frames of the second video event are matched in the same way, in order, against the subsequent content frames of the first video event, yielding 5 target content frames in total. If instead some content frame of the second video event exceeds the preset difference rate threshold against every remaining content frame of the first video event, 5 target content frames cannot be found; that is, a target content frame cannot be found in the first video event for every content frame of the second video event, the second video event is not contained in the first video event, and the two are determined to be dissimilar.
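A minimal sketch of this in-order search follows; the frame-level difference function is passed in (formulas (6) and (7) below give one reconstruction of it), and the threshold value is illustrative.

    from typing import Callable, List, Optional

    def find_target_frames(q_frames: list, p_frames: list,
                           frame_diff: Callable[[object, object], float],
                           t3: float = 0.2) -> Optional[List[float]]:
        """For each content frame of the shorter event q, search forward through
        the longer event p for a target content frame within the third threshold.
        Returns the matched difference rates, or None if some frame of q has
        no target content frame in p (the events are then dissimilar)."""
        rates: List[float] = []
        j = 0                                   # search position in p, kept in order
        for qi in q_frames:
            match = None
            while j < len(p_frames):
                d = frame_diff(qi, p_frames[j])
                j += 1
                if d <= t3:
                    match = d
                    break
            if match is None:
                return None                     # no target frame for this q frame
            rates.append(match)
        return rates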
In this embodiment, the content frame difference rate between a content frame of the second video event and a content frame of the first video event is calculated by formula (6):

$D_F(q_i, p_j) = \begin{cases} 0, & d_0(q_i, p_j) \le T \\ d_0(q_i, p_j) - \varepsilon, & d_0(q_i, p_j) > T \end{cases}$   (6)

where $D_F(q_i, p_j)$ is the content frame difference rate between the i-th content frame $q_i$ of the second video event q, $i \in [1, n_q]$, and the j-th content frame $p_j$ of the first video event p, $j \in [1, n_p]$; $d_0(q_i, p_j)$ is the original difference rate between the two content frames; $\varepsilon$ is the inherent error; $T$ is the preset threshold used in the error calculation; and $T_3$ is the third threshold against which $D_F$ is compared.

The original difference rate $d_0$ is calculated by formula (7):

$d_0(q_i, p_j) = \dfrac{d_F(q_i, p_j)}{\min(modF_{p_j},\; modF_{q_i})}$   (7)

where $d_F(q_i, p_j)$ is the content frame difference value of the i-th content frame of the second video event and the j-th content frame of the first video event, $modF_{p_j}$ is the modulus of the feature matrix of the j-th content frame of the first video event, and $modF_{q_i}$ is the modulus of the feature matrix of the i-th content frame of the second video event; the denominator cannot be 0, and when $modF_{p_j}$ and $modF_{q_i}$ are both 0, $d_0 = 0$.
It will be appreciated that when some content frame of the second video event has no target content frame in the first video event, the two events contain distinctly different content frames and are dissimilar.
When each content frame of the second video event has a target content frame in the first video event but the content frame difference rate of some content frame of the second video event against its target content frame exceeds the third threshold, the difference between that content frame and the first video event is large and distinguishes the second video event from the first, so the two events are determined to be dissimilar.
When every content frame of the second video event has a one-to-one corresponding target content frame in the first video event and the content frame difference rate between each content frame and its target content frame is less than or equal to the third threshold, every single-frame difference rate is small. However, a video event is formed by a plurality of content frames in continuous order, and single-frame difference rates cannot represent the difference rate of the whole video event. Therefore, to obtain a more accurate determination result, in this case step S604 is executed.
S604, judging whether the similarity ratio of the second video event and the first video event is larger than or equal to a fourth threshold value, if so, determining that the second video event is similar to the first video event, and if not, determining that the second video event is dissimilar to the first video event.
In this embodiment, the similarity of the second video event and the first video event is calculated by formula (8):

$S(q,p) = 1 - \dfrac{1}{n_q} \sum_{i=1}^{n_q} D_F(q_i, p_{t_i})$   (8)

where p denotes the first video event, q denotes the second video event, $S(q,p)$ is the similarity of the second video event and the first video event, $D_F(q_i, p_{t_i})$ is the content frame difference rate between the i-th content frame of the second video event and its target content frame $p_{t_i}$ in the first video event, and $n_q$ is the number of content frames of the second video event.

If $S(q,p) \ge T_4$, where $T_4$ is the fourth threshold, the similarity of the two events is high within the allowable error range, and the second video event is determined to be similar to the first video event; if $S(q,p) < T_4$, the difference between the two events is large, and the second video event is determined to be dissimilar to the first video event.
It should be noted that the calculation precision of steps S601 to S604 increases step by step, and so does the calculation amount. When judging the similarity of video event A and video event B in this way, a dissimilar result may be obtained at any one of the steps, while an exact comparison result is obtained only at step S604. When there are many video events to compare, for example content association analysis between video event A and massive video events B, the B video events dissimilar to A can first be eliminated by any one or more of steps S601 to S603, and content association analysis between A and the remaining B video events is then completed by step S604.
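A minimal sketch of the whole S601 to S604 cascade follows, reusing feature_vector_difference_rate and find_target_frames from the earlier sketches; the frame-level function implements the reconstruction of formulas (6) and (7), and every threshold value is an illustrative assumption, the patent leaving the thresholds as design parameters.

    import numpy as np

    def frame_difference_rate(f_q: np.ndarray, f_p: np.ndarray,
                              eps: float = 0.02, t_err: float = 0.05) -> float:
        """Formulas (6) and (7) as reconstructed: normalized feature-matrix
        difference, with rates within the error threshold treated as zero and
        larger rates reduced by the inherent error eps."""
        mod_p, mod_q = float(np.linalg.norm(f_p)), float(np.linalg.norm(f_q))
        if mod_p == 0.0 and mod_q == 0.0:
            return 0.0                          # both moduli 0: d_0 = 0
        denom = min(mod_p, mod_q)
        if denom == 0.0:
            return 1.0                          # one empty frame (assumption)
        d0 = float(np.linalg.norm(f_q - f_p)) / denom
        return 0.0 if d0 <= t_err else max(d0 - eps, 0.0)

    def events_similar(p, q, t1: int = 3, t2: float = 0.3,
                       t3: float = 0.2, t4: float = 0.8) -> bool:
        """p and q expose .content_frames (feature matrices) and .ev (vector);
        the event with more content frames plays the role of p."""
        if len(p.content_frames) < len(q.content_frames):
            p, q = q, p                         # compare the smaller event into the larger
        # S601: content frame counts must be close (first threshold T1)
        if abs(len(p.content_frames) - len(q.content_frames)) > t1:
            return False
        # S602: event feature vectors must be close (second threshold T2)
        if feature_vector_difference_rate(p.ev, q.ev) > t2:
            return False
        # S603: every frame of q needs an in-order target frame in p (T3)
        rates = find_target_frames(q.content_frames, p.content_frames,
                                   frame_difference_rate, t3)
        if rates is None:
            return False
        # S604: overall similarity against the fourth threshold T4 (formula (8))
        similarity = 1.0 - sum(rates) / len(q.content_frames)
        return similarity >= t4

With these pieces, the network of the earlier sketch can be instantiated as VideoEventInformationNetwork(events_similar).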
Continuing the example above: after the network has been extended with the second video event, it may contain one root node and one child node, or two root nodes. Therefore, when the network is extended with the third video event, the fourth video event, and so on up to the Mth video event, the nodes of the current network must be traversed and the Mth video event compared with the root node video events of all root nodes.
When the video event is the Mth video event in the video event sequence, M being an integer greater than or equal to 3, that is, when a large number of video events are to undergo content association analysis, the method of this embodiment further comprises: traversing the nodes of the video event information network to obtain all root nodes and the root node video event corresponding to each; judging whether the Mth video event is similar to at least one root node video event; if so, determining that the Mth video event is a new child node of that root node; if not, determining that the Mth video event is a new root node; and adding the new root node to the set of root nodes of the network, or the new child node to the set of child nodes of the network.
For example, after traversing the root nodes in the video event information network, suppose the only root node video event is A. If the Mth video event is similar to A, the Mth video event is taken as a child node of video event A; if the Mth video event is dissimilar to A, it is taken as a new root node B.
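A minimal data-structure sketch of this insertion step follows; `Node`, `Network` and the reuse of `is_similar` from the earlier sketch are illustrative assumptions, not the patent's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:
    event: Any                                # the video event this node carries
    children: List["Node"] = field(default_factory=list)

@dataclass
class Network:
    roots: List[Node] = field(default_factory=list)

def insert_event(net: Network, event: Any) -> None:
    # Traverse the root set; attach the M-th event under every similar
    # root, otherwise promote it to a new root (roots stay dissimilar).
    similar_roots = [r for r in net.roots if is_similar(event, r.event)]
    if not similar_roots:
        net.roots.append(Node(event))
        return
    child = Node(event)                       # one node object ...
    for root in similar_roots:
        root.children.append(child)           # ... shared by all matching roots
```

Creating a single child `Node` and appending it to every matching root mirrors the property, described below for Fig. 11, that one child node may belong to multiple root nodes.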
In this embodiment, the method for determining whether the Mth video event is similar to at least one root node video event is identical to the method for determining whether the second video event is similar to the first video event in steps S601 to S604; to avoid repetition, the detailed description is omitted here.
This embodiment can realize the construction of a video event information network from massive video events. The root nodes are found by traversing the nodes of the current video event information network, the video event to be added to the network is compared with the root node video events through content analysis, and it is judged whether the video event to be added becomes a root node or a child node. In the constructed video event information network, the video events of any two root nodes are dissimilar, while the video events of a root node and its child nodes are similar, so the video events in the network are associated with one another based on their similarity.
Fig. 9 shows a diagram of the general steps of the video content analysis method for massive video events in one example. As shown in fig. 9, a video is preprocessed and granulated to obtain video events; the feature vector of each video event is calculated, for example through formula 1 and formula 2; the root node set is traversed to find a root node similar to the video event; if such a root node exists, the video event is added as a child node of the similar root node, and if not, the video event is added as a new root node. The feature vector of the video event is used in searching for a similar root node.
According to the video content analysis method for massive videos provided by this embodiment, the video to be processed is preprocessed and granulated to obtain video events, feature data is obtained from the video events, and a video event information network is constructed in the video event information space according to the feature data. Because the video events in the constructed network are associated based on their similarity, content analysis over massive video events yields a video event information network covering those events, in which each root node video event carries most of the video event features of its child node video events. When similarity comparison of a target video event is performed using this network, the target video event is first compared with the root nodes; if the target video event is similar to a root node, it is then compared with all child nodes of that similar root node. In this way, all video events similar to the target video event in the network can be found quickly, the analysis result is obtained quickly, and the efficiency of video event analysis is improved.
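The two-stage search described above (roots first, then only the children of matching roots) could look like the following sketch, reusing `Network`, `Node` and `is_similar` from the earlier blocks:

```python
def find_similar_events(net: Network, target: Any) -> List[Any]:
    # Compare the target only against root events; descend into a
    # root's children only when the root itself already matched,
    # so most subtrees are never visited.
    matches: List[Any] = []
    for root in net.roots:
        if is_similar(target, root.event):
            matches.append(root.event)
            matches.extend(c.event for c in root.children
                           if is_similar(target, c.event))
    return matches
```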
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
The following embodiments of the video content analysis device for massive videos of the present invention may be used to execute the method embodiments of the present invention. For details not disclosed in the device embodiments of the present invention, please refer to the method embodiments of the present invention.
Fig. 10 is a schematic diagram showing the structure of a video content analysis device for massive videos according to an exemplary embodiment of the present invention. The device may be implemented as all or part of a terminal by software, hardware, or a combination of both. The video content analysis device 1000 for massive videos includes:
the video processing module 1001 is configured to obtain a video to be processed, and perform preprocessing on the video to be processed to obtain a normalized video.
The granulating module 1002 is configured to perform granulating processing on the normalized video to obtain a video event sequence. The video event sequence includes at least one video event; a video event refers to the set of all content frames in a shot; the content frames are frames representing the content of the shot and include a first frame, a last frame and N intermediate frames, where N is a natural number; an intermediate frame is obtained when the difference rate between a subframe of the shot (other than the first frame and the last frame) and the previous content frame, calculated in order, is greater than a preset threshold (a sketch of this content-frame extraction is given after the module list below).
The calculation module 1003 is configured to perform analysis and calculation on the content frame of the video event, so as to obtain feature data of the video event.
The construction module 1004 is configured to construct a video event information network in a video event information space according to the feature data of the video events. The video event information network is a forest structure constructed on the basis of a multi-level tree set in the video event information space; the video event information space is the multi-dimensional vector space in which the video event feature vectors are located; and a video event feature vector is calculated after feature matrices are extracted from the content frame set under the same coordinate system.
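As a concrete reading of the granulating module 1002, the sketch below keeps the first frame, then every subframe whose difference rate from the last kept content frame exceeds the preset threshold, and finally the last frame. The threshold value and the reuse of `frame_diff_rate` from the earlier block are illustrative assumptions.

```python
def extract_content_frames(shot_frames: List[Any], threshold: float = 0.3) -> List[Any]:
    # Assumes a shot of at least two frames, each a feature vector.
    # First frame is always a content frame.
    content = [shot_frames[0]]
    # Intermediate frames: compare each remaining subframe, in order,
    # against the most recent content frame and keep it when the
    # difference rate exceeds the preset threshold.
    for frame in shot_frames[1:-1]:
        if frame_diff_rate(frame, content[-1]) > threshold:
            content.append(frame)
    # The last frame closes the shot's content-frame set.
    content.append(shot_frames[-1])
    return content
```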
It should be noted that, when the video content analysis device for a massive video according to the foregoing embodiment performs the video content analysis method for a massive video, only the division of the foregoing functional modules is used as an example, and in practical application, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the video content analysis device for massive videos provided in the foregoing embodiments belongs to the same concept as the video content analysis method embodiment for massive videos, and the implementation process is embodied in the method embodiment, which is not described herein again.
This embodiment provides a video event information network constructed by the above video content analysis method for massive video events. The video event information network is a forest structure constructed on the basis of a multi-level tree set in the video event information space; the video event information space is the multi-dimensional vector space in which the video event feature vectors are located; and a video event feature vector is calculated after feature matrices are extracted from the content frame set under the same coordinate system.
Fig. 11 shows a schematic structural diagram of the video event information network. The video event information network includes a root node set and a child node set; the root node set includes at least one root node, while the child node set may be empty. Root nodes are mutually independent; one root node may have multiple child nodes or none, and one child node may also belong to multiple root nodes. Each root node and each child node corresponds to a video event.
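Continuing the illustrative `Node`/`Network` sketch above, the structure of Fig. 11 amounts to mutually dissimilar roots whose child lists may share node objects:

```python
net = Network()
root_a, root_b = Node("event A"), Node("event B")   # two dissimilar root events
net.roots.extend([root_a, root_b])

shared = Node("event C")        # an event similar to both A and B
root_a.children.append(shared)
root_b.children.append(shared)  # one child node belonging to two root nodes
```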
The embodiment of the application also provides an electronic device corresponding to the video content analysis method of the massive videos provided by the previous embodiment, so as to execute the video content analysis method of the massive videos.
Fig. 12 shows a schematic diagram of an electronic device according to an embodiment of the application. As shown in fig. 12, the electronic device 800 includes: a memory 801 and a processor 802, the memory 801 storing a computer program executable on the processor 802, the processor 802 executing the method provided by any of the preceding embodiments of the application when the computer program is executed.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the steps of the video content analysis method of the above-described massive video by a computer program.
Alternatively, as will be appreciated by those skilled in the art, the structure shown in fig. 12 is merely illustrative, and the electronic device may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, or another terminal device. Fig. 12 does not limit the structure of the above electronic device. For example, the electronic device may also include more or fewer components (such as a network interface) than shown in fig. 12, or have a configuration different from that shown in fig. 12.
The memory 801 may be used to store software programs and modules, such as the program instructions/modules corresponding to the video content analysis method and device for massive videos in the embodiments of the present invention. The processor 802 executes the software programs and modules stored in the memory 801, thereby executing various functional applications and data processing, that is, implementing the above video content analysis method for massive videos. The memory 801 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 801 may further include memory remotely located relative to the processor 802, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 801 may be used for storing, but is not limited to, the video event information network. As an example, the memory 801 may include, but is not limited to, the video processing module, the granulating module, the calculating module, and the constructing module of the above video content analysis device for massive videos, and may further include other module units of the device, which are not described in detail in this example.
Optionally, the electronic device comprises transmission means 803, the transmission means 803 being adapted to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 803 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 803 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a display 804, configured to display an analysis result of video content analysis of the massive video; and a connection bus 805 for connecting the respective module parts in the above-described electronic apparatus.
The present embodiments provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer program is configured to, when executed, perform the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of the video content analysis method for massive videos.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or part of the steps of the methods of the various embodiments of the present invention.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of units is merely a logical functional division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that several modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (11)

1. A method for analyzing video content of a mass of videos, the method comprising:
acquiring a video to be processed, and preprocessing the video to be processed to obtain a normalized video;
granulating the normalized video to obtain a video event sequence, wherein the video event sequence comprises at least one video event, the video event refers to a set of all content frames in a shot, the content frames refer to frames representing shot contents, the frames comprise a first frame, a last frame and N middle frames, N is a natural number, and the middle frames are obtained when the difference rate is larger than a preset threshold value by carrying out difference rate calculation on all subframe sequences of the shot except the first frame and the last frame and the previous content frame;
Analyzing and calculating the content frames of the video event to obtain the characteristic data of the video event;
according to the feature data of the video event, a video event information network is constructed in a video event information space, wherein the video event information network is a forest structure which is constructed based on a multi-level tree set based on the video event information space, the video event information space is a multi-dimensional vector space in which video event feature vectors are located, and the video event feature vectors are obtained by calculating after feature matrices are extracted from a content frame set under the same coordinate system.
2. The method of claim 1, wherein the granulating the normalized video to obtain a sequence of video events comprises:
performing shot detection according to the frame sequence in the normalized video to obtain a shot sequence, wherein the shot sequence comprises at least one shot;
extracting the content frames of the video frame sequence of each shot in the shot sequence to obtain the content frame sequence of each shot;
and obtaining the video event sequence according to the shot sequence and the content frame sequence.
3. The method of claim 1, wherein analyzing the content frames of the video event to obtain the feature data of the video event comprises:
Obtaining the number of the content frames of the video event according to the number of the content frames in the content frame set of the video event;
and obtaining the feature vector of the video event according to the feature matrix of each content frame in the content frame set of the video event.
4. A method according to claim 3, wherein the video event is a first video event in the sequence of video events, constructing a video event information network in a video event information space from the feature data of the video event, comprising:
and taking the first video event as a first root node of a video event information network, and constructing the video event information network.
5. The method of claim 4, wherein the video event is a second video event in the sequence of video events, and wherein after constructing the video event information network with the first video event as a first root node of the video event information network, the method further comprises:
judging whether the second video event is similar to the first video event, if so, determining that the second video event is a child node of the first root node; if not, determining the second video event as a second root node;
A second video event is added to the video event information network as a child of the first root node or the second root node.
6. The method of claim 5, wherein determining whether the second video event is similar to the first video event comprises:
judging whether the absolute value of the difference value between the content frame number of the second video event and the content frame number of the first video event is smaller than or equal to a first threshold value, if not, determining that the second video event is dissimilar to the first video event;
if yes, judging whether the difference rate of the feature vectors of the second video event and the first video event is smaller than or equal to a second threshold value, and if not, determining that the second video event and the first video event are dissimilar;
if yes, judging whether any content frame of the second video event has a target content frame in the first video event, and judging whether the difference rate of each content frame of the second video event and the corresponding content frame of the target content frame is smaller than or equal to a third threshold value, if not, determining that the second video event is dissimilar to the first video event; the target content frame is obtained by sequentially carrying out content frame difference rate calculation on a first content frame of the second video event and a content frame in the first video event;
If yes, judging whether the similarity ratio of the second video event to the first video event is larger than or equal to a fourth threshold value, if yes, determining that the second video event is similar to the first video event, and if not, determining that the second video event is dissimilar to the first video event.
7. The method of claim 5, wherein the video event is an mth video event in the sequence of video events, M being an integer greater than or equal to 3, the method further comprising:
traversing nodes of the video event information network to obtain all root nodes and root node video events corresponding to all root nodes;
judging whether the Mth video event is similar to at least one root node video event, if so, determining that the Mth video event is a new child node of the root node; if not, determining the Mth video event as a new root node;
the new root node is added to a set of root nodes of the video event information network, or the new child node is added to a set of child nodes of the video event information network.
8. A video content analysis apparatus for mass video, the apparatus comprising:
The video processing module is used for acquiring a video to be processed, preprocessing the video to be processed and obtaining a normalized video;
the granulating module is used for granulating the normalized video to obtain a video event sequence, wherein the video event sequence comprises at least one video event, the video event refers to a set of all content frames in a shot, the content frames refer to frames representing shot contents, the frames comprise a first frame, a last frame and N middle frames, N is a natural number, and the middle frames are obtained when the difference rate is larger than a preset threshold value by calculating the difference rate of all subframes of the shot except the first frame and the last frame and the previous content frame;
the computing module is used for analyzing and computing the content frames of the video event to obtain the characteristic data of the video event;
the construction module is used for constructing a video event information network in a video event information space according to the feature data of the video event, wherein the video event information network is a forest structure constructed based on a multi-level tree set based on the video event information space, the video event information space is a multi-dimensional vector space in which video event feature vectors are located, and the video event feature vectors are obtained by calculating after extracting feature matrixes from a content frame set under the same coordinate system.
9. A video event information network, characterized in that it is constructed on the basis of the video content analysis method for massive videos according to any one of claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor runs the computer program to implement the method of any one of claims 1-7.
11. A computer readable storage medium having stored thereon a computer program, wherein the program is executed by a processor to implement the method of any of claims 1-7.