CN111126126B - Intelligent video strip splitting method based on graph convolution neural network - Google Patents

Intelligent video strip splitting method based on graph convolution neural network

Info

Publication number
CN111126126B
CN111126126B CN201910999726.3A
Authority
CN
China
Prior art keywords
frame
node
video
matrix
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910999726.3A
Other languages
Chinese (zh)
Other versions
CN111126126A (en)
Inventor
王中元
裴盈娇
黄宝金
陈何玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910999726.3A priority Critical patent/CN111126126B/en
Publication of CN111126126A publication Critical patent/CN111126126A/en
Application granted granted Critical
Publication of CN111126126B publication Critical patent/CN111126126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent video strip splitting method based on a graph convolution neural network, which comprises three steps: key frame extraction, inter-frame similarity calculation, and frame clustering. First, key frames are extracted based on the inter-frame difference, taking adjacent frames with large differences as key frames; then, a similarity matrix is obtained through a Siamese network, where the elements of the matrix are the Euclidean distances between key frames; finally, a topological graph is constructed from the similarity matrix and the temporal relation between key frames, and the key frames are clustered through a graph convolution neural network to achieve story classification, thereby intelligently splitting the video. The method can accurately divide a video into segments with specific semantics and has significant application value.

Description

Intelligent video strip splitting method based on graph convolution neural network
Technical Field
The invention belongs to the technical field of artificial intelligence and relates to an intelligent video strip splitting method, in particular to an intelligent video strip splitting method based on a graph convolution neural network.
Background Art
With the deep development of the mobile internet and changes in user habits, demand for short videos is growing day by day. At present, most video splitting is done manually by previewing frame by frame, which is time-consuming and labor-intensive and fails to meet the timeliness requirements for the rapid release of new-media audio-visual programs. Applying intelligent video strip splitting technology can therefore greatly improve working efficiency and effectively accelerate the dissemination of new media. Video splitting analyzes unstructured video data in terms of features or structure and rapidly divides a long video into multiple independent short video segments with specific semantics according to the content storyline.
Existing methods for splitting news videos fall roughly into two categories. The first category divides news videos using the spatio-temporal characteristics of shots at topic-unit transitions, such as voice pauses, speaker changes, and the appearance of the anchor. However, such traditional methods are not universal and apply only to certain specific videos.
The second category uses text recognition and stitching, or audio processing, to merge shots belonging to the same story and thereby detect story boundaries. The criterion for judging whether two shots belong to the same story is whether their semantics are related. Story segmentation based on semantic similarity evaluation relies on the visual similarity and temporal distance among shots, and existing research assesses semantic similarity using various audio-visual features extracted from the shots. In news videos, however, the shots of a story do not strictly follow the principle of semantic similarity, so this approach cannot accurately segment story units whose shots lack semantic correlation.
In summary, conventional story segmentation methods are limited to specific video scenes and content and cannot be applied to general video scenes.
Disclosure of Invention
To solve the above technical problem, the invention provides a graph-convolution-based method for intelligent video splitting: features are extracted and the video is split through a Siamese network and a graph convolution network, dividing the video into segments with specific semantics.
The technical scheme adopted by the invention is as follows: an intelligent video strip splitting method based on a graph convolution neural network, which is characterized by comprising the following steps:
step 1: extracting key frames from the original input video by using an inter-frame difference method;
step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
step 3: clustering the key frames by using a graph convolution network to realize intelligent video strip splitting.
Compared with the existing video intelligent strip splitting scheme, the method has the following advantages and positive effects:
1) The method does not depend on specific scene-switching markers of news videos (such as the appearance of the anchor) or on semantic association between shots, can process news videos of any scene, and therefore has strong universality.
2) The method calculates inter-frame similarity, analyzes the temporal relation between key frames, and clusters the key frames with a graph convolution network, so that intelligent video splitting can be realized quickly and accurately.
Drawings
FIG. 1: a flow chart of an embodiment of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, the invention is further described in detail below with reference to the accompanying drawing and implementation examples. It is to be understood that the implementation examples described herein are only for illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the method for intelligently splitting video strips based on the graph convolution neural network provided by the invention comprises the following steps:
step 1: key frame extraction is realized by using an interframe difference method;
the principle of the method is to differentiate two frame images, and the change size of the two frame images is measured by using the average pixel intensity of the obtained images. Whenever a frame in the video has a large change from the previous frame, it is considered as a key frame and extracted. The algorithm flow is briefly described as follows:
step 1.1: and reading the video, and sequentially calculating the interframe difference between every two frames to further obtain the average interframe difference strength. The calculation formula is as follows:
Figure BDA0002240908820000021
wherein f isk(x, y) and fk+1(x, y) are images of the k-th frame and k + 1-th frame, respectively, and w and h are the length and width of the image.
Step 1.2: and selecting a frame with the local maximum value of the average inter-frame difference intensity as a key frame of the video based on the average inter-frame difference intensity obtained in the step 1.1.
Step 1.3: and saving the key frame by utilizing an OpenCV library, and naming the key frame by using the frame number of the key frame in the video.
Step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
the Siamase network can be used to better measure the degree of similarity of two inputs, it has two inputs, each of which is fed into two identical neural networks (CNN1 and CNN2), which share parameters, each of which maps an input to a new space, forming a representation of the input in the new space. Through calculation of the loss function, the similarity of the two inputs is evaluated. The method comprises the following steps:
step 2.1: and constructing a Simase deep learning network model, and constructing two CNN networks which are the same and share the weight in the network model.
Step 2.2: inputting paired picture training network models, extracting characteristic vectors, and calculating the Euclidean distance between frames until the similarity between frames can be judged through the characteristic vectors.
Step 2.3: two adjacent key frames are input into the network model in pairs, and two 128-dimensional vectors are output after convolution, activation, pooling and full connection.
Step 2.4: and calculating Euclidean distances of the two feature vectors, and comparing similarity. D (x)1,x2) The smaller the similarity between vectors, and vice versa. And then establishing an n x n similarity matrix through circulation and iteration, wherein each value in the matrix represents the similarity between two pictures, and n represents n key frames.
D(x_i, x_j) = || x_i - x_j ||_2
where i and j represent key frame sequence numbers and x_i, x_j are the corresponding 128-dimensional feature vectors.
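The following sketch illustrates how such a Siamese embedding and the n × n distance matrix could be computed in PyTorch. The backbone layout, layer sizes and function names are assumptions made only for illustration, and the training loss on paired pictures (step 2.2, e.g. a contrastive loss) is omitted:

```python
# Hedged sketch of step 2: a weight-sharing CNN branch maps each key frame to a
# 128-dimensional vector; the similarity matrix holds pairwise Euclidean distances.
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    """One of the two identical, parameter-sharing CNNs (CNN1 = CNN2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, 128)    # 128-dimensional output (step 2.3)

    def forward(self, x):                        # x: (N, 3, H, W) batch of key frames
        return self.fc(self.features(x).flatten(1))

def similarity_matrix(key_frames, branch):
    """n x n matrix of Euclidean distances D(x_i, x_j) between frame embeddings."""
    with torch.no_grad():
        emb = branch(key_frames)                 # both inputs go through the same weights
    return torch.cdist(emb, emb, p=2)
```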
Step 3: Cluster the key frames using a graph convolution network to realize intelligent video strip splitting.
The graph convolution neural network can process data with a generalized topological graph structure and deeply mine the characteristics and patterns of the data. Each key frame of the video is treated as a node; a topological graph is then constructed from the inter-frame similarity and the temporal relation between frames and fed into the graph convolution neural network, so that similar nodes are gathered together, achieving the effect of video segmentation. The method comprises the following steps:
Step 3.1: Take each key frame as a pivot and construct an instance pivot subgraph G_p(V_p, E_p) according to the similarity matrix and the temporal relation, where V_p is the set of neighbor nodes of pivot p and E_p is the edge set of the instance pivot subgraph of p. For any pivot p, if the similarity between a node and p is greater than 50% and the difference between the node's frame number and the pivot's frame number is within 55, the node is added to V_p; then, in the same manner, search for the neighbors of each node v_i in V_p (i denotes the node number) and establish edges between v_i and its neighbor nodes.
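A minimal sketch of the subgraph construction in step 3.1 is given below. It assumes `sim` is an n × n similarity score normalized to [0, 1] (derived from the distance matrix) and `frame_ids` holds the frame number of each key frame; both names, and the choice to restrict edges to nodes already in V_p, are illustrative assumptions:

```python
# Illustrative sketch of step 3.1: building the instance pivot subgraph G_p(V_p, E_p)
# from the similarity matrix and the temporal (frame-number) relation.
import networkx as nx

def pivot_subgraph(p, sim, frame_ids, sim_thresh=0.5, time_thresh=55):
    n = sim.shape[0]

    def qualifies(a, b):
        # similarity above 50% and frame numbers within 55 of each other (step 3.1 rules)
        return sim[a, b] > sim_thresh and abs(frame_ids[a] - frame_ids[b]) <= time_thresh

    # Neighbor set V_p of pivot p
    V_p = [v for v in range(n) if v != p and qualifies(p, v)]

    G = nx.Graph()
    G.add_nodes_from(V_p)
    # Edge set E_p: connect each node in V_p to its qualifying neighbors within the subgraph
    for v in V_p:
        for u in V_p:
            if u != v and qualifies(v, u):
                G.add_edge(v, u)
    return G
```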
Step 3.2: Input the instance pivot subgraph into the graph convolution neural network for processing, and output, for each node, a score measuring the likelihood that the node is connected to the pivot node. The layer-wise propagation of features is formulated as follows:
H_i = f(H_{i-1}, A), where H_0 = X
f(H_i, A) = σ(A·H_i·W_i)
where H_i is the feature matrix of the i-th layer; when i = 0, H_0 is the node feature matrix X of the input graph. A is the adjacency matrix of the input graph, W_i is the weight matrix of the i-th layer, and σ is the nonlinear activation function. Left-multiplying the feature matrix by the adjacency matrix performs the feature aggregation operation, and right-multiplying by the weight matrix performs the weighting operation. The optimization uses a cross-entropy loss function.
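A minimal PyTorch sketch of this propagation rule is shown below. The hidden size, the two-layer depth and the per-node two-class linkage head are assumptions for illustration (in practice the adjacency matrix A is usually normalized before multiplication):

```python
# Sketch of the propagation rule H_i = sigma(A * H_{i-1} * W_i) and a per-node
# linkage score head trained with cross-entropy; sizes and depth are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # weight matrix W_i

    def forward(self, A, H):
        # Left-multiplying by A aggregates neighbor features;
        # the Linear layer applies the weight matrix W_i on the right.
        return torch.relu(self.W(A @ H))

class LinkageGCN(nn.Module):
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hid_dim)
        self.gc2 = GCNLayer(hid_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 2)    # per-node connect / not-connect score

    def forward(self, A, X):                  # X: node feature matrix H_0
        H = self.gc2(A, self.gc1(A, X))
        return self.score(H)                  # train with nn.CrossEntropyLoss
```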
Step 3.3: Convert the output vectors into probabilities with the Softmax function to obtain a weight matrix for the whole graph, where each weight represents the likelihood that a link exists between a node and the pivot. Pseudo-labels are then propagated with a breadth-first search (BFS) algorithm to merge all nodes that are likely connected. Finally, edges between nodes with low link likelihood are cut off to obtain the final clusters.
The graph-convolution-based intelligent video strip splitting technology provided by the invention clusters key frames using inter-frame similarity and temporal relations, and can process news videos of any scene.
It should be understood that parts of the specification not set forth in detail belong to the prior art. The above description of the preferred embodiments is intended to be illustrative and not to limit the scope of the invention; the scope of protection is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalents thereof, are intended to be embraced by the claims.

Claims (3)

1. An intelligent video strip splitting method based on a graph convolution neural network, characterized by comprising the following steps:
step 1: extracting key frames from the original input video by using an inter-frame difference method;
step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
step 3: clustering the key frames by using a graph convolution network to realize intelligent video strip splitting;
the specific implementation of the step 3 comprises the following substeps:
step 3.1: taking each key frame as a pivot, and constructing an instance pivot subgraph G_p(V_p, E_p) according to the similarity matrix and the time sequence relation, wherein V_p represents the set of neighbor nodes of pivot p, and E_p represents the edge set of the instance pivot subgraph of p; for any pivot p, if the similarity between a node and p is more than 50% and the difference between the frame number of the node and the frame number of the pivot is within 55, the node is added to V_p; then searching in the same manner for the neighbors of each node v_l in V_p and establishing an edge between v_l and its neighbor nodes, wherein l represents a node serial number;
step 3.2: inputting the instance pivot subgraph into a graph convolution neural network for processing, and outputting, for each node, a score measuring the likelihood that the node is connected to the pivot node;
the propagation mode of features between layers is formulated as follows:
H_i = f(H_{i-1}, A), wherein H_0 = X;
f(H_i, A) = σ(A·H_i·W_i);
wherein H_i is the feature matrix of the i-th layer; when i = 0, H_0 is the node feature matrix X of the input graph; A is the adjacency matrix of the input graph; W_i represents the weight matrix of the i-th layer; σ represents the nonlinear activation function; the feature aggregation operation is realized by left-multiplying the feature matrix by the adjacency matrix, and the weighting operation is realized by right-multiplying by the weight matrix; the optimization function uses a cross-entropy loss function;
step 3.3: converting the output vectors into probabilities by using the Softmax probability distribution function to obtain a weight matrix of the whole graph, wherein each weight represents the likelihood that a link exists between a node and the pivot; then propagating pseudo labels by using a breadth-first search algorithm to merge all nodes that are likely connected; and finally, cutting off edges between nodes with low link likelihood to obtain the final clusters.
2. The intelligent video strip splitting method based on the graph convolution neural network as claimed in claim 1, wherein the specific implementation of step 1 comprises the following sub-steps:
step 1.1: reading the video and sequentially calculating the inter-frame difference between every two adjacent frames to obtain the average inter-frame difference intensity D(x, y);
D(x, y) = (1 / (w × h)) · Σ_{x=1..w} Σ_{y=1..h} | f_{k+1}(x, y) - f_k(x, y) |
wherein f_k(x, y) and f_{k+1}(x, y) are the images of the k-th frame and the (k+1)-th frame, respectively, and w and h are the length and width of the images;
step 1.2: based on the average inter-frame difference intensity obtained in step 1.1, selecting frames at which the average inter-frame difference intensity reaches a local maximum as key frames of the video.
3. The intelligent video strip splitting method based on the graph convolution neural network as claimed in claim 2, wherein the specific implementation of step 2 comprises the following sub-steps:
step 2.1: constructing a Siamese deep learning network model, and building two identical, weight-sharing CNN networks in the network model;
step 2.2: inputting paired pictures to train the network model, extracting feature vectors, and calculating the inter-frame Euclidean distance, until the inter-frame similarity can be judged from the feature vectors;
step 2.3: inputting adjacent key frames into the network model in pairs, and outputting two 128-dimensional vectors after convolution, activation, pooling and full connection;
step 2.4: calculating the Euclidean distance D(x_1, x_2) between the two feature vectors and comparing similarity; the smaller D(x_1, x_2) is, the greater the similarity between the vectors, and vice versa; then establishing an n × n similarity matrix through looping and iteration, wherein each value in the matrix represents the similarity between two pictures and n represents the number of key frames;
D(x_i, x_j) = || x_i - x_j ||_2
wherein i and j represent key frame sequence numbers and x_i, x_j are the corresponding 128-dimensional feature vectors.
CN201910999726.3A 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network Active CN111126126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910999726.3A CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910999726.3A CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN111126126A CN111126126A (en) 2020-05-08
CN111126126B true CN111126126B (en) 2022-02-01

Family

ID=70495424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910999726.3A Active CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN111126126B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652899B (en) * 2020-05-29 2023-11-14 中国矿业大学 Video target segmentation method for space-time component diagram
CN111695531B (en) * 2020-06-16 2023-05-23 天津师范大学 Cross-domain pedestrian re-identification method based on heterogeneous convolution network
CN112288047B (en) * 2020-12-25 2021-04-09 成都索贝数码科技股份有限公司 Broadcast television news stripping method based on probability distribution transformation clustering
CN114979742B (en) * 2021-02-24 2024-04-09 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113610001B (en) * 2021-08-09 2024-02-09 西安电子科技大学 Indoor mobile terminal positioning method based on combination of depth camera and IMU
CN113873328B (en) * 2021-09-27 2023-06-27 四川效率源信息安全技术股份有限公司 Method for splitting multi-camera fusion video file into multiple single-camera video files
CN114155193B (en) * 2021-10-27 2022-07-26 北京医准智能科技有限公司 Blood vessel segmentation method and device based on feature enhancement
CN114727356B (en) * 2022-05-16 2022-08-26 北京邮电大学 Unmanned cluster networking method and device and electronic equipment
CN115909174A (en) * 2023-01-06 2023-04-04 中译文娱科技(青岛)有限公司 Video extraction method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279768A (en) * 2013-05-31 2013-09-04 北京航空航天大学 Method for identifying faces in videos based on incremental learning of face partitioning visual representations
EP3120296A1 (en) * 2014-03-21 2017-01-25 The Secretary of State for Defence Recognition of objects within a video
CN107657228A (en) * 2017-09-25 2018-02-02 中国传媒大学 Video scene similarity analysis method and system, video coding-decoding method and system
CN109887282A (en) * 2019-03-05 2019-06-14 中南大学 A kind of road network traffic flow prediction technique based on level timing diagram convolutional network
CN110321958A (en) * 2019-07-08 2019-10-11 北京字节跳动网络技术有限公司 Training method, the video similarity of neural network model determine method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Gaussian-Induced Convolution for Graphs; Jiatao Jiang et al.; arXiv:1811.04393v1; 2018-11-11; pp. 1-9 *
Classifier based on improved deep Siamese network and its application; Shen Yan et al.; Computer Engineering and Applications; 2018-12-31; Vol. 54, No. 10; Sections 2.1-2.2 *
Moving dairy cow target detection based on adaptive nonparametric kernel density estimation algorithm; Song Huaibo et al.; Transactions of the Chinese Society for Agricultural Machinery; 2019-05-31; Vol. 50, No. 5; Section 1.2.2 *

Also Published As

Publication number Publication date
CN111126126A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111126126B (en) Intelligent video strip splitting method based on graph convolution neural network
CN109829443B (en) Video behavior identification method based on image enhancement and 3D convolution neural network
Yin et al. Recurrent convolutional network for video-based smoke detection
CN111506773B (en) Video duplicate removal method based on unsupervised depth twin network
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
US7142602B2 (en) Method for segmenting 3D objects from compressed videos
CN109525892B (en) Video key scene extraction method and device
Sasithradevi et al. A new pyramidal opponent color-shape model based video shot boundary detection
CN112418012B (en) Video abstract generation method based on space-time attention model
CN112750129B (en) Image semantic segmentation model based on feature enhancement position attention mechanism
CN111026914A (en) Training method of video abstract model, video abstract generation method and device
CN106778686A (en) A kind of copy video detecting method and system based on deep learning and graph theory
Zhang et al. Coarse-to-fine object detection in unmanned aerial vehicle imagery using lightweight convolutional neural network and deep motion saliency
Kuang et al. Deep multimodality learning for UAV video aesthetic quality assessment
Gao et al. Video imprint segmentation for temporal action detection in untrimmed videos
CN114973112B (en) Scale self-adaptive dense crowd counting method based on countermeasure learning network
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN109034953B (en) Movie recommendation method
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
Gao et al. A joint local–global search mechanism for long-term tracking with dynamic memory network
CN114821772A (en) Weak supervision time sequence action detection method based on time-space correlation learning
Ayadi et al. Movie scenes detection with MIGSOM based on shots semi-supervised clustering
Wang et al. Very important person localization in unconstrained conditions: A new benchmark
Hu et al. MmFilter: Language-guided video analytics at the edge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant