CN111126126B - Intelligent video strip splitting method based on graph convolution neural network - Google Patents

Intelligent video strip splitting method based on graph convolution neural network

Info

Publication number
CN111126126B
CN111126126B CN201910999726.3A
Authority
CN
China
Prior art keywords
frame
node
video
matrix
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910999726.3A
Other languages
Chinese (zh)
Other versions
CN111126126A (en)
Inventor
王中元
裴盈娇
黄宝金
陈何玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910999726.3A priority Critical patent/CN111126126B/en
Publication of CN111126126A publication Critical patent/CN111126126A/en
Application granted granted Critical
Publication of CN111126126B publication Critical patent/CN111126126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent video strip splitting method based on a graph convolution neural network, which comprises three steps: key frame extraction, inter-frame similarity calculation, and frame clustering. First, key frames are extracted based on the inter-frame difference, taking adjacent frames with large differences as key frames; then, a similarity matrix is obtained through a Siamese network, where the elements of the matrix are the Euclidean distances between key frames; finally, a topological graph is constructed from the similarity matrix and the temporal relation between key frames, and the key frames are clustered through a graph convolution neural network to achieve story classification, thereby intelligently splitting the video. The method can accurately divide a video into segments with specific semantics and has significant application value.

Description

Intelligent video strip splitting method based on graph convolution neural network
Technical Field
The invention belongs to the technical field of artificial intelligence and relates to an intelligent video strip splitting method, in particular to an intelligent video strip splitting method based on a graph convolution neural network.
Background Art
With the deep development of the mobile internet and changes in user habits, demand for short videos is growing day by day. At present, most video splitting is done manually by previewing frame by frame, which is time-consuming and labor-intensive and fails to meet the timeliness requirements for the rapid release of new-media audio-visual programs. Applying intelligent video strip splitting technology can therefore greatly improve working efficiency and effectively accelerate the dissemination of new media. Video splitting analyzes unstructured video data in terms of features or structure and rapidly divides a long video into multiple independent short video segments with specific semantics according to the content storyline.
Existing methods for splitting news videos fall roughly into two categories. The first category divides news videos using the spatio-temporal characteristics of shots at topic-unit transitions, such as voice pauses, speaker changes, and the appearance of the anchor. However, such traditional methods are not universal and apply only to certain specific videos.
The second category uses text recognition and stitching, or audio processing, to merge shots belonging to the same story and thereby detect story boundaries. The criterion for judging whether two shots belong to the same story is whether their semantics are related. Story segmentation based on semantic similarity evaluation relies on the visual similarity and temporal distance among shots, and existing research assesses semantic similarity using various audio-visual features extracted from the shots. In news videos, however, the shots of a story do not strictly follow the principle of semantic similarity, so this approach cannot accurately segment story units whose shots lack semantic correlation.
In summary, conventional story segmentation methods are limited to specific video scenes and content and cannot be applied to general video scenes.
Disclosure of Invention
To solve the above technical problem, the invention provides a graph-convolution-based method for intelligent video splitting: features are extracted and the video is split through a Siamese network and a graph convolution network, dividing the video into segments with specific semantics.
The technical scheme adopted by the invention is as follows: an intelligent video strip splitting method based on a graph convolution neural network, which is characterized by comprising the following steps:
step 1: extracting key frames from the original input video by using an inter-frame difference method;
step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
step 3: clustering the key frames by using a graph convolution network to realize intelligent video strip splitting.
Compared with the existing video intelligent strip splitting scheme, the method has the following advantages and positive effects:
1) The method does not depend on specific scene-switching markers of news videos (such as the appearance of the anchor) or on semantic association between shots, can process news videos of any scene, and therefore has strong universality.
2) The method calculates inter-frame similarity, analyzes the temporal relation between key frames, and clusters the key frames with a graph convolution network, so that intelligent video splitting can be realized quickly and accurately.
Drawings
FIG. 1: a flow chart of an embodiment of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, the invention is further described in detail below with reference to the accompanying drawing and implementation examples. It is to be understood that the implementation examples described herein are only for illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the method for intelligently splitting video strips based on the graph convolution neural network provided by the invention comprises the following steps:
step 1: key frame extraction is realized by using an interframe difference method;
the principle of the method is to differentiate two frame images, and the change size of the two frame images is measured by using the average pixel intensity of the obtained images. Whenever a frame in the video has a large change from the previous frame, it is considered as a key frame and extracted. The algorithm flow is briefly described as follows:
step 1.1: and reading the video, and sequentially calculating the interframe difference between every two frames to further obtain the average interframe difference strength. The calculation formula is as follows:
Figure BDA0002240908820000021
wherein f isk(x, y) and fk+1(x, y) are images of the k-th frame and k + 1-th frame, respectively, and w and h are the length and width of the image.
Step 1.2: and selecting a frame with the local maximum value of the average inter-frame difference intensity as a key frame of the video based on the average inter-frame difference intensity obtained in the step 1.1.
Step 1.3: and saving the key frame by utilizing an OpenCV library, and naming the key frame by using the frame number of the key frame in the video.
Step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
the Siamase network can be used to better measure the degree of similarity of two inputs, it has two inputs, each of which is fed into two identical neural networks (CNN1 and CNN2), which share parameters, each of which maps an input to a new space, forming a representation of the input in the new space. Through calculation of the loss function, the similarity of the two inputs is evaluated. The method comprises the following steps:
step 2.1: and constructing a Simase deep learning network model, and constructing two CNN networks which are the same and share the weight in the network model.
Step 2.2: inputting paired picture training network models, extracting characteristic vectors, and calculating the Euclidean distance between frames until the similarity between frames can be judged through the characteristic vectors.
Step 2.3: two adjacent key frames are input into the network model in pairs, and two 128-dimensional vectors are output after convolution, activation, pooling and full connection.
Step 2.4: and calculating Euclidean distances of the two feature vectors, and comparing similarity. D (x)1,x2) The smaller the similarity between vectors, and vice versa. And then establishing an n x n similarity matrix through circulation and iteration, wherein each value in the matrix represents the similarity between two pictures, and n represents n key frames.
D(x_i, x_j) = || x_i - x_j ||_2
where i and j represent key frame sequence numbers and x_i, x_j are the corresponding 128-dimensional feature vectors.
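The following sketch illustrates how such a Siamese embedding and the n × n distance matrix could be computed in PyTorch. The backbone layout, layer sizes and function names are assumptions made only for illustration, and the training loss on paired pictures (step 2.2, e.g. a contrastive loss) is omitted:

```python
# Hedged sketch of step 2: a weight-sharing CNN branch maps each key frame to a
# 128-dimensional vector; the similarity matrix holds pairwise Euclidean distances.
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    """One of the two identical, parameter-sharing CNNs (CNN1 = CNN2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, 128)    # 128-dimensional output (step 2.3)

    def forward(self, x):                        # x: (N, 3, H, W) batch of key frames
        return self.fc(self.features(x).flatten(1))

def similarity_matrix(key_frames, branch):
    """n x n matrix of Euclidean distances D(x_i, x_j) between frame embeddings."""
    with torch.no_grad():
        emb = branch(key_frames)                 # both inputs go through the same weights
    return torch.cdist(emb, emb, p=2)
```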
Step 3: Cluster the key frames using a graph convolution network to realize intelligent video strip splitting.
The graph convolution neural network can process data with a generalized topological graph structure and deeply mine the characteristics and patterns of the data. Each key frame of the video is treated as a node; a topological graph is then constructed from the inter-frame similarity and the temporal relation between frames and fed into the graph convolution neural network, so that similar nodes are gathered together, achieving the effect of video segmentation. The method comprises the following steps:
Step 3.1: Take each key frame as a pivot and construct an instance pivot subgraph G_p(V_p, E_p) according to the similarity matrix and the temporal relation, where V_p is the set of neighbor nodes of pivot p and E_p is the edge set of the instance pivot subgraph of p. For any pivot p, if the similarity between a node and p is greater than 50% and the difference between the node's frame number and the pivot's frame number is within 55, the node is added to V_p; then, in the same manner, search for the neighbors of each node v_i in V_p (i denotes the node number) and establish edges between v_i and its neighbor nodes.
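A minimal sketch of the subgraph construction in step 3.1 is given below. It assumes `sim` is an n × n similarity score normalized to [0, 1] (derived from the distance matrix) and `frame_ids` holds the frame number of each key frame; both names, and the choice to restrict edges to nodes already in V_p, are illustrative assumptions:

```python
# Illustrative sketch of step 3.1: building the instance pivot subgraph G_p(V_p, E_p)
# from the similarity matrix and the temporal (frame-number) relation.
import networkx as nx

def pivot_subgraph(p, sim, frame_ids, sim_thresh=0.5, time_thresh=55):
    n = sim.shape[0]

    def qualifies(a, b):
        # similarity above 50% and frame numbers within 55 of each other (step 3.1 rules)
        return sim[a, b] > sim_thresh and abs(frame_ids[a] - frame_ids[b]) <= time_thresh

    # Neighbor set V_p of pivot p
    V_p = [v for v in range(n) if v != p and qualifies(p, v)]

    G = nx.Graph()
    G.add_nodes_from(V_p)
    # Edge set E_p: connect each node in V_p to its qualifying neighbors within the subgraph
    for v in V_p:
        for u in V_p:
            if u != v and qualifies(v, u):
                G.add_edge(v, u)
    return G
```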
Step 3.2: Input the instance pivot subgraph into the graph convolution neural network for processing, and output, for each node, a score measuring the likelihood that the node is connected to the pivot node. The layer-wise propagation of features is formulated as follows:
H_i = f(H_{i-1}, A), where H_0 = X
f(H_i, A) = σ(A·H_i·W_i)
where H_i is the feature matrix of the i-th layer; when i = 0, H_0 is the node feature matrix X of the input graph. A is the adjacency matrix of the input graph, W_i is the weight matrix of the i-th layer, and σ is the nonlinear activation function. Left-multiplying the feature matrix by the adjacency matrix performs the feature aggregation operation, and right-multiplying by the weight matrix performs the weighting operation. The optimization uses a cross-entropy loss function.
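A minimal PyTorch sketch of this propagation rule is shown below. The hidden size, the two-layer depth and the per-node two-class linkage head are assumptions for illustration (in practice the adjacency matrix A is usually normalized before multiplication):

```python
# Sketch of the propagation rule H_i = sigma(A * H_{i-1} * W_i) and a per-node
# linkage score head trained with cross-entropy; sizes and depth are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # weight matrix W_i

    def forward(self, A, H):
        # Left-multiplying by A aggregates neighbor features;
        # the Linear layer applies the weight matrix W_i on the right.
        return torch.relu(self.W(A @ H))

class LinkageGCN(nn.Module):
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hid_dim)
        self.gc2 = GCNLayer(hid_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 2)    # per-node connect / not-connect score

    def forward(self, A, X):                  # X: node feature matrix H_0
        H = self.gc2(A, self.gc1(A, X))
        return self.score(H)                  # train with nn.CrossEntropyLoss
```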
Step 3.3: Convert the output vectors into probabilities with the Softmax function to obtain a weight matrix for the whole graph, where each weight represents the likelihood that a link exists between a node and the pivot. Pseudo-labels are then propagated with a breadth-first search (BFS) algorithm to merge all nodes that are likely connected. Finally, edges between nodes with low link likelihood are cut off to obtain the final clusters.
The graph-convolution-based intelligent video strip splitting technology provided by the invention clusters key frames using inter-frame similarity and temporal relations, and can process news videos of any scene.
It should be understood that parts of the specification not set forth in detail belong to the prior art. The above description of the preferred embodiments is intended to be illustrative and not to limit the scope of the invention; the scope of protection is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalents thereof, are intended to be embraced by the claims.

Claims (3)

1. An intelligent video strip splitting method based on a graph convolution neural network, characterized by comprising the following steps:
step 1: extracting key frames from the original input video by using an inter-frame difference method;
step 2: constructing a Siamese deep learning network, extracting feature vectors of the key frames, and establishing a similarity matrix by calculating inter-frame Euclidean distances;
step 3: clustering the key frames by using a graph convolution network to realize intelligent video strip splitting;
the specific implementation of the step 3 comprises the following substeps:
step 3.1: taking each key frame as a pivot, and constructing an instance pivot subgraph G_p(V_p, E_p) according to the similarity matrix and the time sequence relation, wherein V_p represents the set of neighbor nodes of pivot p, and E_p represents the edge set of the instance pivot subgraph of p; for any pivot p, if the similarity between a node and p is more than 50% and the difference between the frame number of the node and the frame number of the pivot is within 55, the node is added to V_p; then searching in the same manner for the neighbors of each node v_l in V_p and establishing an edge between v_l and its neighbor nodes, wherein l represents a node serial number;
step 3.2: inputting the instance pivot subgraph into a graph convolution neural network for processing, and outputting, for each node, a score measuring the likelihood that the node is connected to the pivot node;
the propagation mode of features between layers is formulated as follows:
H_i = f(H_{i-1}, A), wherein H_0 = X;
f(H_i, A) = σ(A·H_i·W_i);
wherein H_i is the feature matrix of the i-th layer; when i = 0, H_0 is the node feature matrix X of the input graph; A is the adjacency matrix of the input graph; W_i represents the weight matrix of the i-th layer; σ represents the nonlinear activation function; the feature aggregation operation is realized by left-multiplying the feature matrix by the adjacency matrix, and the weighting operation is realized by right-multiplying by the weight matrix; the optimization function uses a cross-entropy loss function;
step 3.3: converting the output vectors into probabilities by using the Softmax probability distribution function to obtain a weight matrix of the whole graph, wherein each weight represents the likelihood that a link exists between a node and the pivot; then propagating pseudo labels by using a breadth-first search algorithm to merge all nodes that are likely connected; and finally, cutting off edges between nodes with low link likelihood to obtain the final clusters.
2. The intelligent video strip splitting method based on the graph convolution neural network as claimed in claim 1, wherein the specific implementation of step 1 comprises the following sub-steps:
step 1.1: reading the video and sequentially calculating the inter-frame difference between every two adjacent frames to obtain the average inter-frame difference intensity D(x, y);
D(x, y) = (1 / (w × h)) · Σ_{x=1..w} Σ_{y=1..h} | f_{k+1}(x, y) - f_k(x, y) |
wherein f_k(x, y) and f_{k+1}(x, y) are the images of the k-th frame and the (k+1)-th frame, respectively, and w and h are the length and width of the images;
step 1.2: based on the average inter-frame difference intensity obtained in step 1.1, selecting frames at which the average inter-frame difference intensity reaches a local maximum as key frames of the video.
3. The intelligent video strip splitting method based on the graph convolution neural network as claimed in claim 2, wherein the specific implementation of step 2 comprises the following sub-steps:
step 2.1: constructing a Siamese deep learning network model, and building two identical, weight-sharing CNN networks in the network model;
step 2.2: inputting paired pictures to train the network model, extracting feature vectors, and calculating the inter-frame Euclidean distance, until the inter-frame similarity can be judged from the feature vectors;
step 2.3: inputting adjacent key frames into the network model in pairs, and outputting two 128-dimensional vectors after convolution, activation, pooling and full connection;
step 2.4: calculating the Euclidean distance D(x_1, x_2) between the two feature vectors and comparing similarity; the smaller D(x_1, x_2) is, the greater the similarity between the vectors, and vice versa; then establishing an n × n similarity matrix through looping and iteration, wherein each value in the matrix represents the similarity between two pictures and n represents the number of key frames;
D(x_i, x_j) = || x_i - x_j ||_2
wherein i and j represent key frame sequence numbers and x_i, x_j are the corresponding 128-dimensional feature vectors.
CN201910999726.3A 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network Active CN111126126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910999726.3A CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910999726.3A CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN111126126A CN111126126A (en) 2020-05-08
CN111126126B true CN111126126B (en) 2022-02-01

Family

ID=70495424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910999726.3A Active CN111126126B (en) 2019-10-21 2019-10-21 Intelligent video strip splitting method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN111126126B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111652899B (en) * 2020-05-29 2023-11-14 中国矿业大学 Video target segmentation method for space-time component diagram
CN111695531B (en) * 2020-06-16 2023-05-23 天津师范大学 Cross-domain pedestrian re-identification method based on heterogeneous convolution network
CN112288047B (en) * 2020-12-25 2021-04-09 成都索贝数码科技股份有限公司 Broadcast television news stripping method based on probability distribution transformation clustering
CN114979742B (en) * 2021-02-24 2024-04-09 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113610001B (en) * 2021-08-09 2024-02-09 西安电子科技大学 Indoor mobile terminal positioning method based on combination of depth camera and IMU
CN113873328B (en) * 2021-09-27 2023-06-27 四川效率源信息安全技术股份有限公司 Method for splitting multi-camera fusion video file into multiple single-camera video files
CN114155193B (en) * 2021-10-27 2022-07-26 北京医准智能科技有限公司 Blood vessel segmentation method and device based on feature enhancement
CN114727356B (en) * 2022-05-16 2022-08-26 北京邮电大学 Unmanned cluster networking method and device and electronic equipment
CN115909174A (en) * 2023-01-06 2023-04-04 中译文娱科技(青岛)有限公司 Video extraction method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279768A (en) * 2013-05-31 2013-09-04 北京航空航天大学 Method for identifying faces in videos based on incremental learning of face partitioning visual representations
EP3120296A1 (en) * 2014-03-21 2017-01-25 The Secretary of State for Defence Recognition of objects within a video
CN107657228A (en) * 2017-09-25 2018-02-02 中国传媒大学 Video scene similarity analysis method and system, video coding-decoding method and system
CN109887282A (en) * 2019-03-05 2019-06-14 中南大学 A kind of road network traffic flow prediction technique based on level timing diagram convolutional network
CN110321958A (en) * 2019-07-08 2019-10-11 北京字节跳动网络技术有限公司 Training method, the video similarity of neural network model determine method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Gaussian-Induced Convolution for Graphs; Jiatao Jiang et al.; arXiv:1811.04393v1; 2018-11-11; pp. 1-9 *
Classifier based on improved deep Siamese network and its application; Shen Yan et al.; Computer Engineering and Applications; 2018-12-31; Vol. 54, No. 10; Sections 2.1-2.2 *
Moving dairy cow target detection based on adaptive nonparametric kernel density estimation algorithm; Song Huaibo et al.; Transactions of the Chinese Society for Agricultural Machinery; 2019-05-31; Vol. 50, No. 5; Section 1.2.2 *

Also Published As

Publication number Publication date
CN111126126A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111126126B (en) Intelligent video strip splitting method based on graph convolution neural network
CN109829443B (en) Video behavior identification method based on image enhancement and 3D convolution neural network
Yin et al. Recurrent convolutional network for video-based smoke detection
CN111506773B (en) Video duplicate removal method based on unsupervised depth twin network
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
US7142602B2 (en) Method for segmenting 3D objects from compressed videos
CN109525892B (en) Video key scene extraction method and device
Sasithradevi et al. A new pyramidal opponent color-shape model based video shot boundary detection
CN112418012B (en) Video abstract generation method based on space-time attention model
CN112750129B (en) Image semantic segmentation model based on feature enhancement position attention mechanism
CN111026914A (en) Training method of video abstract model, video abstract generation method and device
CN106778686A (en) A kind of copy video detecting method and system based on deep learning and graph theory
Zhang et al. Coarse-to-fine object detection in unmanned aerial vehicle imagery using lightweight convolutional neural network and deep motion saliency
Kuang et al. Deep multimodality learning for UAV video aesthetic quality assessment
Gao et al. Video imprint segmentation for temporal action detection in untrimmed videos
CN114973112B (en) Scale self-adaptive dense crowd counting method based on countermeasure learning network
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN109034953B (en) Movie recommendation method
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
Gao et al. A joint local–global search mechanism for long-term tracking with dynamic memory network
CN114821772A (en) Weak supervision time sequence action detection method based on time-space correlation learning
Ayadi et al. Movie scenes detection with MIGSOM based on shots semi-supervised clustering
Wang et al. Very important person localization in unconstrained conditions: A new benchmark
Hu et al. MmFilter: Language-guided video analytics at the edge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant