CN110097026A - Paragraph association rule evaluation method based on multi-dimensional element video segmentation - Google Patents

Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Info

Publication number
CN110097026A
CN110097026A
Authority
CN
China
Prior art keywords
video
frame
segmentation
audio
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910395119.6A
Other languages
Chinese (zh)
Other versions
CN110097026B (en)
Inventor
胡燕祝
田雯嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910395119.6A priority Critical patent/CN110097026B/en
Publication of CN110097026A publication Critical patent/CN110097026A/en
Application granted granted Critical
Publication of CN110097026B publication Critical patent/CN110097026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a paragraph association rule evaluation method based on multi-dimensional element video segmentation. The method comprises: step 1, video parsing; step 2, key frame extraction for scene segmentation; step 3, scene segmentation based on key frames; step 4, audio segmentation of the video; step 5, semantic segmentation of the video; step 6, paragraph association rule evaluation of the segmented video with a GNN network; step 7, construction of an association network. After segmenting the same video along multiple dimensions, the invention matches the corresponding multi-dimensional elements by constructing paragraph association rules. Compared with other paragraph association rule evaluation methods for video segmentation, the invention combines the variation of pixels over time with the correlation between adjacent frames in the image sequence, achieves a good segmentation of the video in the image dimension, preserves the key information of the video, and can therefore provide an effective paragraph association rule evaluation method for multi-dimensional element video segmentation.

Description

Paragraph association rule evaluation method based on multi-dimensional element video segmentation
Technical field
The invention relates to paragraph association rule evaluation methods, and more particularly to a paragraph association rule evaluation method based on multi-dimensional element video segmentation.
Background technique
Current work on video structuring mostly segments video along the single dimension of the image; little of it studies video structuring techniques based on multi-dimensional segmentation. In practice, however, the audio information, text information and so on contained in a video also play an important role in video surveillance work. In addition, when key frames are extracted to segment the moving objects in a video, a single frame is often simply taken as the key frame for the sake of computational efficiency, which overlooks important information contained in the video; alternatively, key frames are chosen by comparing the visual features of successive video frames against a preset threshold. Neither approach considers the variation of pixels over time in the image sequence, the correlation between adjacent frames, or the correspondence between the previous frame and the current frame. Meanwhile, after the same video has been segmented in the three dimensions of scene, sound and text, video segments covering different time periods are obtained. The segments produced in these three dimensions cannot be aligned perfectly, and overlaps occur. It is therefore necessary to establish a paragraph association rule evaluation method that can fully match the three-dimensional elements of scene, sound and text.
Video structuring is now widely applied, for example in fire-fighting facility monitoring systems in public places, in public safety, and in safe-city applications. With the large-scale deployment of urban video surveillance systems, video surveillance has reached every corner of the city, and industries such as intelligent transportation, government regulation and enterprise operation generate large amounts of surveillance video data. As edge computing, cloud computing and big-data technologies continue to mature, the problems of huge data volume, difficult storage and inconvenient retrieval become increasingly prominent. For large-scale real-time surveillance video, image-processing tasks such as real-time spatio-temporal annotation, character extraction, feature extraction, object classification and structured labelling must be performed on the video stream, and the results quickly transferred to a computing centre for processing. A paragraph association rule evaluation method for multi-dimensional element video segmentation that can quickly and accurately match scene, sound and text therefore needs to be constructed, providing government and enterprise operations with a real-time and efficient means of surveillance.
Summary of the invention
To address the above problems in the prior art, the invention provides a paragraph association rule evaluation method based on multi-dimensional element video segmentation; the overall procedure is shown in Figure 1.
The technical solution is implemented in the following steps:
Step 1: video parsing.
The first step of video parsing is data reception: the video must be demultiplexed and decomposed into a picture track, an audio track and a subtitle track.
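As an illustration of this demultiplexing step, the following is a minimal sketch using the ffmpeg command-line tool from Python; the file names, stream indices and the presence of a text subtitle stream are assumptions made purely for illustration, since the invention does not prescribe a particular tool.
```python
# A minimal demultiplexing sketch: split a container into separate
# video, audio and subtitle tracks with ffmpeg. File names and the
# existence of a subtitle stream are illustrative assumptions.
import subprocess

def demux(path: str) -> None:
    subprocess.run([
        "ffmpeg", "-i", path,
        "-map", "0:v:0", "-c", "copy", "video_track.mp4",   # picture track
        "-map", "0:a:0", "-c", "copy", "audio_track.aac",   # audio track
        "-map", "0:s:0", "subtitle_track.srt",              # subtitle track
    ], check=True)

demux("surveillance.mp4")
```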
Step 2: key frame extraction for scene segmentation.
Key frame extraction methods fall broadly into five classes; the specific methods are shown in Figure 2.
(1) Boundary-based key frame extraction. This method directly takes the first and last frames, or the middle frame, of each shot as key frames. Its computational cost is low, which suits shots whose content moves little or remains unchanged.
(2) Visual-feature-based key frame extraction. This method first selects the first frame as the most recent key frame; subsequent frames are then compared with it in turn on visual features including colour, motion, edges, shape and spatial relationships. As soon as the difference between the current frame and the most recent key frame exceeds a predetermined threshold, the current frame is chosen as a new key frame.
(3) Cluster-based key frame extraction. Methods of this kind apply clustering: all frames of a shot are clustered, the key clusters are chosen from these clusters according to some criterion, such as the number of frames in each cluster, and the frame with the smallest clustering parameter in each key cluster is then chosen as a key frame (a sketch of this approach follows the list).
(4) Multi-mode key frame extraction. Methods of this kind mainly imitate human perception to simplify the analysis of video content, usually combining video, audio, text and so on. In videos such as films and sports broadcasts, for example, a scene switch often changes the video and audio content simultaneously, so a multi-mode extraction method is needed: when the audio and video features at a shot boundary both change greatly, that boundary marks a new scene.
(5) Compressed-domain key frame extraction. Compressed-domain methods extract key frames directly from an MPEG compressed video stream without decompressing it, or with only partial decompression, which reduces computational complexity.
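The following is a minimal sketch of the cluster-based approach (3), assuming OpenCV colour histograms as frame features and k-means as the clustering technique; the invention fixes neither choice, and taking the frame nearest each cluster centre stands in for the "smallest clustering parameter" criterion.
```python
# Cluster-based key-frame extraction sketch: describe every frame by a
# normalised colour histogram, cluster with k-means, and keep the frame
# closest to each cluster centre as a key frame. OpenCV and
# scikit-learn are assumptions; the patent names no library.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def key_frames(video_path: str, k: int = 5) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    hists = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hists.append(cv2.normalize(h, h).flatten())
    cap.release()
    X = np.array(hists)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    # index of the frame nearest each cluster centre
    return [int(np.argmin(np.linalg.norm(X - c, axis=1)))
            for c in km.cluster_centers_]
```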
Step 3: scene segmentation based on key frames.
This mainly involves the following three detection approaches (a sketch of the first follows the list):
(1) Detection based on inter-frame difference. The frame difference method obtains the contour of a moving object by computing the difference between two adjacent frames in the video image sequence; it is well suited to scenes that contain multiple moving objects and a moving camera.
(2) Detection based on background difference. Background subtraction is a general method for motion segmentation of static scenes: the currently acquired image frame is differenced against a background image to obtain a grey-scale map of the moving target region, which is thresholded to extract the moving region. To avoid the influence of ambient lighting changes, the background image is updated according to the currently acquired frame. Details are shown in Figure 3.
(3) Detection based on optical flow. The optical flow method uses the variation of pixels over time in the image sequence and the correlation between adjacent frames to compute, from the correspondence between the previous frame and the current frame, the motion information of objects between adjacent frames.
(4) The segmented video can be represented as x1, …, xi, where x denotes a time period of the segmented video and i the number of video segments.
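The following is a minimal sketch of inter-frame difference detection (1), assuming OpenCV and a mean-absolute-difference threshold chosen purely for illustration.
```python
# Frame-difference boundary detector sketch: the mean absolute
# difference between consecutive grey-scale frames is thresholded to
# flag candidate segment boundaries x_i. The threshold is illustrative.
import cv2

def scene_boundaries(video_path: str, thresh: float = 30.0) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    boundaries, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None and cv2.absdiff(gray, prev).mean() > thresh:
            boundaries.append(idx)      # start of a new segment
        prev, idx = gray, idx + 1
    cap.release()
    return boundaries
```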
Step 4: audio segmentation of the video.
The EMD-based audio segmentation method proceeds as follows (a sketch of one sifting iteration follows the list):
(1) For the original audio data sequence X(t), determine all local maxima and fit them with a cubic spline function to form the upper envelope of the original data.
(2) Find all local minima and fit them with a cubic spline function to form the lower envelope of the data.
(3) Denote the mean of the upper and lower envelopes by m1, and subtract the mean envelope m1 from the original data sequence X(t) to obtain a new audio data sequence h1, as shown by the equation:
h1 = X(t) - m1
(4) Cluster and segment the audio data obtained from the EMD decomposition.
(5) The segmented audio can be represented as y1, …, yj, where y denotes a time period of the segmented audio and j the number of audio segments.
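The following is a sketch of one EMD sifting iteration covering steps (1) to (3), assuming SciPy's cubic spline and extremum routines; the cluster segmentation of steps (4) and (5) is omitted.
```python
# One EMD sifting iteration: cubic-spline upper and lower envelopes
# through the local extrema, their mean m1, and the residual
# h1 = X(t) - m1. SciPy is an assumption; the patent names no library.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x: np.ndarray) -> np.ndarray:
    t = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]    # all local maxima
    minima = argrelextrema(x, np.less)[0]       # all local minima
    upper = CubicSpline(maxima, x[maxima])(t)   # upper envelope
    lower = CubicSpline(minima, x[minima])(t)   # lower envelope
    m1 = (upper + lower) / 2.0                  # mean envelope
    return x - m1                               # h1 = X(t) - m1
```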
Step 5: semantic segmentation of the video.
The semantic segmentation of paragraphs mainly comprises the following aspects (a toy sketch follows the list):
(1) Define semantic chunks. A semantic chunk divides a sentence into several relatively independent semantic units whose length lies between the word sense and the sentence sense; it is a preprocessing device that combines syntax, semantics and pragmatics. Semantic chunks are mutually non-recursive, non-nested and non-overlapping.
(2) Divide the sentence sense. Natural language processing usually requires analysis at three levels: syntax, semantics and context. Statistical processing for text word segmentation and part-of-speech tagging is therefore performed first; after the words have been classified they are rapidly annotated, semantic recombination is then carried out on the words, and finally the sentence sense is segmented according to the semantic chunks defined above.
(3) The segmented paragraphs can be represented as z1, …, zk, where z denotes a time period of the segmented text and k the number of text segments.
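The following is a toy sketch of the word segmentation and part-of-speech tagging that precede semantic-chunk division, assuming the jieba library and a naive rule that closes a chunk at punctuation; the invention defines semantic chunks but prescribes no concrete algorithm.
```python
# Toy chunking sketch: Chinese word segmentation plus POS tagging with
# jieba, with chunks closed at punctuation (jieba flags punctuation
# as 'x'). The chunking rule is an illustrative assumption.
import jieba.posseg as pseg

def semantic_chunks(sentence: str) -> list[list[tuple[str, str]]]:
    chunks, current = [], []
    for token in pseg.cut(sentence):        # yields (word, POS flag) pairs
        if token.flag == "x":
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((token.word, token.flag))
    if current:
        chunks.append(current)
    return chunks

print(semantic_chunks("行人止步，车辆拥堵现象严重。"))
```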
Step 6: paragraph association rule evaluation of the segmented video with a GNN network.
A graph neural network (GNN, Graph Neural Network) can effectively model the relationships or interactions between objects. After the same video has been segmented in the three dimensions of scene, sound and paragraph as above, video segments covering different time periods are obtained; the segments produced in the three dimensions cannot be aligned perfectly, and overlaps occur. The invention therefore uses a GNN to evaluate the association between the video paragraphs obtained from the above segmentation. Let t denote each second of video; GNN(t|x), GNN(t|y) and GNN(t|z) denote the feature vectors currently extracted from the segments in each dimension.
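The following is an illustrative sketch of this evaluation, assuming the simplest possible GNN update (one round of mean-neighbour message passing over segment feature vectors, with edges between temporally overlapping segments) and cosine similarity as the association score; the invention names a GNN but fixes no architecture, so everything below is an assumption.
```python
# Minimal GNN-style association sketch: segment feature vectors are
# graph nodes, adj marks temporally overlapping segments, one round of
# mean-aggregation message passing refines each embedding, and cosine
# similarity scores pairwise association. Purely illustrative.
import numpy as np

def message_pass(feats: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """One mean-aggregation layer: h_i' = mean of neighbour h_j."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return (adj @ feats) / deg

def association(h: np.ndarray, i: int, j: int) -> float:
    """Cosine similarity between two updated segment embeddings."""
    return float(h[i] @ h[j] /
                 (np.linalg.norm(h[i]) * np.linalg.norm(h[j])))
```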
Step 7: construct the association network.
Constructing the association network is divided into two steps (see the sketch after this list):
(1) Within each single dimension, construct the network association rules within each video segment according to Euclidean distance or Hamming distance, including the strength and direction between nodes.
(2) Combine the association networks of the three dimensions with each other to form a new directed association network.
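The following is a sketch of these two steps, assuming Euclidean distance, an illustrative threshold, and the networkx library; edge direction follows the temporal order of segments, and edge weight stands in for association strength.
```python
# Association-network sketch: within one dimension, link segments whose
# feature vectors are within a Euclidean threshold (weight = strength,
# direction = temporal order); then merge the three per-dimension
# graphs into one directed network. Threshold and library are
# illustrative assumptions.
import itertools
import networkx as nx
import numpy as np

def dimension_graph(feats: np.ndarray, label: str,
                    eps: float = 1.0) -> nx.DiGraph:
    g = nx.DiGraph()
    for i, j in itertools.combinations(range(len(feats)), 2):
        d = float(np.linalg.norm(feats[i] - feats[j]))
        if d < eps:  # earlier segment points to the later one
            g.add_edge(f"{label}{i+1}", f"{label}{j+1}",
                       weight=1.0 - d / eps)
    return g

def merged_network(graphs: list[nx.DiGraph]) -> nx.DiGraph:
    return nx.compose_all(graphs)   # one directed association network

# e.g. merged_network([dimension_graph(fx, "x"), dimension_graph(fy, "y"),
#                      dimension_graph(fz, "z")])
```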
Compared with the prior art, the invention has the following advantages:
(1) The invention combines the variation of pixels over time in the image sequence, the correlation between adjacent frames, and the correspondence between the previous frame and the current frame, achieving a good segmentation of the video in the image dimension and preserving the key information of the video.
(2) After segmenting the same video in the three dimensions of scene, sound and text, the invention matches the corresponding scene, sound and text by constructing paragraph association rules.
Brief description of the drawings
For a better understanding of the invention, it is further described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of the steps for establishing the paragraph association rule evaluation method based on multi-dimensional element video segmentation;
Fig. 2 is a schematic diagram of the key frame extraction methods;
Fig. 3 is a schematic diagram of the background difference detection method.
Specific embodiment
The invention is described in further detail below through an implementation case.
The technical solution is implemented in the following steps:
Step 1: video parsing.
The first step of video parsing is data reception: the video must be demultiplexed and decomposed into a picture track, an audio track and a subtitle track.
A traffic surveillance video from a location in Beijing, 1 minute 50 seconds long, is demultiplexed and decomposed into an image track, an audio track and a subtitle track; after decomposition, the audio track and the subtitle track are each 1 minute 50 seconds long.
Step 2: key frame extraction for scene segmentation.
Key frame extraction methods fall broadly into five classes; the specific methods are shown in Figure 2.
(1) Boundary-based key frame extraction. This method directly takes the first and last frames, or the middle frame, of each shot as key frames. Its computational cost is low, which suits shots whose content moves little or remains unchanged.
(2) Visual-feature-based key frame extraction. This method first selects the first frame as the most recent key frame; subsequent frames are then compared with it in turn on visual features including colour, motion, edges, shape and spatial relationships. As soon as the difference between the current frame and the most recent key frame exceeds a predetermined threshold, the current frame is chosen as a new key frame.
(3) Cluster-based key frame extraction. Methods of this kind apply clustering: all frames of a shot are clustered, the key clusters are chosen from these clusters according to some criterion, such as the number of frames in each cluster, and the frame with the smallest clustering parameter in each key cluster is then chosen as a key frame.
(4) Multi-mode key frame extraction. Methods of this kind mainly imitate human perception to simplify the analysis of video content, usually combining video, audio, text and so on. In videos such as films and sports broadcasts, for example, a scene switch often changes the video and audio content simultaneously, so a multi-mode extraction method is needed: when the audio and video features at a shot boundary both change greatly, that boundary marks a new scene.
(5) Compressed-domain key frame extraction. Compressed-domain methods extract key frames directly from an MPEG compressed video stream without decompressing it, or with only partial decompression, which reduces computational complexity.
In this example, the video is processed with the cluster-based key frame extraction method, and the key frames are clustered into 5 classes.
Step 3: scene segmentation based on key frames.
This mainly involves the following three detection approaches:
(1) Detection based on inter-frame difference. The frame difference method obtains the contour of a moving object by computing the difference between two adjacent frames in the video image sequence; it is well suited to scenes that contain multiple moving objects and a moving camera.
(2) Detection based on background difference. Background subtraction is a general method for motion segmentation of static scenes: the currently acquired image frame is differenced against a background image to obtain a grey-scale map of the moving target region, which is thresholded to extract the moving region. To avoid the influence of ambient lighting changes, the background image is updated according to the currently acquired frame. Details are shown in Figure 3.
(3) Detection based on optical flow. The optical flow method uses the variation of pixels over time in the image sequence and the correlation between adjacent frames to compute, from the correspondence between the previous frame and the current frame, the motion information of objects between adjacent frames.
(4) The segmented video can be represented as x1, …, xi, where x denotes a time period of the segmented video and i the number of video segments.
After key frame extraction, the video is segmented with the optical flow detection technique; the segmented video comprises 25 segments in total, namely x1, x2, …, x25.
Step 4: audio segmentation of the video.
The EMD-based audio segmentation method proceeds as follows:
(1) For the original audio data sequence X(t), determine all local maxima and fit them with a cubic spline function to form the upper envelope of the original data.
(2) Find all local minima and fit them with a cubic spline function to form the lower envelope of the data.
(3) Denote the mean of the upper and lower envelopes by m1, and subtract the mean envelope m1 from the original data sequence X(t) to obtain a new audio data sequence h1, as shown by the equation:
h1 = X(t) - m1
(4) Cluster and segment the audio data obtained from the EMD decomposition.
(5) The segmented audio can be represented as y1, …, yj, where y denotes a time period of the segmented audio and j the number of audio segments.
The local maxima contained in the original audio data sequence X(t) are 2.3, 2.1, 2, 1.9, 1.8, 1.7, 0.9 and 0.8; the local minima are -1.9, -2.1, -2.6, -3.0, 0, -1.0 and -0.5. The computed mean of the upper envelope is 1.6875 and the mean of the lower envelope is -1.586 (verified in the short check below). The number of audio segments after segmentation is 25, namely y1, y2, …, y25.
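As a quick check of the quoted means, assuming they are taken directly over the listed extrema values:
```python
# Verify the envelope means quoted above from the listed extrema.
maxima = [2.3, 2.1, 2, 1.9, 1.8, 1.7, 0.9, 0.8]
minima = [-1.9, -2.1, -2.6, -3.0, 0, -1.0, -0.5]
print(sum(maxima) / len(maxima))              # 1.6875
print(round(sum(minima) / len(minima), 3))    # -1.586
```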
Step 5: semantic segmentation of the video.
The semantic segmentation of paragraphs mainly comprises the following aspects:
(1) Define semantic chunks. A semantic chunk divides a sentence into several relatively independent semantic units whose length lies between the word sense and the sentence sense; it is a preprocessing device that combines syntax, semantics and pragmatics. Semantic chunks are mutually non-recursive, non-nested and non-overlapping.
(2) Divide the sentence sense. Natural language processing usually requires analysis at three levels: syntax, semantics and context. Statistical processing for text word segmentation and part-of-speech tagging is therefore performed first; after the words have been classified they are rapidly annotated, semantic recombination is then carried out on the words, and finally the sentence sense is segmented according to the semantic chunks defined above.
(3) The segmented paragraphs can be represented as z1, …, zk, where z denotes a time period of the segmented text and k the number of text segments.
The number of text segments after segmentation is 25, namely z1, z2, …, z25, with contents such as "turn right at the crossroad", "pedestrians halt" and "serious vehicle congestion".
Step 6: paragraph association rule evaluation of the segmented video with a GNN network.
A graph neural network (GNN, Graph Neural Network) can effectively model the relationships or interactions between objects. After the same video has been segmented in the three dimensions of scene, sound and paragraph as above, video segments covering different time periods are obtained; the segments produced in the three dimensions cannot be aligned perfectly, and overlaps occur. The invention therefore uses a GNN to evaluate the association between the video paragraphs obtained from the above segmentation. Let t denote each second of video; GNN(t|x), GNN(t|y) and GNN(t|z) denote the feature vectors currently extracted from the segments in each dimension.
Taking the feature vectors of the segments in each dimension at the 5 s mark, the scene feature vector is GNN(5|x1, x2, …, x25), the sound feature vector is GNN(5|y1, y2, …, y25), and the paragraph feature vector is GNN(5|z1, z2, …, z25).
Step 7: construct the association network.
Constructing the association network is divided into two steps.
(1) Within each single dimension, construct the network association rules within each video segment according to Euclidean distance or Hamming distance, including the strength and direction between nodes.
(2) Combine the association networks of the three dimensions with each other to form a new directed association network.

Claims (1)

1. A paragraph association rule evaluation method based on multi-dimensional element video segmentation, characterized by the following steps:
Step 1: video parsing.
The first step of video parsing is data reception: the video is demultiplexed and decomposed into a picture track, an audio track and a subtitle track.
Step 2: key frame extraction for scene segmentation.
Key frame extraction methods fall broadly into five classes; the specific methods are shown in Figure 2.
(1) Boundary-based key frame extraction. This method directly takes the first and last frames, or the middle frame, of each shot as key frames. Its computational cost is low, which suits shots whose content moves little or remains unchanged.
(2) Visual-feature-based key frame extraction. This method first selects the first frame as the most recent key frame; subsequent frames are then compared with it in turn on visual features including colour, motion, edges, shape and spatial relationships. As soon as the difference between the current frame and the most recent key frame exceeds a predetermined threshold, the current frame is chosen as a new key frame.
(3) Cluster-based key frame extraction. Methods of this kind apply clustering: all frames of a shot are clustered, the key clusters are chosen from these clusters according to some criterion, such as the number of frames in each cluster, and the frame with the smallest clustering parameter in each key cluster is then chosen as a key frame.
(4) Multi-mode key frame extraction. Methods of this kind mainly imitate human perception to simplify the analysis of video content, usually combining video, audio, text and so on. In videos such as films and sports broadcasts, for example, a scene switch often changes the video and audio content simultaneously, so a multi-mode extraction method is needed: when the audio and video features at a shot boundary both change greatly, that boundary marks a new scene.
(5) Compressed-domain key frame extraction. Compressed-domain methods extract key frames directly from an MPEG compressed video stream without decompressing it, or with only partial decompression, which reduces computational complexity.
Step 3: scene segmentation based on key frames.
This mainly involves the following three detection approaches:
(1) Detection based on inter-frame difference. The frame difference method obtains the contour of a moving object by computing the difference between two adjacent frames in the video image sequence; it is well suited to scenes that contain multiple moving objects and a moving camera.
(2) Detection based on background difference. Background subtraction is a general method for motion segmentation of static scenes: the currently acquired image frame is differenced against a background image to obtain a grey-scale map of the moving target region, which is thresholded to extract the moving region. To avoid the influence of ambient lighting changes, the background image is updated according to the currently acquired frame. Details are shown in Figure 3.
(3) Detection based on optical flow. The optical flow method uses the variation of pixels over time in the image sequence and the correlation between adjacent frames to compute, from the correspondence between the previous frame and the current frame, the motion information of objects between adjacent frames.
(4) The segmented video can be represented as x1, …, xi, where x denotes a time period of the segmented video and i the number of video segments.
Step 4: audio segmentation of the video.
The EMD-based audio segmentation method proceeds as follows:
(1) For the original audio data sequence X(t), determine all local maxima and fit them with a cubic spline function to form the upper envelope of the original data.
(2) Find all local minima and fit them with a cubic spline function to form the lower envelope of the data.
(3) Denote the mean of the upper and lower envelopes by m1, and subtract the mean envelope m1 from the original data sequence X(t) to obtain a new audio data sequence h1, as shown by the equation:
h1 = X(t) - m1
(4) Cluster and segment the audio data obtained from the EMD decomposition.
(5) The segmented audio can be represented as y1, …, yj, where y denotes a time period of the segmented audio and j the number of audio segments.
Step 5: semantic segmentation of the video.
The semantic segmentation of paragraphs mainly comprises the following aspects:
(1) Define semantic chunks. A semantic chunk divides a sentence into several relatively independent semantic units whose length lies between the word sense and the sentence sense; it is a preprocessing device that combines syntax, semantics and pragmatics. Semantic chunks are mutually non-recursive, non-nested and non-overlapping.
(2) Divide the sentence sense. Natural language processing usually requires analysis at three levels: syntax, semantics and context. Statistical processing for text word segmentation and part-of-speech tagging is therefore performed first; after the words have been classified they are rapidly annotated, semantic recombination is then carried out on the words, and finally the sentence sense is segmented according to the semantic chunks defined above.
(3) The segmented paragraphs can be represented as z1, …, zk, where z denotes a time period of the segmented text and k the number of text segments.
Step 6: paragraph association rule evaluation of the segmented video with a GNN network.
A graph neural network (GNN, Graph Neural Network) can effectively model the relationships or interactions between objects. After the same video has been segmented in the three dimensions of scene, sound and paragraph as above, video segments covering different time periods are obtained; the segments produced in the three dimensions cannot be aligned perfectly, and overlaps occur. A GNN is therefore used to evaluate the association between the video paragraphs obtained from the above segmentation. Let t denote each second of video; GNN(t|x), GNN(t|y) and GNN(t|z) denote the feature vectors currently extracted from the segments in each dimension.
Step 7: construct the association network.
Constructing the association network is divided into two steps.
(1) Within each single dimension, construct the network association rules within each video segment according to Euclidean distance or Hamming distance, including the strength and direction between nodes.
(2) Combine the association networks of the three dimensions with each other to form a new directed association network.
CN201910395119.6A 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation Active CN110097026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910395119.6A CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910395119.6A CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Publications (2)

Publication Number Publication Date
CN110097026A true CN110097026A (en) 2019-08-06
CN110097026B CN110097026B (en) 2021-04-27

Family

ID=67447957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910395119.6A Active CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Country Status (1)

Country Link
CN (1) CN110097026B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310731A1 (en) * 2007-06-18 2008-12-18 Zeitera, Llc Methods and Apparatus for Providing a Scalable Identification of Digital Video Sequences
CN106780503A (en) * 2016-12-30 2017-05-31 北京师范大学 Remote sensing images optimum segmentation yardstick based on posterior probability information entropy determines method
CN109344780A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Multi-modal video scene segmentation method based on sound and vision
CN109711379A (en) * 2019-01-02 2019-05-03 电子科技大学 Candidate region extraction and recognition method for traffic lights in complex environments

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126262A (en) * 2019-12-24 2020-05-08 中国科学院自动化研究所 Video highlight detection method and device based on graph neural network
CN111126262B (en) * 2019-12-24 2023-04-28 中国科学院自动化研究所 Video highlight detection method and device based on graph neural network
CN111586494A (en) * 2020-04-30 2020-08-25 杭州慧川智能科技有限公司 Intelligent strip splitting method based on audio and video separation
CN111586494B (en) * 2020-04-30 2022-03-11 腾讯科技(深圳)有限公司 Intelligent strip splitting method based on audio and video separation
CN113810782A (en) * 2020-06-12 2021-12-17 阿里巴巴集团控股有限公司 Video processing method and device, server and electronic device
CN111914118A (en) * 2020-07-22 2020-11-10 珠海大横琴科技发展有限公司 Video analysis method, device and equipment based on big data and storage medium
CN113470048A (en) * 2021-07-06 2021-10-01 北京深睿博联科技有限责任公司 Scene segmentation method, device, equipment and computer readable storage medium
CN115665359A (en) * 2022-10-09 2023-01-31 西华县环境监察大队 Intelligent compression method for environmental monitoring data
CN115905584A (en) * 2023-01-09 2023-04-04 共道网络科技有限公司 Video splitting method and device

Also Published As

Publication number Publication date
CN110097026B (en) 2021-04-27

Similar Documents

Publication Publication Date Title
CN110097026A (en) A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation
CN110197135A (en) A kind of video structural method based on multidimensional segmentation
Zellers et al. Neural motifs: Scene graph parsing with global context
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
CN107358195B (en) Non-specific abnormal event detection and positioning method based on reconstruction error and computer
CN107704862A (en) A kind of video picture segmentation method based on semantic instance partitioning algorithm
CN103929685A (en) Video abstract generating and indexing method
CN112668559A (en) Multi-mode information fusion short video emotion judgment device and method
CN103235944A (en) Crowd flow division and crowd flow abnormal behavior identification method
CN109803112A (en) Video analysis management method based on big data, apparatus and system, storage medium
CN110705412A (en) Video target detection method based on motion history image
CN109948721A (en) A kind of video scene classification method based on video presentation
CN113792606B (en) Low-cost self-supervision pedestrian re-identification model construction method based on multi-target tracking
Avrithis et al. Broadcast news parsing using visual cues: A robust face detection approach
US20160307044A1 (en) Process for generating a video tag cloud representing objects appearing in a video content
Wang et al. Intermediate fused network with multiple timescales for anomaly detection
Ul Haq et al. An effective video summarization framework based on the object of interest using deep learning
Xu et al. Violent video classification based on spatial-temporal cues using deep learning
Hu et al. AVMSN: An audio-visual two stream crowd counting framework under low-quality conditions
Tao et al. CENet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions
CN112989950A (en) Violent video recognition system oriented to multi-mode feature semantic correlation features
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
Nandini et al. Automatic traffic control system using PCA based approach
Boufares et al. Moving object detection system based on the modified temporal difference and otsu algorithm
CN109977891A (en) A kind of object detection and recognition method neural network based

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant