CN111008978B - Video scene segmentation method based on deep learning - Google Patents

Video scene segmentation method based on deep learning

Info

Publication number
CN111008978B
CN111008978B
Authority
CN
China
Prior art keywords
frame
background
similarity
image
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911239331.XA
Other languages
Chinese (zh)
Other versions
CN111008978A (en)
Inventor
代成
刘欣刚
李辰奇
倪铭昊
韩硕
曾昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911239331.XA priority Critical patent/CN111008978B/en
Publication of CN111008978A publication Critical patent/CN111008978A/en
Application granted granted Critical
Publication of CN111008978B publication Critical patent/CN111008978B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/62Analysis of geometric attributes of area, perimeter, diameter or volume
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video scene segmentation method based on deep learning, belonging to the technical field of video scene segmentation. First, the video data to be segmented is converted into frame images, and target detection based on a deep learning algorithm is performed to obtain background candidate frames for each frame image; key background candidate frames are then selected for each frame image; based on the position information of each key background candidate frame, the corresponding background candidate frame is determined on the adjacent subsequent image frame; finally, the joint similarity of adjacent image frames is calculated, and if it falls below a similarity threshold, the video data to be segmented is cut at the frame position of the current adjacent frames. The method judges the similarity of video background information while automatically extracting local background regions, avoids the excessive algorithmic complexity of traditional methods, and achieves background-based segmentation in complex scenes.

Description

Video scene segmentation method based on deep learning
Technical Field
The invention relates to the technical field of video scene segmentation, in particular to a video scene segmentation method based on deep learning.
Background
With the rapid development of multimedia technology, video has become an important information transmission medium in people's daily lives. In recent years the amount of video data has grown explosively. While massive video data enriches people's work, study and life, its storage, management and retrieval have become the basis for using it efficiently, and in the big data era, accurately classifying and retrieving videos is a major challenge. Since video scene segmentation is important for identifying video data more flexibly and efficiently in video retrieval research, accurate scene segmentation has attracted increasing attention from researchers.
The main objective of scene segmentation is to accurately measure scene similarity and cut the video where scenes differ clearly. Traditional algorithms based on hand-crafted features suffer from heavy feature engineering, high computational complexity and low accuracy, so they cannot meet current real-time segmentation requirements; a new method is therefore needed to solve the video background segmentation problem more intelligently.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a more accurate and more convenient video background segmentation method for massive data in complex scenes.
The invention discloses a video scene segmentation method based on deep learning, which comprises the following steps:
step S1: image preprocessing: converting video data to be segmented into frame images;
For example, the video data to be segmented (a video frame sequence to be segmented) is sampled at fixed intervals to obtain a frame image sequence (a sampling sketch is given after these steps);
step S2: background candidate box identification:
based on a preset target object, performing target detection processing on each frame image by adopting a target detection algorithm, namely Faster R-CNN, generating candidate frames of the target object, and labeling the coordinate information of the candidate frames;
performing target object identification on each candidate frame, and retaining the candidate frames that contain no target object as background candidate frames of the frame image;
step S3: selecting key background candidate frames for the frame image:
step S31: screening out background candidate frames with the areas smaller than a preset area threshold;
step S32: and (3) screening out background candidate frames with high overlapping degree: when the overlapping degree of the two overlapped background candidate frames is larger than a preset overlapping degree threshold value, deleting the smaller one of the two overlapped background candidate frames;
wherein the overlap degree is calculated as:
Overlap(B-box_i, B-box_j) = Area(B-box_i ∩ B-box_j) / min(Area(B-box_i), Area(B-box_j))
wherein Area represents the area, and B-box_i and B-box_j represent the two overlapping background candidate frames, i and j being background candidate frame identifiers;
taking the currently remaining background candidate frames as key background candidate frames;
step S4: determining a background candidate frame corresponding to the position information on an adjacent subsequent image frame of the image frame where the key background candidate frame is located, based on the position information of the key background candidate frame;
step S5: calculating the similarity of adjacent image frames:
taking the position area of the key background candidate frame or the background candidate frame as a background area;
taking the key background candidate frame of the previous image frame obtained in step S4 and the corresponding background candidate frame on the adjacent next image frame as similarity calculation objects of the background areas at the same position of the adjacent image frames;
respectively calculating the structural similarity and the histogram similarity of the similarity calculation object;
setting a weight value w_i for each background region as:
w_i = A_i / Σ_{j=1..N} A_j
wherein A_i represents the area of the i-th background region and N represents the number of background regions contained in the frame image;
and calculating the joint similarity of the adjacent image frames according to the formula
Sim = 2 × SSIM × Hist / (SSIM + Hist)
wherein SSIM = Σ_{i=1..N} w_i × SSIM_i and Hist = Σ_{i=1..N} w_i × Hist_i, and SSIM_i and Hist_i respectively denote the structural similarity and the histogram similarity of the i-th background region between the two adjacent frame images;
step S6: video scene segmentation:
if the joint similarity is lower than a preset similarity threshold, the video data to be segmented is cut at the frame position of the current adjacent frames, so that the video data to be segmented is divided into multiple sub-video segments, each sub-video segment corresponding to one class of scene.
For example, in a frame image sequence obtained by fixed-interval sampling, adjacent sampled frames are not adjacent in the original video data; a certain number of original video frames lie between them, and any position between the two sampled frames may be chosen as the segmentation point. That is, when two adjacent sampled frames have a joint similarity below the preset similarity threshold, they are assigned to different scene classes: the previous frame image corresponds to one class of scene and the next frame image to another.
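As a minimal illustration of the fixed-interval sampling mentioned above, the following OpenCV sketch keeps every fifth frame (the stride of 5 follows the embodiment described later; the function name is an assumption for illustration only):

```python
import cv2

def sample_frames(video_path, step=5):
    """Read a video and keep every `step`-th frame as the working frame sequence."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # convert BGR (OpenCV default) to RGB for the later detection step
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```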
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the target detection under the complex scene can be learned through a deep learning technology, and a local background candidate frame is obtained. And then labeling the corresponding coordinates of the candidate frame of the adjacent frame images, and through the weighted comparison of the structural similarity SSIM and the histogram similarity Hist of the local region of the image, the complexity of the algorithm can be reduced, and meanwhile, the feature region based on deep learning has universality and higher segmentation accuracy compared with the traditional manual region labeling.
Drawings
FIG. 1 is a schematic diagram of a specific implementation process in an embodiment;
FIG. 2 is a schematic diagram of tensor modeling in an example.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
The invention discloses a video scene segmentation method based on deep learning, which comprises the following steps:
s1: image preprocessing, converting video data into frame images: namely, a conventional video frame extraction mode is adopted to complete the conversion from a video to a corresponding frame, so as to obtain a frame image to be processed;
s2: identifying a background area, determining a target object in the frame image by using a target detection algorithm, namely a Faster R-CNN algorithm, and further determining a background candidate frame of the frame image:
firstly, a CNN + RPN network (a convolutional neural network + a region generation network) is adopted to generate a candidate frame, namely a candidate region frame, and coordinate information of the candidate frame is labeled;
performing classification regression on the content features in the candidate frames so as to realize object target identification;
the candidate frames that contain no target object are retained as background candidate frames, and the coordinates of the background candidate frames of the frame image are obtained (the position area where a background candidate frame is located is the background area).
The Faster R-CNN algorithm is described in the literature "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks".
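As a rough illustration of this step, the sketch below uses a COCO-pretrained Faster R-CNN from torchvision as a stand-in for the VGG-16 model trained on VOC2007 described in the embodiment, and simply treats low-confidence detection regions as object-free background candidates; the function name and the score threshold are assumptions for illustration only:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# stand-in detector; the patent uses its own VGG-16/VOC2007-trained Faster R-CNN
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def background_candidate_boxes(frame_rgb, score_thresh=0.5):
    """Return detector boxes whose object score is low, used here as a crude
    proxy for candidate regions that contain no target object."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    keep = out["scores"] < score_thresh           # low confidence -> no clear object
    return out["boxes"][keep].cpu().numpy()       # rows of (x_left, y_top, x_right, y_bottom)
```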
S3: selecting a key background candidate box for each frame of image in the video:
The background candidate frames are evaluated by region area and quantified by a region-overlap detection function; heavily overlapped frames and frames with small areas are deleted, so that effective background candidate frames, namely key background candidate frames, are selected.
S31: screening out background candidate frames with small areas: a background candidate frame is ignored when its area falls below a certain threshold. The area of the i-th background candidate frame is
A_i = (x_r^i - x_l^i) × (y_d^i - y_u^i)
wherein x_l^i and x_r^i denote the left and right abscissas of the i-th background candidate frame, y_u^i and y_d^i denote its upper and lower ordinates, and A_i denotes its area.
S32: screening out background candidate frames with a high overlap degree: when two background candidate frames overlap heavily, the smaller of the two is deleted. The overlap detection function is
Overlap(B-box_i, B-box_j) = Area(B-box_i ∩ B-box_j) / min(Area(B-box_i), Area(B-box_j))
wherein Area denotes the area, and B-box_i and B-box_j denote the i-th and j-th background candidate frames.
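A minimal sketch of this selection step, assuming boxes are given as (x_left, y_top, x_right, y_bottom) tuples; the thresholds follow the values used later in the embodiment (area 800, overlap 70%):

```python
def overlap_degree(a, b):
    """Intersection area divided by the smaller box area, as in the formula above."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / min(area_a, area_b)

def select_key_background_boxes(boxes, area_thresh=800, overlap_thresh=0.7):
    # S31: drop boxes whose area is below the threshold
    boxes = [b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= area_thresh]
    # S32: when two boxes overlap too much, keep the larger and drop the smaller
    boxes = sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    kept = []
    for b in boxes:
        if all(overlap_degree(b, k) <= overlap_thresh for k in kept):
            kept.append(b)
    return kept
```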
S4: extracting the characteristics of the background candidate frames, namely extracting the background candidate frames of the corresponding areas of the adjacent frames according to the coordinates;
Extracting the coordinates of the corresponding points of each key background candidate frame, and finding the corresponding background candidate frame in the adjacent subsequent frame according to the extracted coordinates.
S5: comparison of background frame similarity: the corresponding background regions of adjacent frames are weighted using a combined structural similarity (SSIM) and histogram similarity (Hist) algorithm to complete the comparison of adjacent-frame background similarity.
Referring to fig. 2, the specific calculation method of the structural similarity SSIM is as follows:
SSIM(x,y)=L(x,y)×C(x,y)×S(x,y)
wherein L(x,y), C(x,y) and S(x,y) are the luminance, contrast and structure comparison functions of the two images respectively, and SSIM(x,y) is the structural similarity of the two images.
The specific calculation formulas of L (x, y), C (x, y) and S (x, y) are as follows:
(1) L(x,y) = (2 × u_x × u_y + C_1) / (u_x^2 + u_y^2 + C_1)
wherein u_x and u_y denote the pixel mean values of images x and y respectively, with u_x = (1/N) × Σ_{i=1..N} x_i, where x_i is the i-th pixel value of image x and N is the number of pixels (u_y is computed in the same way); C_1 is a constant used to avoid a zero denominator, usually C_1 = (K_1 × L)^2 with K_1 = 0.01 and L = 255.
(2) C(x,y) = (2 × σ_x × σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
wherein σ_x and σ_y denote the pixel standard deviations of images x and y, with σ_x = sqrt( (1/(N-1)) × Σ_{i=1..N} (x_i - u_x)^2 ) (σ_y is computed in the same way); C_2 = (K_2 × L)^2 with K_2 = 0.03 and L = 255.
(3) S(x,y) = (σ_xy + C_3) / (σ_x × σ_y + C_3)
wherein σ_xy denotes the pixel covariance of images x and y, σ_xy = (1/(N-1)) × Σ_{i=1..N} (x_i - u_x) × (y_i - u_y), u_y = (1/N) × Σ_{i=1..N} y_i is the pixel mean of image y, and C_3 is a small stabilizing constant, commonly taken as C_2/2.
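The three comparison functions above can be transcribed directly into code. The sketch below computes SSIM over a whole (equal-sized, grayscale) background patch rather than over sliding windows, which matches the per-region usage in this method; the choice C_3 = C_2/2 is the common convention and an assumption here:

```python
import numpy as np

K1, K2, L = 0.01, 0.03, 255.0
C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
C3 = C2 / 2.0          # common convention; not spelled out in the text above

def ssim(x, y):
    """Structural similarity of two equal-sized grayscale patches x, y."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    ux, uy = x.mean(), y.mean()                        # pixel means u_x, u_y
    sx, sy = x.std(ddof=1), y.std(ddof=1)              # standard deviations sigma_x, sigma_y
    sxy = ((x - ux) * (y - uy)).sum() / (x.size - 1)   # covariance sigma_xy
    luminance = (2 * ux * uy + C1) / (ux ** 2 + uy ** 2 + C1)
    contrast  = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)
    structure = (sxy + C3) / (sx * sy + C3)
    return luminance * contrast * structure
```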
the specific calculation formula of the histogram similarity Hist is as follows:
Figure BDA0002305777480000051
wherein the content of the first and second substances,
Figure BDA0002305777480000052
the ith number of the histogram of the image x, y is shown, and N is the number of all the numbers contained in the histogram.
When the structural similarity SSIM and the histogram similarity Hist are processed jointly, a weight value is first set for each background frame, the weighted averages of the two kinds of similarity over all background frames are then calculated, and the two weighted averages are combined to obtain the final similarity metric, namely the joint similarity.
The weight value w_i of each background frame is
w_i = A_i / Σ_{j=1..N} A_j
wherein A_i denotes the area of the i-th background frame and N is the number of background frames.
The joint similarity is
Sim = 2 × SSIM × Hist / (SSIM + Hist)
wherein SSIM = Σ_{i=1..N} w_i × SSIM_i and Hist = Σ_{i=1..N} w_i × Hist_i, and SSIM_i and Hist_i denote respectively the structural similarity and the histogram similarity of the i-th background frame between the two adjacent frame images.
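Putting the pieces together, a sketch of the joint similarity for one pair of adjacent frames, reusing the `ssim` and `hist_similarity` sketches above; the matched background patches are assumed to be cropped to identical sizes:

```python
import numpy as np

def joint_similarity(regions_prev, regions_next):
    """regions_prev/regions_next: matched background patches from two adjacent frames,
    given in the same order; each pair must have identical shapes for ssim()."""
    areas = np.array([p.shape[0] * p.shape[1] for p in regions_prev], dtype=np.float64)
    w = areas / areas.sum()                                   # w_i = A_i / sum_j A_j
    ssim_total = sum(wi * ssim(p, q)
                     for wi, p, q in zip(w, regions_prev, regions_next))
    hist_total = sum(wi * hist_similarity(p, q)
                     for wi, p, q in zip(w, regions_prev, regions_next))
    # harmonic-mean combination of the two weighted similarities
    return 2.0 * ssim_total * hist_total / (ssim_total + hist_total)
```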
S6: and (4) segmenting a video scene.
According to the result of the scene similarity comparison, if the similarity is lower than the threshold, the adjacent frame images are only weakly related and do not belong to the same scene class, and the video is cut at the frame position of the current adjacent frames, i.e. the video is segmented into different shot paragraphs.
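As a sketch of this segmentation decision, assuming a per-pair similarity function built from the steps above (the threshold value 0.6 is an assumption; the text only states that a preset threshold is used):

```python
def find_scene_cuts(frames, sim_fn, sim_thresh=0.6):
    """frames: sampled frame images; sim_fn(prev, nxt) returns their joint similarity.
    Returns the indices (in the sampled sequence) where a new scene starts."""
    cuts = []
    for k in range(len(frames) - 1):
        if sim_fn(frames[k], frames[k + 1]) < sim_thresh:
            cuts.append(k + 1)       # cut between sampled frame k and k+1
    return cuts
```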
Examples
The video scene segmentation method of the invention is applied in a video processing application and implements a video segmentation algorithm based on an improved Faster R-CNN network. Referring to FIG. 1, the specific implementation process is as follows:
S1: image preprocessing, i.e. converting the video data into frame images. In this embodiment most processed videos are short video files of 1.5 to 3 minutes; at 24 frames per second that is approximately 2160 to 4320 frames. To reduce the amount of computation and increase speed, this embodiment samples video frames at equal intervals with a stride of 5 frames. The frame count of a single video is thus reduced to 432-864, while the continuity of the original video is preserved and information loss from excessively large content changes between sampled frames is avoided.
S2: target identification, namely marking a target object in a video by using a Faster R-CNN algorithm;
the Faster R-CNN model is mainly composed of 4 parts.
Firstly, the convolution layer performs feature extraction on an input picture frame;
secondly, the extracted feature map enters an RPN (Region Proposal Network) Network to generate 300 candidate Region frames;
thirdly, converting the candidate region frame into a feature of fixed length through RoI (Region of Interest) pooling;
finally, regression and classification are carried out on each candidate region frame, and the object in the candidate region and the accurate coordinates of the region are output.
In this embodiment, a VGG-16 CNN is used for feature extraction, and the model is trained on the VOC2007 dataset, so that 21 classes (20 object classes plus background) can be distinguished. If a region frame contains detected objects, it is regarded as foreground and removed. A certain number of the remaining region frames are then kept as background candidate region frames (background frames); 20 region frames are kept in this embodiment.
S3: selecting key background regions. The background regions are evaluated by area and quantified by the region-overlap detection function; heavily overlapped frames and small-area frames are deleted to select effective background frames. Experiments show that the distribution of background region frames works best when the region area is larger than 800, so regions smaller than 800 are regarded as small regions. Meanwhile, if the overlapping area of two regions exceeds 70% of the area of the smaller region, the smaller region is removed.
S4: extracting candidate frame region features: the background regions of corresponding areas in adjacent frames are extracted according to the coordinates, and the regions are cropped from the two adjacent frame images.
S5: comparison of background frame similarity: the corresponding background regions of adjacent frames are weighted using a combined structural similarity (SSIM) and histogram similarity (Hist) algorithm to complete the comparison of adjacent-frame background similarity. For two adjacent frames, SSIM and histogram similarity are computed once for each pair of corresponding background regions. Each region is then assigned a weight according to its area ratio, and the two indices are weighted and summed separately to obtain the overall SSIM and histogram similarity of the two images. Finally, the two similarities are combined into a new similarity index by the harmonic mean, which is used to judge scene changes and segment the video.
The video scene segmentation algorithm based on background-region similarity uses the deep-learning Faster R-CNN model to select and extract the background content of video frames. Video segmentation tests were performed on 16 short videos of 4 types (sports, movies, news and daily life) collected from the web, with F-score used to judge segmentation accuracy. For sports videos, whose scene changes are complicated, the accuracy of the algorithm averages 80.4%, whereas a comparable method without deep learning reaches only 64.8%. The other three video types have simpler scenes and higher accuracy: 93.7% for movie videos, 93.0% for news videos and 98.1% for daily-life videos, compared with only 70.5%, 71.4% and 80.0% respectively when the deep learning model is not used. The experimental results show that selecting background content with deep learning and then comparing similarity effectively improves segmentation accuracy and has very good application prospects.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (4)

1. The video scene segmentation method based on deep learning is characterized by comprising the following steps of:
step S1: image preprocessing: converting video data to be segmented into frame images;
step S2: background candidate box identification:
determining a target object in the frame image by using a target detection algorithm, namely a Faster R-CNN algorithm, and further determining a background candidate frame of the frame image:
firstly, a CNN + RPN network (a convolutional neural network + a region generation network) is adopted to generate a candidate frame, namely a candidate region frame, and coordinate information of the candidate frame is labeled;
performing classification regression on the content features in the candidate frames, thereby realizing object target identification;
screening out candidate frames without target objects in the candidate frames to obtain coordinates of background candidate frames of the frame images;
and step S3: selecting key background candidate frames for the frame image:
step S31: screening out background candidate frames with the areas smaller than a preset area threshold;
step S32: and (3) screening out background candidate frames with high overlapping degree: when the overlapping degree of the two overlapped background candidate frames is larger than a preset overlapping degree threshold value, deleting the smaller one of the two overlapped background candidate frames;
wherein the overlap degree is calculated as:
Overlap(B-box_i, B-box_j) = Area(B-box_i ∩ B-box_j) / min(Area(B-box_i), Area(B-box_j))
wherein Area represents the area, and B-box_i and B-box_j represent the two overlapping background candidate frames, i and j being background candidate frame identifiers;
taking the current residual background candidate frame as a key background candidate frame;
and step S4: determining a background candidate frame corresponding to the position information on an adjacent subsequent image frame of the image frame where the key background candidate frame is located based on the position information of the key background candidate frame;
step S5: calculating the similarity of adjacent image frames:
taking the position area of the key background candidate frame or the background candidate frame as a background area;
taking the key background candidate frame of the previous image frame obtained in step S4 and the corresponding background candidate frame on the adjacent next image frame as similarity calculation objects of the background areas at the same position of the adjacent image frames;
calculating structural similarity and histogram similarity of the similarity calculation objects respectively;
setting a weight value w_i for each background region as:
w_i = A_i / Σ_{j=1..N} A_j
wherein A_i represents the area of the i-th background region and N represents the number of background regions contained in the frame image;
and calculating the joint similarity of adjacent image frames according to the formula
Sim = 2 × SSIM × Hist / (SSIM + Hist)
wherein SSIM = Σ_{i=1..N} w_i × SSIM_i and Hist = Σ_{i=1..N} w_i × Hist_i, and SSIM_i and Hist_i respectively denote the structural similarity and the histogram similarity of the i-th background region between the two adjacent frame images;
step S6: video scene segmentation:
and if the joint similarity is lower than a preset similarity threshold, performing video segmentation on the video data to be segmented based on the frame position of the current adjacent frame.
2. The method of claim 1, wherein in step S1, a segment of video data to be segmented is frame image sampled at regular intervals to obtain a sequence of frame images.
3. The method of claim 1, wherein the area threshold is set at 800.
4. The method of claim 1, wherein the overlap threshold is set at 70%.
CN201911239331.XA 2019-12-06 2019-12-06 Video scene segmentation method based on deep learning Active CN111008978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911239331.XA CN111008978B (en) 2019-12-06 2019-12-06 Video scene segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911239331.XA CN111008978B (en) 2019-12-06 2019-12-06 Video scene segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111008978A CN111008978A (en) 2020-04-14
CN111008978B true CN111008978B (en) 2022-10-14

Family

ID=70114962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911239331.XA Active CN111008978B (en) 2019-12-06 2019-12-06 Video scene segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111008978B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950425B (en) * 2020-08-06 2024-05-10 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
CN112601068B (en) * 2020-12-15 2023-01-24 山东浪潮科学研究院有限公司 Video data augmentation method, device and computer readable medium
CN112689200B (en) * 2020-12-15 2022-11-11 万兴科技集团股份有限公司 Video editing method, electronic device and storage medium
CN113709584A (en) * 2021-03-05 2021-11-26 腾讯科技(北京)有限公司 Video dividing method, device, server, terminal and storage medium
CN113923378B (en) * 2021-09-29 2024-03-19 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114372994B (en) * 2022-01-10 2022-07-22 北京中电兴发科技有限公司 Method for generating background image in video concentration

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7020351B1 (en) * 1999-10-08 2006-03-28 Sarnoff Corporation Method and apparatus for enhancing and indexing video and audio signals
CN104867161A (en) * 2015-05-14 2015-08-26 国家电网公司 Video-processing method and device
CN108537134A (en) * 2018-03-16 2018-09-14 北京交通大学 A kind of video semanteme scene cut and mask method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6404925B1 (en) * 1999-03-11 2002-06-11 Fuji Xerox Co., Ltd. Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
CN100495438C (en) * 2007-02-09 2009-06-03 南京大学 Method for detecting and identifying moving target based on video monitoring
CN101577824B (en) * 2009-06-12 2011-01-19 西安理工大学 Method for extracting compressed domain key frame based on similarity of adjacent I frame DC image
CN102129688B (en) * 2011-02-24 2012-09-05 哈尔滨工业大学 Moving target detection method aiming at complex background
CN103400155A (en) * 2013-06-28 2013-11-20 西安交通大学 Pornographic video detection method based on semi-supervised learning of images
CN106683086B (en) * 2016-12-23 2018-02-27 深圳市大唐盛世智能科技有限公司 The background modeling method and device of a kind of intelligent video monitoring
CN106875406B (en) * 2017-01-24 2020-04-14 北京航空航天大学 Image-guided video semantic object segmentation method and device
CN107563345B (en) * 2017-09-19 2020-05-22 桂林安维科技有限公司 Human body behavior analysis method based on space-time significance region detection
CN110175591B (en) * 2019-05-31 2021-06-22 中科软科技股份有限公司 Method and system for obtaining video similarity
CN110427807B (en) * 2019-06-21 2022-11-15 诸暨思阔信息科技有限公司 Time sequence event action detection method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7020351B1 (en) * 1999-10-08 2006-03-28 Sarnoff Corporation Method and apparatus for enhancing and indexing video and audio signals
CN104867161A (en) * 2015-05-14 2015-08-26 国家电网公司 Video-processing method and device
CN108537134A (en) * 2018-03-16 2018-09-14 北京交通大学 A kind of video semanteme scene cut and mask method

Also Published As

Publication number Publication date
CN111008978A (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN111008978B (en) Video scene segmentation method based on deep learning
CN108562589B (en) Method for detecting surface defects of magnetic circuit material
CN108846446B (en) Target detection method based on multi-path dense feature fusion full convolution network
CN109086777B (en) Saliency map refining method based on global pixel characteristics
CN113112519A (en) Key frame screening method based on interested target distribution
Asha et al. Content based video retrieval using SURF descriptor
CN108647703B (en) Saliency-based classification image library type judgment method
CN115240024A (en) Method and system for segmenting extraterrestrial pictures by combining self-supervised learning and semi-supervised learning
CN105825201A (en) Moving object tracking method in video monitoring
Yang et al. Edge computing-based real-time passenger counting using a compact convolutional neural network
CN108664968B (en) Unsupervised text positioning method based on text selection model
CN110765314A (en) Video semantic structural extraction and labeling method
CN106066887B (en) A kind of sequence of advertisements image quick-searching and analysis method
Li et al. An efficient self-learning people counting system
CN114758135A (en) Unsupervised image semantic segmentation method based on attention mechanism
CN115457620A (en) User expression recognition method and device, computer equipment and storage medium
Liu et al. [Retracted] Mean Shift Fusion Color Histogram Algorithm for Nonrigid Complex Target Tracking in Sports Video
Yu et al. Automatic image captioning system using integration of N-cut and color-based segmentation method
Chatur et al. A simple review on content based video images retrieval
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)
CN110580503A (en) AI-based double-spectrum target automatic identification method
CN109800818A (en) A kind of image meaning automatic marking and search method and system
Mu et al. Automatic video object segmentation using graph cut
Hao et al. Video summarization based on sparse subspace clustering with automatically estimated number of clusters
Zhu et al. [Retracted] Basketball Object Extraction Method Based on Image Segmentation Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant