Disclosure of Invention
The invention aims to provide a video temporal unit segmentation method with strong anti-interference performance, low computational complexity, a small amount of computation, and accurate segmentation.
In order to solve the above technical problems, the technical solution adopted by the invention is a video temporal unit segmentation method, characterized by comprising the following steps:
extracting a horizontal space-time slice at the center position of the video caption;
calculating the minimum space-time semantic information quantity MSSI of each video frame according to the extracted horizontal space-time slices;
detecting abrupt changes in the minimum spatio-temporal semantic information quantity MSSI of the video;
and segmenting the video into temporal units, taking the abrupt changes of the minimum spatio-temporal semantic information quantity MSSI as boundaries.
A further technical solution is that, for a video V(x, y, t), the horizontal spatio-temporal slice S at the video subtitle center position can be represented as:

S(j, i) = V(j, y_c, i)  (1)

where V(j, y_c, i) is the pixel of video V at position x = j, t = i, and y = y_c, the vertical midpoint of the subtitle region; j ∈ [1, W] and i ∈ [1, L], where W denotes the width of a video frame and L denotes the length of the video in frames.
A further technical solution is that the method further comprises preprocessing the horizontal spatio-temporal slice S, as follows:
The preprocessing uses an adaptive Gaussian mixture background model: each column of the horizontal spatio-temporal slice S is taken as an input to the Gaussian model, and the model parameters are updated column by column. The update formulas for the Gaussian mean μ and variance δ² are:

μ_{t+1} = (1 − α) μ_t + α I_{t+1}
δ²_{t+1} = (1 − α) δ²_t + α (I_{t+1} − μ_{t+1})²  (2)

where I_{t+1} is the luminance of the (t+1)-th column in the spatio-temporal slice S and α is the correction rate, defined as:

α = 1 / M_n  (3)

where M_n is the number of matches.
Each pixel of the spatio-temporal slice S is then tested for whether it obeys the N(μ, δ²) background distribution, and the foreground caption is calculated from equation (4):
The subtitles on the horizontal spatio-temporal slice S are separated from the background as foreground according to equation (4). The minimum spatio-temporal semantic information quantity MSSI of the i-th frame in the video V(x, y, t) can then be calculated by the following formula:
where τ measures the minimum spatio-temporal semantic information quantity of a single pixel; pixels whose MSSI falls below τ are treated as interference and removed.
A further technical solution is that a video temporal unit boundary produces an abrupt change of MSSI; denoting the abrupt change as Δ, Δ can be calculated according to equation (7) as follows:
As can be seen from equation (7), Δ covers both the case where MSSI suddenly increases and the case where it suddenly decreases, and both correspond to boundaries of video temporal units. The boundary function B of a video temporal unit is defined as:
where w_0 is the significance threshold on the degree of the MSSI abrupt change between the current caption frame and the previous caption frame;
a B-function curve is obtained by calculation according to formula (8); the peaks of the curve correspond to the boundaries of video temporal units, and the video can be segmented into temporal units according to the B-function curve.
The beneficial effects produced by the above technical solution are as follows: the method defines the video caption as the sub-shot unit with minimum semantic meaning and correspondingly maps abrupt changes of the Minimum Spatio-temporal Semantic Information quantity (MSSI) to the boundaries of video temporal units. Because only one row of pixels is extracted from the video for detection, the method, compared with the reference methods, has strong anti-interference performance, low computational complexity, a small amount of computation, and a clear advantage in computation time.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, the present invention may be practiced in ways other than those specifically described here, as will be readily apparent to those of ordinary skill in the art, without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.
Generally, as shown in fig. 3, the present invention discloses a video time domain unit segmentation method, which comprises the following steps:
firstly, extracting a horizontal space-time slice at the center position of a video caption;
secondly, calculating the minimum space-time semantic information quantity MSSI of each video frame according to the extracted horizontal space-time slices;
then, detecting abrupt changes in the minimum spatio-temporal semantic information quantity MSSI of the video;
and finally, segmenting the video into temporal units, taking the abrupt changes of the minimum spatio-temporal semantic information quantity MSSI as boundaries.
The method is described below with reference to specific technical means:
The video spatio-temporal slice is an efficient video analysis method with advantages such as low computational cost and strong robustness. It extracts only part of the rows and columns of the image space while retaining complete video temporal information, and the loss of spatial information can be compensated by multi-slice information fusion. Analyzing the video with the assistance of rich historical temporal information effectively avoids interference. Spatio-temporal slices are typically taken in three directions, horizontal, vertical, and diagonal, as shown in FIG. 1. Video spatio-temporal slices in different directions reflect different target-object information and video scene information. The analysis object of this method is the video caption, which is located at the bottom of the video and arranged horizontally; a horizontal spatio-temporal slice is therefore selected. An example of the horizontal spatio-temporal slice at the video caption center position is shown in FIG. 2.
For a video V(x, y, t), the horizontal spatio-temporal slice S at the video subtitle center position can be represented as:

S(j, i) = V(j, y_c, i)  (1)

where V(j, y_c, i) is the pixel of video V at position x = j, t = i, and y = y_c, the vertical midpoint of the subtitle region; j ∈ [1, W] and i ∈ [1, L], where W denotes the width of a video frame and L denotes the length of the video in frames.
According to formula (1), the horizontal spatio-temporal slice extracts only one row of pixels from the caption image space while retaining complete video temporal information, and its spatial information reflects semantic information such as the structure and presence of the caption. It is therefore feasible to analyze the minimum semantic information of the video with video spatio-temporal slices, and the amount of data to be processed is greatly reduced.
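The slice extraction of formula (1) can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as a NumPy array in (frame, row, column) order; the array layout and function name are choices of this sketch, not part of the patent.

```python
import numpy as np

def extract_horizontal_slice(video, y_c):
    """Extract the horizontal spatio-temporal slice S from a video.

    video: array of shape (L, H, W) -- L grayscale frames of size H x W
           (hypothetical layout chosen for this sketch).
    y_c:   row index at the vertical midpoint of the subtitle region.

    Returns S with shape (W, L): S(j, i) = V(x=j, y=y_c, t=i), i.e. one
    pixel row per frame, so each slice column corresponds to one frame.
    """
    # Take the subtitle-center row of every frame, then transpose so
    # that the time axis runs along the columns of the slice.
    return video[:, y_c, :].T

# Tiny synthetic example: 4 frames of 3x5 pixels.
video = np.arange(4 * 3 * 5, dtype=np.uint8).reshape(4, 3, 5)
S = extract_horizontal_slice(video, y_c=1)
print(S.shape)  # -> (5, 4): W pixels per column, one column per frame
```

Only W pixels per frame are touched, which is the source of the method's low computational cost.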
Video subtitles contain rich video semantic information, and the video semantic content corresponding to one subtitle is relatively complete and remains essentially unchanged. Based on this observation, the method defines the video caption as the sub-shot unit with minimum semantic meaning and correspondingly maps abrupt changes of the minimum spatio-temporal semantic information quantity MSSI to the boundaries of temporal units. Existing sub-shot detection methods are performed on the basis of shot segmentation; their steps are complex and their computation heavy, so they can hardly meet the practical requirement of efficiently processing massive video data. The video MSSI can be analyzed and characterized by video caption spatio-temporal slices, so the method uses video caption spatio-temporal slices to detect abrupt changes in the MSSI.
As can be seen from FIG. 3, for an input video sequence, spatio-temporal slice extraction is performed first: the horizontal spatio-temporal slice S at the video caption center position is extracted from the input video sequence according to formula (1). The caption information in the spatio-temporal slice S represents the minimum spatio-temporal semantic information quantity MSSI; to obtain an accurate MSSI, the horizontal spatio-temporal slice S is preprocessed. The preprocessing uses an adaptive Gaussian mixture background model: each column of the horizontal spatio-temporal slice S is taken as an input to the Gaussian model, and the model parameters are updated column by column. The update formulas for the Gaussian mean μ and variance δ² are:

μ_{t+1} = (1 − α) μ_t + α I_{t+1}
δ²_{t+1} = (1 − α) δ²_t + α (I_{t+1} − μ_{t+1})²  (2)

where I_{t+1} is the luminance of the (t+1)-th column in the spatio-temporal slice S and α is the correction rate, defined as:

α = 1 / M_n  (3)

where M_n is the number of matches.
Each pixel of the spatio-temporal slice S is then tested for whether it obeys the N(μ, δ²) background distribution, and the foreground caption is calculated from equation (4):
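The background-model update and the foreground test above can be sketched as follows. This is a simplified single-Gaussian sketch, not the patent's exact mixture model: the match test |I − μ| ≤ k·δ, the constants k, alpha0, and the initial variance are all assumptions of this sketch, as is updating every row's model each step.

```python
import numpy as np

def segment_foreground(S, alpha0=0.05, k=2.5):
    """Per-row adaptive Gaussian background model over slice columns.

    S: slice of shape (W, L); column t is the subtitle row of frame t.
    Each row j keeps a running Gaussian N(mu, delta^2); columns are
    processed left to right and the parameters updated column by column.
    Pixels that fail the match test are marked as foreground caption.
    """
    S = S.astype(np.float64)
    W, L = S.shape
    mu = S[:, 0].copy()
    var = np.full(W, 25.0)            # initial variance (assumed value)
    matches = np.ones(W)              # M_n: per-row match counts
    F = np.zeros((W, L), dtype=np.uint8)
    for t in range(1, L):
        I = S[:, t]
        matched = np.abs(I - mu) <= k * np.sqrt(var)
        matches[matched] += 1
        # Correction rate alpha = 1 / M_n, floored at alpha0 (assumed).
        alpha = np.maximum(1.0 / matches, alpha0)
        mu = (1 - alpha) * mu + alpha * I
        var = (1 - alpha) * var + alpha * (I - mu) ** 2
        F[~matched, t] = 1            # deviates from N(mu, delta^2): foreground
    return F

# Example: constant background with a short bright burst in row 1.
S_demo = np.full((4, 20), 100.0)
S_demo[1, 10:15] = 220.0
F_demo = segment_foreground(S_demo)
print(F_demo[1, 10], F_demo[0, 10])  # -> 1 0
```

The burst columns in row 1 are flagged as foreground while the static background rows stay at 0, mirroring how the caption strokes separate from the slice background.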
The subtitles on the spatio-temporal slice S are separated from the background as foreground according to equation (4). The minimum spatio-temporal semantic information quantity MSSI of the i-th frame in the video V(x, y, t) can then be calculated by the following formula:
where τ measures the minimum spatio-temporal semantic information quantity of a single pixel; pixels whose MSSI falls below τ are treated as interference and removed.
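One way to realize the MSSI computation is sketched below. The patent's exact per-pixel formula is not reproduced here; as a stated assumption, this sketch scores each foreground pixel by the length of the temporal run it belongs to and discards pixels whose run is shorter than τ (treated as interference), then sums the surviving pixels column-wise to obtain MSSI per frame.

```python
import numpy as np

def mssi_per_frame(F, tau=10):
    """MSSI per frame from the binary foreground mask F of shape (W, L).

    Assumed interpretation: a foreground pixel only counts toward MSSI
    if it belongs to a temporal run of at least tau frames, since a
    caption persists while noise flickers briefly.
    """
    W, L = F.shape
    keep = np.zeros_like(F)
    for j in range(W):
        t = 0
        while t < L:
            if F[j, t]:
                start = t
                while t < L and F[j, t]:
                    t += 1
                if t - start >= tau:          # long run: real caption pixels
                    keep[j, start:t] = 1
            else:
                t += 1
    return keep.sum(axis=0)                   # MSSI_i for each frame i

# Row 0: persistent caption run (length 20); row 1: short noise (length 3).
F_demo = np.zeros((3, 30), dtype=np.uint8)
F_demo[0, 5:25] = 1
F_demo[1, 5:8] = 1
mssi = mssi_per_frame(F_demo, tau=10)
print(int(mssi[6]), int(mssi[0]))  # -> 1 0
```

The short run in row 1 is filtered out, so only the persistent caption contributes to the MSSI curve.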
To complete the segmentation of video temporal units, the boundaries of the temporal units must be detected. An abrupt change of MSSI occurs at a video temporal unit boundary; denoting the abrupt change as Δ, Δ can be calculated according to equation (7) as follows:
As can be seen from equation (7), Δ covers both the case where MSSI suddenly increases and the case where it suddenly decreases, and both correspond to boundaries of video temporal units. For simplicity, the boundary function B of a video temporal unit is defined as:
where w_0 is the significance threshold on the degree of the MSSI abrupt change between the current caption frame and the previous caption frame.
A B-function curve is obtained by calculation according to formula (8); the peaks of the curve correspond to the boundaries of video temporal units, and the video can be segmented into temporal units according to the B-function curve.
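The boundary detection step can be sketched as follows. The concrete forms used here, Δ(i) = MSSI(i) − MSSI(i−1) and B(i) = |Δ(i)| when it exceeds the significance threshold w_0 (0 otherwise), are assumptions consistent with the description: both signs of Δ mark boundaries, and the peaks of B locate them.

```python
import numpy as np

def temporal_unit_boundaries(mssi, w0=20):
    """Detect temporal-unit boundaries from the MSSI sequence.

    mssi: per-frame MSSI values.  Returns the B curve and the frame
    indices where B peaks, i.e. where the MSSI change is significant.
    """
    mssi = np.asarray(mssi, dtype=np.float64)
    # Delta(i) = MSSI(i) - MSSI(i-1): signed, so sudden increases and
    # sudden decreases are both captured.
    delta = np.diff(mssi, prepend=mssi[0])
    # B(i): magnitude of the change where it clears the threshold w0.
    B = np.where(np.abs(delta) > w0, np.abs(delta), 0.0)
    boundaries = np.flatnonzero(B > 0)
    return B, boundaries

# A caption appearing at frame 10 and disappearing at frame 20.
mssi = [0] * 10 + [100] * 10 + [0] * 10
B, boundaries = temporal_unit_boundaries(mssi, w0=20)
print(boundaries.tolist())  # -> [10, 20]
```

Both the caption's appearance (MSSI jump up) and disappearance (MSSI jump down) are reported as boundaries, matching the two cases covered by Δ.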
To verify the effectiveness of the method, it was compared with an existing mainstream method (Petersohn C. Sub-shots - Basic Units of Video [C] // International Workshop on Systems, Signals and Image Processing, 2007, and EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services. IEEE, 2007: 323-). The comparative experiments were performed on five different types of subtitle video, as shown in Table 1:
TABLE 1 Experimental video information
Video 1 is an open course of Renmin University of China; the subtitle text is Chinese, clearly separated from the background, and the shot transitions are abrupt cuts. Video 2 is a TEDxSuzhou talk; the subtitle text is mixed Chinese and English, clearly separated from the background, and the shot transitions are abrupt cuts. Video 3 is a Zhejiang University open course; the subtitle text is Chinese, clearly separated from the background, and the shot transitions combine abrupt cuts and gradual transitions. Video 4 is a TED talk; the subtitle text is English, overlaid on the background and strongly affected by it, and the shot transitions are abrupt cuts. Video 5 is an Oxford University open course; the subtitle text is mixed Chinese and English, partially overlapping the background, and the shot transitions combine abrupt cuts and gradual transitions; the forms are thus relatively diverse. The test parameters were set as τ = 10 and w_0 = 20. The experiments were completed on a general-purpose personal computer with the following basic configuration: Intel(R) Core(TM) i3 M380 @ 2.53 GHz CPU and 8 GB of memory.
The comparison considers three aspects: processing time, recall rate, and accuracy rate. The recall rate R_r is defined as:

R_r = FC_Z / FC_s × 100%

and the accuracy rate R_a is defined as:

R_a = FC_Z / FC_t × 100%

where FC_Z denotes the number of correctly extracted video temporal units, FC_s denotes the actual number of video temporal units, and FC_t denotes the total number of extracted video temporal units.
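The two evaluation measures follow directly from the definitions above; a minimal sketch, with hypothetical counts used purely for illustration (not taken from the experimental tables):

```python
def recall_accuracy(fc_correct, fc_actual, fc_total):
    """Recall R_r = FC_Z / FC_s and accuracy R_a = FC_Z / FC_t,
    expressed as percentages, following the definitions in the text.

    fc_correct: FC_Z, correctly extracted temporal units
    fc_actual:  FC_s, actual number of temporal units
    fc_total:   FC_t, total number of extracted temporal units
    """
    r_r = 100.0 * fc_correct / fc_actual
    r_a = 100.0 * fc_correct / fc_total
    return r_r, r_a

# Hypothetical counts: 45 correct out of 50 actual units, 48 extracted.
r_r, r_a = recall_accuracy(fc_correct=45, fc_actual=50, fc_total=48)
print(r_r, r_a)  # -> 90.0 93.75
```

Recall penalizes missed boundaries (units never found), while accuracy penalizes false detections (extra units reported), so the two must be read together.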
The comparison results are shown in tables 2, 3, 4, 5 and 6 respectively:
table 2 method comparison for video 1
Table 3 method comparison for video 2
Table 4 method comparison for video 3
Table 5 method comparison for video 4
Table 6 method comparison for video 5
The experimental results show that the method achieves higher accuracy when segmenting temporal units. Because only one row of pixels is extracted from the video for detection, the method, compared with the reference method, has strong anti-interference performance, low computational complexity, a small amount of computation, and a clear advantage in computation time.