Disclosure of Invention
The invention aims to provide a video temporal unit segmentation method with strong anti-interference performance, low computational complexity, a small amount of computation, and accurate segmentation.
In order to solve the above technical problems, the technical solution adopted by the invention is a video temporal unit segmentation method, characterized by comprising the following steps:
extracting a horizontal space-time slice at the center position of the video caption;
calculating the minimum space-time semantic information quantity MSSI of each video frame according to the extracted horizontal space-time slices;
detecting abrupt changes in the minimum spatio-temporal semantic information quantity MSSI of the video;
and segmenting the video into temporal units, taking the abrupt changes of the minimum spatio-temporal semantic information quantity MSSI as boundaries.
A further technical solution is that, for a video V(x, y, t), the horizontal spatio-temporal slice S at the video subtitle center position can be represented as:

S(j, i) = V(j, y_c, i)  (1)

where V(j, y_c, i) is the pixel of video V at position x = j, t = i, and y = y_c, the vertical midpoint of the subtitle region; j ∈ [1, W] and i ∈ [1, L], where W denotes the width of a video frame and L denotes the length of the video in frames.
A further technical solution is that the method further comprises preprocessing the horizontal spatio-temporal slice S, as follows:
The preprocessing uses an adaptive Gaussian mixture background model: each column of the horizontal spatio-temporal slice S is taken as an input to the Gaussian model, and the model parameters are updated column by column. The update formulas for the Gaussian mean μ and variance δ² are:

μ_{t+1} = (1 − α) μ_t + α I_{t+1}
δ²_{t+1} = (1 − α) δ²_t + α (I_{t+1} − μ_{t+1})²  (2)

where I_{t+1} is the luminance of the (t+1)-th column in the spatio-temporal slice S and α is the correction rate, defined as:

α = 1 / M_n  (3)

where M_n is the number of matches.
Each pixel of the spatio-temporal slice S is then tested for whether it obeys the N(μ, δ²) background distribution, and the foreground caption is calculated from equation (4):
The subtitles on the horizontal spatio-temporal slice S are separated from the background as foreground according to equation (4). The minimum spatio-temporal semantic information quantity MSSI of the i-th frame in the video V(x, y, t) can then be calculated by the following formula:
where τ measures the minimum spatio-temporal semantic information quantity of a single pixel; pixels whose MSSI falls below τ are treated as interference and removed.
A further technical solution is that a video temporal unit boundary produces an abrupt change of MSSI; denoting the abrupt change as Δ, Δ can be calculated according to equation (7) as follows:
As can be seen from equation (7), Δ covers both the case where MSSI suddenly increases and the case where it suddenly decreases, and both correspond to boundaries of video temporal units. The boundary function B of a video temporal unit is defined as:
where w_0 is the significance threshold on the degree of the MSSI abrupt change between the current caption frame and the previous caption frame;
a B-function curve is obtained by calculation according to formula (8); the peaks of the curve correspond to the boundaries of video temporal units, and the video can be segmented into temporal units according to the B-function curve.
The beneficial effects produced by the above technical solution are as follows: the method defines the video caption as the sub-shot unit with minimum semantic meaning and correspondingly maps abrupt changes of the Minimum Spatio-temporal Semantic Information quantity (MSSI) to the boundaries of video temporal units. Because only one row of pixels is extracted from the video for detection, the method, compared with the reference methods, has strong anti-interference performance, low computational complexity, a small amount of computation, and a clear advantage in computation time.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, the present invention may be practiced in ways other than those specifically described here, as will be readily apparent to those of ordinary skill in the art, without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.
Generally, as shown in fig. 3, the present invention discloses a video time domain unit segmentation method, which comprises the following steps:
firstly, extracting a horizontal space-time slice at the center position of a video caption;
secondly, calculating the minimum space-time semantic information quantity MSSI of each video frame according to the extracted horizontal space-time slices;
then, detecting abrupt changes in the minimum spatio-temporal semantic information quantity MSSI of the video;
and finally, segmenting the video into temporal units, taking the abrupt changes of the minimum spatio-temporal semantic information quantity MSSI as boundaries.
The method is described below with reference to specific technical means:
The video spatio-temporal slice is an efficient video analysis method with advantages such as low computational cost and strong robustness. It extracts only part of the rows and columns of the image space while retaining complete video temporal information, and the loss of spatial information can be compensated by multi-slice information fusion. Analyzing the video with the assistance of rich historical temporal information effectively avoids interference. Spatio-temporal slices are typically taken in three directions, horizontal, vertical, and diagonal, as shown in FIG. 1. Video spatio-temporal slices in different directions reflect different target-object information and video scene information. The analysis object of this method is the video caption, which is located at the bottom of the video and arranged horizontally; a horizontal spatio-temporal slice is therefore selected. An example of the horizontal spatio-temporal slice at the video caption center position is shown in FIG. 2.
For a video V(x, y, t), the horizontal spatio-temporal slice S at the video subtitle center position can be represented as:

S(j, i) = V(j, y_c, i)  (1)

where V(j, y_c, i) is the pixel of video V at position x = j, t = i, and y = y_c, the vertical midpoint of the subtitle region; j ∈ [1, W] and i ∈ [1, L], where W denotes the width of a video frame and L denotes the length of the video in frames.
According to formula (1), the horizontal spatio-temporal slice extracts only one row of pixels from the caption image space while retaining complete video temporal information, and its spatial information reflects semantic information such as the structure and presence of the caption. It is therefore feasible to analyze the minimum semantic information of the video with video spatio-temporal slices, and the amount of data to be processed is greatly reduced.
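The slice extraction of formula (1) can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as a NumPy array in (frame, row, column) order; the array layout and function name are choices of this sketch, not part of the patent.

```python
import numpy as np

def extract_horizontal_slice(video, y_c):
    """Extract the horizontal spatio-temporal slice S from a video.

    video: array of shape (L, H, W) -- L grayscale frames of size H x W
           (hypothetical layout chosen for this sketch).
    y_c:   row index at the vertical midpoint of the subtitle region.

    Returns S with shape (W, L): S(j, i) = V(x=j, y=y_c, t=i), i.e. one
    pixel row per frame, so each slice column corresponds to one frame.
    """
    # Take the subtitle-center row of every frame, then transpose so
    # that the time axis runs along the columns of the slice.
    return video[:, y_c, :].T

# Tiny synthetic example: 4 frames of 3x5 pixels.
video = np.arange(4 * 3 * 5, dtype=np.uint8).reshape(4, 3, 5)
S = extract_horizontal_slice(video, y_c=1)
print(S.shape)  # -> (5, 4): W pixels per column, one column per frame
```

Only W pixels per frame are touched, which is the source of the method's low computational cost.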
Video subtitles contain rich video semantic information, and the video semantic content corresponding to one subtitle is relatively complete and remains essentially unchanged. Based on this observation, the method defines the video caption as the sub-shot unit with minimum semantic meaning and correspondingly maps abrupt changes of the minimum spatio-temporal semantic information quantity MSSI to the boundaries of temporal units. Existing sub-shot detection methods are performed on the basis of shot segmentation; their steps are complex and their computation heavy, so they can hardly meet the practical requirement of efficiently processing massive video data. The video MSSI can be analyzed and characterized by video caption spatio-temporal slices, so the method uses video caption spatio-temporal slices to detect abrupt changes in the MSSI.
As can be seen from FIG. 3, for an input video sequence, spatio-temporal slice extraction is performed first: the horizontal spatio-temporal slice S at the video caption center position is extracted from the input video sequence according to formula (1). The caption information in the spatio-temporal slice S represents the minimum spatio-temporal semantic information quantity MSSI; to obtain an accurate MSSI, the horizontal spatio-temporal slice S is preprocessed. The preprocessing uses an adaptive Gaussian mixture background model: each column of the horizontal spatio-temporal slice S is taken as an input to the Gaussian model, and the model parameters are updated column by column. The update formulas for the Gaussian mean μ and variance δ² are:

μ_{t+1} = (1 − α) μ_t + α I_{t+1}
δ²_{t+1} = (1 − α) δ²_t + α (I_{t+1} − μ_{t+1})²  (2)

where I_{t+1} is the luminance of the (t+1)-th column in the spatio-temporal slice S and α is the correction rate, defined as:

α = 1 / M_n  (3)

where M_n is the number of matches.
Each pixel of the spatio-temporal slice S is then tested for whether it obeys the N(μ, δ²) background distribution, and the foreground caption is calculated from equation (4):
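The background-model update and the foreground test above can be sketched as follows. This is a simplified single-Gaussian sketch, not the patent's exact mixture model: the match test |I − μ| ≤ k·δ, the constants k, alpha0, and the initial variance are all assumptions of this sketch, as is updating every row's model each step.

```python
import numpy as np

def segment_foreground(S, alpha0=0.05, k=2.5):
    """Per-row adaptive Gaussian background model over slice columns.

    S: slice of shape (W, L); column t is the subtitle row of frame t.
    Each row j keeps a running Gaussian N(mu, delta^2); columns are
    processed left to right and the parameters updated column by column.
    Pixels that fail the match test are marked as foreground caption.
    """
    S = S.astype(np.float64)
    W, L = S.shape
    mu = S[:, 0].copy()
    var = np.full(W, 25.0)            # initial variance (assumed value)
    matches = np.ones(W)              # M_n: per-row match counts
    F = np.zeros((W, L), dtype=np.uint8)
    for t in range(1, L):
        I = S[:, t]
        matched = np.abs(I - mu) <= k * np.sqrt(var)
        matches[matched] += 1
        # Correction rate alpha = 1 / M_n, floored at alpha0 (assumed).
        alpha = np.maximum(1.0 / matches, alpha0)
        mu = (1 - alpha) * mu + alpha * I
        var = (1 - alpha) * var + alpha * (I - mu) ** 2
        F[~matched, t] = 1            # deviates from N(mu, delta^2): foreground
    return F

# Example: constant background with a short bright burst in row 1.
S_demo = np.full((4, 20), 100.0)
S_demo[1, 10:15] = 220.0
F_demo = segment_foreground(S_demo)
print(F_demo[1, 10], F_demo[0, 10])  # -> 1 0
```

The burst columns in row 1 are flagged as foreground while the static background rows stay at 0, mirroring how the caption strokes separate from the slice background.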
The subtitles on the spatio-temporal slice S are separated from the background as foreground according to equation (4). The minimum spatio-temporal semantic information quantity MSSI of the i-th frame in the video V(x, y, t) can then be calculated by the following formula:
where τ measures the minimum spatio-temporal semantic information quantity of a single pixel; pixels whose MSSI falls below τ are treated as interference and removed.
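One way to realize the MSSI computation is sketched below. The patent's exact per-pixel formula is not reproduced here; as a stated assumption, this sketch scores each foreground pixel by the length of the temporal run it belongs to and discards pixels whose run is shorter than τ (treated as interference), then sums the surviving pixels column-wise to obtain MSSI per frame.

```python
import numpy as np

def mssi_per_frame(F, tau=10):
    """MSSI per frame from the binary foreground mask F of shape (W, L).

    Assumed interpretation: a foreground pixel only counts toward MSSI
    if it belongs to a temporal run of at least tau frames, since a
    caption persists while noise flickers briefly.
    """
    W, L = F.shape
    keep = np.zeros_like(F)
    for j in range(W):
        t = 0
        while t < L:
            if F[j, t]:
                start = t
                while t < L and F[j, t]:
                    t += 1
                if t - start >= tau:          # long run: real caption pixels
                    keep[j, start:t] = 1
            else:
                t += 1
    return keep.sum(axis=0)                   # MSSI_i for each frame i

# Row 0: persistent caption run (length 20); row 1: short noise (length 3).
F_demo = np.zeros((3, 30), dtype=np.uint8)
F_demo[0, 5:25] = 1
F_demo[1, 5:8] = 1
mssi = mssi_per_frame(F_demo, tau=10)
print(int(mssi[6]), int(mssi[0]))  # -> 1 0
```

The short run in row 1 is filtered out, so only the persistent caption contributes to the MSSI curve.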
To complete the segmentation of video temporal units, the boundaries of the temporal units must be detected. An abrupt change of MSSI occurs at a video temporal unit boundary; denoting the abrupt change as Δ, Δ can be calculated according to equation (7) as follows:
As can be seen from equation (7), Δ covers both the case where MSSI suddenly increases and the case where it suddenly decreases, and both correspond to boundaries of video temporal units. For simplicity, the boundary function B of a video temporal unit is defined as:
where w_0 is the significance threshold on the degree of the MSSI abrupt change between the current caption frame and the previous caption frame.
A B-function curve is obtained by calculation according to formula (8); the peaks of the curve correspond to the boundaries of video temporal units, and the video can be segmented into temporal units according to the B-function curve.
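The boundary detection step can be sketched as follows. The concrete forms used here, Δ(i) = MSSI(i) − MSSI(i−1) and B(i) = |Δ(i)| when it exceeds the significance threshold w_0 (0 otherwise), are assumptions consistent with the description: both signs of Δ mark boundaries, and the peaks of B locate them.

```python
import numpy as np

def temporal_unit_boundaries(mssi, w0=20):
    """Detect temporal-unit boundaries from the MSSI sequence.

    mssi: per-frame MSSI values.  Returns the B curve and the frame
    indices where B peaks, i.e. where the MSSI change is significant.
    """
    mssi = np.asarray(mssi, dtype=np.float64)
    # Delta(i) = MSSI(i) - MSSI(i-1): signed, so sudden increases and
    # sudden decreases are both captured.
    delta = np.diff(mssi, prepend=mssi[0])
    # B(i): magnitude of the change where it clears the threshold w0.
    B = np.where(np.abs(delta) > w0, np.abs(delta), 0.0)
    boundaries = np.flatnonzero(B > 0)
    return B, boundaries

# A caption appearing at frame 10 and disappearing at frame 20.
mssi = [0] * 10 + [100] * 10 + [0] * 10
B, boundaries = temporal_unit_boundaries(mssi, w0=20)
print(boundaries.tolist())  # -> [10, 20]
```

Both the caption's appearance (MSSI jump up) and disappearance (MSSI jump down) are reported as boundaries, matching the two cases covered by Δ.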
To verify the effectiveness of the method, it was compared with an existing mainstream method (Petersohn C. Sub-shots - Basic Units of Video [C] // International Workshop on Systems, Signals and Image Processing, 2007, and EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services. IEEE, 2007: 323-). The comparative experiments were performed on five different types of subtitle video, as shown in Table 1:
TABLE 1 Experimental video information
Video 1 is an open course of Renmin University of China; the subtitle text is Chinese, clearly separated from the background, and the shot transitions are abrupt cuts. Video 2 is a TEDxSuzhou talk; the subtitle text is mixed Chinese and English, clearly separated from the background, and the shot transitions are abrupt cuts. Video 3 is a Zhejiang University open course; the subtitle text is Chinese, clearly separated from the background, and the shot transitions combine abrupt cuts and gradual transitions. Video 4 is a TED talk; the subtitle text is English, overlaid on the background and strongly affected by it, and the shot transitions are abrupt cuts. Video 5 is an Oxford University open course; the subtitle text is mixed Chinese and English, partially overlapping the background, and the shot transitions combine abrupt cuts and gradual transitions; the forms are thus relatively diverse. The test parameters were set as τ = 10 and w_0 = 20. The experiments were completed on a general-purpose personal computer with the following basic configuration: Intel(R) Core(TM) i3 M380 @ 2.53 GHz CPU and 8 GB of memory.
The comparison considers three aspects: processing time, recall rate, and accuracy rate. The recall rate R_r is defined as:

R_r = FC_Z / FC_s × 100%

and the accuracy rate R_a is defined as:

R_a = FC_Z / FC_t × 100%

where FC_Z denotes the number of correctly extracted video temporal units, FC_s denotes the actual number of video temporal units, and FC_t denotes the total number of extracted video temporal units.
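The two evaluation measures follow directly from the definitions above; a minimal sketch, with hypothetical counts used purely for illustration (not taken from the experimental tables):

```python
def recall_accuracy(fc_correct, fc_actual, fc_total):
    """Recall R_r = FC_Z / FC_s and accuracy R_a = FC_Z / FC_t,
    expressed as percentages, following the definitions in the text.

    fc_correct: FC_Z, correctly extracted temporal units
    fc_actual:  FC_s, actual number of temporal units
    fc_total:   FC_t, total number of extracted temporal units
    """
    r_r = 100.0 * fc_correct / fc_actual
    r_a = 100.0 * fc_correct / fc_total
    return r_r, r_a

# Hypothetical counts: 45 correct out of 50 actual units, 48 extracted.
r_r, r_a = recall_accuracy(fc_correct=45, fc_actual=50, fc_total=48)
print(r_r, r_a)  # -> 90.0 93.75
```

Recall penalizes missed boundaries (units never found), while accuracy penalizes false detections (extra units reported), so the two must be read together.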
The comparison results are shown in tables 2, 3, 4, 5 and 6 respectively:
table 2 method comparison for video 1
Table 3 method comparison for video 2
Table 4 method comparison for video 3
Table 5 method comparison for video 4
Table 6 method comparison for video 5
The experimental results show that the method achieves higher accuracy when segmenting temporal units. Because only one row of pixels is extracted from the video for detection, the method, compared with the reference method, has strong anti-interference performance, low computational complexity, a small amount of computation, and a clear advantage in computation time.