CN101853381A

CN101853381A - Method and device for acquiring video subtitle information

Info

Publication number: CN101853381A
Application number: CN200910081051A
Authority: CN
Inventors: 杨锦春; 刘贵忠; 钱学明; 李智; 郭旦萍; 姜海侠; 南楠; 孙力; 王琛
Original assignee: Huawei Technologies Co Ltd; Xian Jiaotong University
Current assignee: Xian Jiaotong University; Huawei Cloud Computing Technologies Co Ltd
Priority date: 2009-03-31
Filing date: 2009-03-31
Publication date: 2010-10-06
Anticipated expiration: 2029-03-31
Also published as: CN101853381B

Abstract

The invention relates to a method and a device for acquiring video subtitle information. The method comprises the following steps: carrying out small wave-based subtitle detection on the luminance component image of a data frame in a video stream; acquiring detected attribute information of subtitles; and extracting the detected subtitles according to the attribute information, thus accurately acquiring the subtitle information in the data frame. The area in which the subtitles are located does not need not to be limited due to the small wave-based subtitle detection, thus the embodiment of the invention can accurately acquire the subtitle information in the video data under the condition of not limiting the subtitle position area.

Description

Video caption information getting method and device

Technical field

The present invention relates to the application electric technology field, relate in particular to a kind of video caption information getting method and device.

Background technology

Video caption gives intuitively that form represents video program content, can assist people to hold the theme of program well effectively in video is appreciated, and then understand the content of video.The detection and Identification of video caption information can be enriched the inquiry of text based video content in addition.Therefore video caption information effectively being obtained is a necessary link.

The inventor finds in realizing process of the present invention, existing obtaining in the technology of caption information, relatively more responsive to the positional information that caption information appears in the video pictures, and generally, suppose that caption area is static, and subtitle position also is the lower middle portion that is fixed on image, if caption information not in specified sensing range, caption information can not be obtained well and be used so.

Summary of the invention

The embodiment of the invention provides a kind of video caption information getting method and device, thereby under the situation that does not limit the captions band of position, accurately obtains the caption information in the video data.

The embodiment of the invention provides a kind of video caption information getting method, comprising:

Luminance component image to Frame in the video flowing carries out detecting based on the captions of small echo;

Obtain the attribute information of detected captions;

According to described attribute information, extract detected captions.

The embodiment of the invention also provides a kind of video caption information acquisition device, comprising:

Detection module is used for the luminance component image of video flowing Frame is carried out detecting based on the captions of small echo;

First acquisition module is used to obtain the attribute information of the detected captions of described detection module;

Extraction module is used for belonging to information according to the captions that described first acquisition module obtains, and extracts the detected captions of described detection module.

The technical scheme that is provided by the invention described above embodiment as can be seen, in the embodiment of the invention, undertaken detecting by luminance component image, and obtain the attribute information of detected captions based on the captions of small echo to Frame in the video flowing, according to described attribute information, extract detected captions.Thereby accurately obtain the caption information in the Frame.Because the captions based on small echo detect, and need not the zone at captions place is limited, therefore, the embodiment of the invention can accurately be obtained the caption information in the video data under the situation that does not limit the captions band of position.

Description of drawings

The described method flow synoptic diagram one that Fig. 1 provides for the embodiment of the invention;

The described method flow synoptic diagram two that Fig. 2 provides for the embodiment of the invention;

The described method flow synoptic diagram three that Fig. 3 provides for the embodiment of the invention;

The described apparatus structure synoptic diagram one that Fig. 4 provides for the embodiment of the invention;

The described apparatus structure synoptic diagram two that Fig. 5 provides for the embodiment of the invention;

The described detection module structural representation one that Fig. 6 provides for the embodiment of the invention;

The described detection module structural representation two that Fig. 7 provides for the embodiment of the invention;

The described first acquisition module structural representation that Fig. 8 provides for the embodiment of the invention;

The described extraction module structural representation that Fig. 9 provides for the embodiment of the invention.

Embodiment

The embodiment of the invention provides a kind of video caption information getting method, as shown in Figure 1, this method is undertaken detecting based on the captions of small echo by the luminance component image to Frame, and obtains the attribute information of detected captions, according to described attribute information, extract detected captions.Thereby accurately obtain the caption information in the Frame.Because the captions based on small echo detect, and need not the zone at captions place is limited, therefore, the embodiment of the invention can accurately be obtained the caption information in the video data under the situation that does not limit the captions band of position.

Specific embodiment of the video caption information getting method that the embodiment of the invention provides can be as shown in Figure 2, and this embodiment specifically can comprise:

Step 21 is obtained the luminance component image of specific data frame from video data stream.

In order to accelerate to obtain the speed of caption information, the embodiment of the invention data designated frame of specifically can from video data stream, decoding, and obtain the luminance component image of specific data frame.

Such as, the frame number of only decoding is the intraframe coding of odd number (or even number), be that the I frame (also can be other forms of frame of video, as encoded predicted frame, be the P frame) code stream, obtain the luminance component image of I frame, and to the chromatic component of I frame, and other frame then skips fast, thereby accelerated to obtain the speed of caption information.

Need to prove that the embodiment of the invention does not limit the compressed format of video data stream.

Step 22 carries out detecting based on the captions of small echo to the luminance component image of the Frame chosen.

Concrete, for the luminance component image of the Frame of having chosen, adopt captions to detect in this step based on small echo.

In a specific embodiment, the concrete implementation of this step can comprise as shown in accompanying drawing 3:

Step 221 is carried out wavelet transformation to the luminance component image of Frame, obtains horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps.

Related wavelet transformation in the embodiment of the invention is specifically as follows HAAR (Ha Er) wavelet transformation, Mexico's straw hat wavelet transformation, and the 9-7 wavelet transformation, the 5-3 wavelet transformation, or the like.

In this step, luminance component image to the Frame chosen, carry out wavelet transformation, to obtain a low frequency sub-band, high-frequency sub-band with level, vertical, three directions of diagonal line, wherein, the sub-high frequency band of level can be designated as that H, vertical high frequency subband can be designated as V, the diagonal line high-frequency sub-band can be designated as D.

H, the V that generates behind the wavelet transformation, the coefficient of three high-frequency sub-band of D are asked absolute value respectively, obtain horizontal high-frequency sub-band texture maps (CH), vertical high frequency subband texture maps (CV) and diagonal line high-frequency sub-band texture maps (CD).

Can also obtain comprehensive high-frequency sub-band texture maps (CS) in conjunction with three high-frequency sub-band texture maps (CH, CV, CD) in this step.

The value of each point can obtain by following formula in the comprehensive high-frequency sub-band texture image:

CS(i，j)＝CH(i，j)+CV(i，j)+CD(i，j)

Step 222 according to horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps, is obtained the captions dot image (TextPnt) of Frame.

In a specific embodiment, in this step, specifically can comprise following link:

At first, according to the high-frequency sub-band texture maps, generate initial captions dot image.

With horizontal high-frequency sub-band texture maps is example, horizontal high-frequency sub-band texture maps is carried out captions point detect, to obtain the initial captions dot image of this horizontal high-frequency sub-band (MAPH_ORG).

Wherein, the initial captions dot image of this horizontal high-frequency sub-band coordinate (i, the value of j) locating is to calculate according to following formula:

MAPH_ORG (i, j) = \{\begin{matrix} 1, & CH (i, j) &GreaterEqual; TH \\ 0, & CH (i, j) < TH \end{matrix}

Need to prove that value is " 0 " expression background, value is the initial captions point of " 1 " expression, and the computing method of threshold value in the formula (TH) can be as follows:

TH = \{\begin{matrix} 50, & MH * 5 &GreaterEqual; 50 \\ MH * 5, & 50 > MH * 5 > 18 \\ 18, & MH * 5 \leq 18 \end{matrix}

MH in the formula is a texture strength average in the horizontal high-frequency sub-band texture image.

Then, the initial captions dot image of horizontal high-frequency sub-band is removed noise processed, with the final captions dot image of the horizontal direction that obtains (MAPH).

The related noise processed of removing specifically can adopt the proven technique processing schemes such as square filtering of sliding as overlapping in the embodiment of the invention, and the embodiment of the invention does not limit this.

Then, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps are carried out similar treatment step to obtain the initial captions dot image of initial captions dot image of vertical subband (MAPV_ORG) and diagonal line subband (MAPD_ORG), and initial captions dot image of vertical subband and the initial captions dot image of diagonal line subband removed noise processed respectively, to obtain the final captions dot image of final captions dot image of vertical direction (MAPV) and diagonal (MAPD).

At last, the captions dot image (TextPnt) that the final captions dot image (MAPH, MAPV, MAPD) of three directions is sought common ground and obtains Frame.

Need to prove, in the embodiment of the invention, initial captions dot image (MAP_ORG) is removed the captions noise spot, the specific implementation method flow that obtains caption area can adopt following program to realize:

//h, w represent the height and the width of sub-band images respectively

Block=4; The size of // square

Dis=3; The distance of // square skew each time

H_num=(h/dis); // square is at the number of times of vertical offset

W_num=(w/dis); The number of times that // square is offset in the horizontal direction

MAP＝MAPH_ORG；

for(k＝1:h_num)

for(I＝1:w_num)

if(((k-1)＊dis+1+block＞h)||((I-1)＊dis+1+block＞w))

Continue; // as the fruit piece border of having shifted out image, jump out circulation

else

Num=TextPntNum (); The captions point is contained in // statistics square inside

Number

if(num＜(block＊block/2))

StartH＝(k-1)＊dis；

EndH＝StartH+block；

StartW＝(I-1)＊dis；

EndW＝StartW+block；

MAP(StartH:End?H，StartW:EndW)＝0；

If // number is less than (block*block/2), these all pixels of square zone are established in being

Be 0,

// be that captions point in this square is a noise spot

If else//number is greater than (block*block/2), this square zone is real captions point

MAP(StartH:EndH，StartW:EndW)＝MAP_ORG(StartH:EndH，

StartW:EndW)

end

Be understandable that above example only for illustrating, does not play the effect of any restriction to the protection domain of the embodiment of the invention.

Step 223 is by the captions dot image generation caption area image (TextArea) of Frame.

In a specific embodiment, specifically can comprise following link in this step:

At first, closed operation and the opening operation that the captions dot image of having obtained is carried out horizontal direction respectively obtains horizontal image (Verlmg).

Wherein, the structural element of closed operation can be complete " 1 " matrix of 20*1, and the structural element of opening operation can be complete " 1 " matrix of 1*2, and certainly, the structural element that closed operation and opening operation are adopted can be arranged according to actual needs flexibly.

Then, closed operation and the opening operation that the captions dot image is carried out vertical direction obtains vertical image (Horlmg).

Equally, the structural element of closed operation can be complete " 1 " matrix of 1*20, and the structural element of opening operation can be complete " 1 " matrix of 2*1;

Then, the horizontal image and the vertical image that obtain are asked the union operation, to obtain comprising the maximum point set image (Img) of all caption areas, its concrete preparation method is as follows:

Next, maximum point set image is carried out closed operation to obtain the caption area image.

The structural element of closed operation can adopt complete " 1 " matrix, perhaps other matrix of 6*6.

Step 224 is determined the bar number and the caption area positional information of captions in the caption area image.

At first, each caption area in the caption area image is carried out the differentiation of captions for horizontal or homeotropic alignment.

The method of distinguishing is the relative size according to caption area Gao Yukuan.Concrete, if caption area be wider than height, then the captions in this caption area be horizontally, if caption area is wide less than height, then the interior captions of this caption area are homeotropic alignment.

Need to prove that the confirmation method of the caption area in the caption area image can adopt the labeling method in the morphology, perhaps other ripe method confirms that the embodiment of the invention does not limit this.

For captions is horizontal caption area, determine this caption area corresponding zone in horizontal image, and, by the coordinate position of the upper and lower, left and right pixel of this caption area in horizontal image, determine the position of this caption area upper side frame, lower frame, left frame, left frame in horizontal image.

For captions is the caption area of homeotropic alignment, determine this caption area corresponding zone in vertical image, and employing and above-mentioned captions are the horizontal same method of caption area, obtain the position of this caption area upper side frame, lower frame, left frame, left frame in vertical image.

Then, corresponding comprehensive subband texture maps (CS) The corresponding area is carried out horizontal projection in the caption area posting, and, determine the upper side frame and the lower frame position of caption strips number in the comprehensive subband texture maps and every horizontal captions from the peak valley information of comprehensive subband texture maps drop shadow curve.

Concrete, can determine the bar number of captions in the caption area by the quantity of trough in the drop shadow curve, this process specifically can comprise:

Texture average in the comprehensive subband texture maps is obtained threshold value divided by a parameter (alfa).If the value of drop shadow curve is trough less than this threshold value.Because the position of trough is exactly two centre positions between the captions, thereby by determining the quantity of trough, determine the bar number of captions in this caption area, promptly the trough number adds 1.Need to prove that in embodiments of the present invention, the span of parameter (alfa) can be [2,3], after the practical operation check, recommended parameter alfa=2.6.

In addition, because the upper and lower bezel locations of the captions that trough is separated is the top and the terminal coordinate position of corresponding trough respectively, therefore, by determining the position at trough place, can determine in this caption area the upper side frame of every horizontal captions and the position of lower frame.

Captions for homeotropic alignment, vertical projection is carried out in corresponding comprehensive subband texture maps zone in the caption area posting, and from the definite wherein caption strips number of peak valley relation of drop shadow curve and the left frame and the left frame position of every vertical captions, its concrete implementation method is identical with horizontal captions.

By aforesaid operations, can determine the information such as position that captions occur in video flowing.

Optionally, in one embodiment,, can further include in order to improve the accuracy of detection:

Whether step 225 is the detection of real caption area to caption area.

Owing in captions detect, may have error-detecting, it is caption area that the zone that is not captions is detected, and therefore, need carry out authenticity verification to the caption area of confirming, can effectively promote the performance that captions detect like this.

Concrete, can determine whether surveyed area is real caption area according to the distribution situation of the distribution of captions texture, intensity profile and number of edge points.

When a caption area is true caption area, trough in the projection on the corresponding comprehensive subband texture maps, and being evenly distributed of the trough of the low frequency component image projection behind the wavelet transformation.Wherein the detection method of trough is with what put down in writing in the step 224, and the length scale that uniform measure is a trough is no more than crest, and the variance of trough is less.

Step 23 is obtained the attribute information of detected captions

Concrete, in this step, can mate and follow the tracks of operation detected captions, determine the captions attribute information.

The captions matching operation is to judge according to the captions detection case of last I frame and current I frame whether detected captions mate, if coupling then show that the captions that are complementary belong to same captions otherwise belongs to different captions.

Whether the I frame that adjacent two needs are carried out the captions detection needs to carry out the tracking of captions coupling, is to judge according to caption strips number detected in this two frame and by the following situation that may occur:

1) if the caption strips number average of last I frame and current I frame is 0, then need not to mate and follow the tracks of operation;

2) if the caption strips quantity of last I frame is 0, and the caption strips quantity of current I frame is not 0, and the caption strips number that then can determine the current I frame all is emerging captions, needs so to mate and follows the tracks of operation, to determine the start frame of captions in the current I frame.

When doing the start frame judgement, at first need to handle according to captions match condition and determined captions attribute in current I frame and next I frame.If do not have captions in next I frame or captions are arranged but and in the current I frame captions that detect do not match, then with the captions that detect in the current I frame as false retrieval and rejected, otherwise the caption strips that newly occurs that is detected in the current I frame is carried out captions and is followed the tracks of.

3) if the caption strips quantity of last I frame is not 0, and the caption strips quantity of current I frame is 0, and then the caption strips of current I frame is the disappearance caption strips, needs so to mate and follows the tracks of operation, to determine the abort frame of captions in the current I frame.

4) if the caption strips number average of last I frame and current I frame is not 0, then need the captions among last I frame and the present frame I are mated and follow the tracks of operation, to determine which captions mates in the last I frame, which disappears, and which captions is couplings in the current I frame, and which is emerging.For in last I frame, which need determine the abort frame of these captions at last I frame to the I frame that disappears between the current I frame, need be from last I frame to the appearance frame of determining these captions the current I frame for emerging caption strips in the current I frame.

So as can be seen, as long as have the caption strips number of a frame non-vanishing in last I frame or the current I frame, promptly need to mate and follow the tracks of operation.

In the embodiment of the invention, can be by the mode of sampling matching, realize the matching operation of captions, promptly calculate in the current I frame (the minimum average B configuration absolute error (MAD:Mean AbsoluteDifference) in the coupling of sliding of 1≤q≤n) of any captions q who did not mate among the captions p to be matched and next I frame, from n bar captions couplings, choose MAD value minimum then, as the optimum matching captions, and judge further whether this minimum MAD satisfies the least commitment threshold value.

Concrete, for captions q and next I frame captions p of current I frame, the position of the frame up and down at captions place is respectively

And

If two I frames all are horizontally, then extract captions q and next I frame captions p of current I frame, in the public domain in the horizontal direction, the maximal value of left side frame

And the minimum value of the right frame

If Rpq-Lpq, then thinks do not match (threshold value herein specifically can be 10) smaller or equal to threshold value; If greater than threshold value, then extract in the public domain on the horizontal direction center of next I frame captions p

Round[wherein] expression rounds) the pixel IP that locates (and cy Lpq:Rpq), determines the captions q of itself and current I frame by methods such as the couplings of sliding, highly for the IC of y place (y, the matching error MAD of pixel bars Lpq:Rpq) (y, q), and best match position

Specifically can calculate and obtain by following formula:

MAD (y, q) = \frac{1}{(Rpq - Lpq)} Σ_{x = Lpq}^{Rpq} | IP (cy, x) - IC (y, x) |, y &Element; [U_{IC}^{q}, D_{IC}^{q}]

y_{q} = \arg \min_{y} {MAD (y, q)}

q_{0} = \arg \min_{q} {MAD (q)}

If in best match position

Under MAD (q ₀)≤MAD _Th, then think to mate captions.In the embodiment of the invention, the preferable value of threshold value MADth can be MAD _Th=20.

If all be homeotropic alignment, then extract the captions q of current I frame and the captions p of next I frame, in the public domain in vertical direction, the maximal value of top frame

Minimum value with following frame

If Dpq-Upq≤10 then think do not match; If greater than threshold value, then extract in vertical direction in the public domain center of next I frame captions p

Center pixel IP (the Upq:Dpq at place, cx), determine it and frame captions q before I by methods such as the couplings of sliding, width be the IC of x place (Upq:Dpq, the matching error of pixel bars x) be MAD (x, q), and best match position x0, concrete method and above-mentioned horizontal captions are similar, therefrom select minimum MAD to be worth pairing captions as optimum matching then, if best match position

MAD (q ₀)≤MAD _ThThen think to mate captions.

For the captions on the coupling, can follow the tracks of operation to it, thereby determine the position of start frame and abort frame in the captions.

Concrete, the matching speed that can be calculated according to the relative position difference from the captions coupling is divided into two types of static captions and roll titless.If the captions of coupling are carried out the invariant position in the frame that captions detect then are judged as static captions at two, otherwise are judged as roll titles.

If roll titles, then according to the position at roll titles place in matching speed and the present frame, determine that a certain frame of this captions frame before present frame enters image frame just, and a certain frame after present frame just exceeds the Frame of the pairing frame number of image frame scope, as frame and abort frame occurring.

If static captions, then visit image sets (the GOP:group of pictures: video flowing image sets) at former frame place, and the luminance component image of every frame wherein carried out decode operation, obtain its caption area direct current (DC) image simultaneously, calculating is in this GOP, the mean absolute error MAD value of caption area DC image is determined the appearance frame and the abort frame of static captions according to the MAD value.

During static caption strips in above-mentioned steps is followed the tracks of in GOP the mean absolute error of caption area DC image be to mate and be achieved by extracting DC lines in this zone.Specific as follows:

At first, realize the frame between former frame and the present frame is carried out partial decoding of h and obtains the DC image.

Then, draw its corresponding coordinate position in the DC image, and extract the DC of the central block place lines of captions region in the DC image therebetween according to the captions bezel locations that is drawn in the present frame.

Next, calculate the given frame i and the DC lines difference value of present frame.

When extracting the DC lines, to consider the orientation of captions.For horizontal captions, the DC lines difference value MADDC (i) of i frame wherein and present frame, specifically can obtain by following formula:

MADDC (i) = \frac{1}{L} Σ_{dcx = 1}^{L} | DC (dcy, dcx, IC) - DC (dcy, dcx, i) | IP \leq i \leq IC

Wherein (y, x i) represent the pairing DC image of i frame to DC, and dcy represents caption area center in vertical direction in the DC image.

Computing method and top method for the homeotropic alignment captions are similar.

For the determination methods that frame or abort frame occur, can determine by on the MADDC curve, seeking catastrophe point.Shown in the following formula of concrete grammar:

Wherein th1 and th2 are the constraint threshold values of judging catastrophe point, and the preferable constraint threshold value of selecting for use in the embodiment of the invention is th1=3.5, th2=9.

If be the center with the present frame, search radius is not find catastrophe point in 2 GOP length ranges, and then the captions that this caption strips is surveyed as false retrieval are rejected; Otherwise find out nearest Frame before or after the present frame, as frame or abort frame occurring.

Following formula is to horizontal captions calculated difference value, obtains for the computing method and the top similar method of homeotropic alignment captions.

Step 24 according to the attribute information of captions, is extracted detected captions.

Need to prove, in the video caption information getting method that the embodiment of the invention provides, the captions attribute information that can real-time record have obtained.

The captions attribute information specifically can comprise essential information, scene information and the match information etc. of captions.

Essential information specifically can comprise the base attribute information of these captions, and detection information etc.;

Scene information specifically can comprise the start frame and the abort frame of these captions, and whether captions cross over camera lens sign etc.;

Match information specifically can comprise the sign that whether mates, and the positional information of coupling etc.

Wherein, the embodiment of the invention is for the determination methods of whether crossing over camera lens, can adopt in the interval at Frame before the start frame that is write down and the Frame place after the abort frame and carry out maturation methods such as scene change-detection.The embodiment of the invention does not limit this.

The related captions attribute information of the embodiment of the invention specifically can be as shown in table 1:

Table 1

In addition, the embodiment of the invention can also be preserved the captions attribute information that obtains in real time with the form of text entry.The text of recorded and stored specifically can be as shown in table 2:

Table 2

So, in this step, concrete according to the captions attribute information that has write down, comprise start frame, the abort frame of captions and information such as position occurs, extract the caption frame that is used to cut apart, carry out the captions that merge multiframe then and cut apart, and the result of cutting apart discerned, specifically can comprise:

From the captions attribute information of record, judge that captions belong to static and still roll.

For static captions, directly extract I frames all between the initial sum abort frame and P frame, the caption area image of same position;

For roll titles,, extract all I frames of these captions and P frame respective image zone then according to rolling speed.

On definite basis, zone, with the caption area part of I frames all in the lasting frame of captions, carry out the adaptive threshold binaryzation earlier and cut apart, obtain pixel value and have only 0 and 255 bianry image; All the I frame caption area images that to cut apart again carry out " with operation " at the pixel value of same position, obtain " I frame and image "; Then captions are continued the I frames all in the frame and the caption area image of P frame, ask average pixel value, promptly ask a average image of these images, this average image is carried out binaryzation cut apart, obtain " I-P frame the average image " at the pixel value of same position; " the I frame and the image " that will obtain at last and " I-P frame the average image " two width of cloth images carry out " with operation " the design sketch that draws as final segmentation result.

For segmentation result, can be in the subtitle recognition process, adopt literal identification (OCR:OpticalCharacter Recognition) software, with the bianry image that splits with discerning.

Foregoing description as can be seen, the embodiment of the invention provides the caption information acquisition methods, undertaken detecting by luminance component image based on the captions of small echo to Frame in the video flowing, obtain the attribute information of detected captions, according to described attribute information, extract detected captions and extract, thereby accurately obtain caption information in the Frame.Because the captions based on small echo detect, and need not the zone at captions place is limited, therefore, the caption information acquisition methods that the embodiment of the invention provides can obtain the caption information in the video data under the situation that does not limit the subtitle position zone.And owing to only obtain the luminance component image of specific data frame, therefore, the caption information acquisition methods that the embodiment of the invention provides can obtain caption information more efficiently.And, the caption information acquisition methods that the embodiment of the invention provides, can also carry out the checking of caption area authenticity to the captions that obtain, and coupling and tracking operation, thereby the caption information acquisition methods that the embodiment of the invention is provided can obtain caption information more accurately, effectively promotes the performance that captions detect.In addition, the caption information acquisition methods that the embodiment of the invention provides can also carry out cutting operation to the captions that obtain, thereby be more convenient for user's use.

The embodiment of the invention also provides a kind of caption information deriving means, and as shown in Figure 4, this device comprises detection module 410, the first acquisition modules 420 and extraction module 430.Wherein:

Detection module 410 is used for the luminance component image of video flowing Frame is carried out detecting based on the captions of small echo.

First acquisition module 420 is used to obtain the attribute information of detection module 410 detected captions.

The captions attribute information that first acquisition module 420 obtains specifically can comprise essential information, scene information and the match information etc. of captions.

The related captions attribute information of the embodiment of the invention specifically can be as shown in table 1.

In addition, the embodiment of the invention can also be preserved the captions attribute information that obtains in real time with the form of text entry.The text of recorded and stored specifically can be as shown in table 2.

Extraction module 430 is used for belonging to information according to the captions that first acquisition module 420 obtains, and extracts detection module 430 detected captions.

In a specific embodiment of the caption information deriving means that the embodiment of the invention provides, as shown in Figure 5, this device specifically can also comprise second acquisition module 440, is used to obtain the luminance component image of specific data frame.

The detection module 410 that the embodiment of the invention is related specifically can comprise first acquiring unit 411, second acquisition unit 412, generation unit 413, determining unit 414 as shown in Figure 6.Wherein:

First acquiring unit 411 is used for the luminance component image that second acquisition module 430 obtains is carried out wavelet transformation, obtains the high-frequency sub-band texture maps of level, vertical and three directions of diagonal line.

Concrete, the luminance component image of 411 pairs of Frames of having chosen of first acquiring unit, carry out wavelet transformation, to obtain a low frequency sub-band, high-frequency sub-band with level, vertical, three directions of diagonal line, wherein, horizontal high-frequency sub-band is designated as that H, vertical high frequency subband are designated as V, the diagonal line high-frequency sub-band is designated as D.

Then, the coefficient of the high-frequency sub-band of the level obtained, vertical and three directions of diagonal line is asked absolute value respectively, to obtain horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps.

First acquiring unit 411 can also obtain comprehensive high-frequency sub-band texture maps (CS) in conjunction with three high-frequency sub-band texture maps to obtain.

CS(i，j)＝CH(i，j)+CV(i，j)+CD(i，j)

Second acquisition unit 412, the high-frequency sub-band texture maps of the level that is used for first acquiring unit 411 is obtained, vertical and three directions of diagonal line is obtained Frame captions dot image (TextPnt).

Second acquisition unit 412 is concrete by following operation, obtains the captions dot image of Frame:

MAPH_ORG (i, j) = \{\begin{matrix} 1, & CH (i, j) &GreaterEqual; TH \\ 0, & CH (i, j) < TH \end{matrix}

TH = \{\begin{matrix} 50, & MH * 5 &GreaterEqual; 50 \\ MH * 5, & 50 > MH * 5 > 18 \\ 18, & MH * 5 \leq 18 \end{matrix}

Generation unit 413 is used for the captions dot image obtained according to second acquisition unit 412, generates the caption area image.

Generation unit 413 specifically can generate the caption area image by following operation:

At first, closed operation and the opening operation that the captions dot image that has generated is carried out horizontal direction respectively obtains horizontal image (Verlmg).

Wherein, the structural element of closed operation can be complete " 1 " matrix of 20*1, and the structural element of opening operation can be complete " 1 " matrix of 1*2, and certainly, the structural element that closed operation and opening operation are adopted can be arranged according to actual needs flexibly;

Determining unit 414, the bar number and the caption area positional information of the caption area image captions that are used for determining that generation unit 413 generates.

Determining unit 414 specifically can be determined the bar number and the caption area positional information of captions in the caption area image by following operation:

In another specific embodiment of the detection module 410 that the embodiment of the invention provides, detection module 410 further can be as shown in Figure 7, can also comprise detecting unit 415, be used for whether the caption areas that determining unit 414 is determined are belonged to detection for real caption area.

When a caption area is true caption area, trough in the projection on the corresponding comprehensive subband texture maps, and being evenly distributed of the trough of the low frequency component image projection behind the wavelet transformation.The length scale that uniform measure is a trough is no more than crest, and the variance of trough is less.

First acquisition module 420 that the embodiment of the invention provides specifically can comprise judging unit 421, the first determining units 422 and second determining unit 423 as shown in Figure 8.Wherein:

Whether judging unit 421 is used to judge the current I frame at detection module 410 detected captions places, mate with the last I frame of current I frame.

Judging unit 421 is carried out the condition of judging and specifically can be comprised: whether the caption strips number in last I frame and the current I frame is zero.

If in last I frame and the current I frame, there is the caption strips number of an I frame non-vanishing, then judging unit 421 needs to carry out the decision operation of whether mating.

Need to prove that the Rule of judgment of judging unit 421 is not limited in above-mentioned condition, can replenish and adjust according to the needs of practical application.

Whether judging unit 421 can pass through the sampling matching method, judges the current I frame at detection module 410 detected captions places, mate with the last I frame of current I frame.

Promptly calculate in the current I frame (the minimum average B configuration absolute error (MAD:Mean AbsoluteDifference) in the coupling of sliding of 1≤q≤n) of any captions q who did not mate among the captions p to be matched and next I frame, from n bar captions couplings, choose MAD value minimum then, as the optimum matching captions, and judge further whether this minimum MAD satisfies the least commitment threshold value.

And

And the minimum value of the right frame

Specifically can calculate and obtain by following formula:

MAD (y, q) = \frac{1}{(Rpq - Lpq)} Σ_{x = Lpq}^{Rpq} | IP (cy, x) - IC (y, x) |, y &Element; [U_{IC}^{q}, D_{IC}^{q}]

y_{q} = \arg \min_{y} {MAD (y, q)}

q_{0} = \arg \min_{q} {MAD (q)}

If in best match position

Minimum value with following frame

MAD (q ₀)≤MAD _ThThen think to mate captions.

Judging unit triggers first determining unit 422 after determining coupling.

First determining unit 422 is used for when the judged result of judging unit 421 is coupling, and the matching speed that the relative position difference of mating according to captions is calculated determines that detected captions are dynamic title or static captions.

Concrete, the matching speed that first determining unit 422 can be calculated according to the relative position difference from the captions coupling is divided into two types of static captions and roll titless.

If the captions of coupling are carried out the invariant position in the Frame that captions detect then are judged as static captions at two, otherwise are judged as roll titles.

Second determining unit 423 is used for when first determining unit 422 determines that captions are dynamic title, according to the matching speed of dynamic title, and the position of present frame in dynamic title, determines the start frame and the abort frame of dynamic title; When first determining unit 422 determines that captions are static captions, extract the direct current lines in the static captions, and the direct current lines are carried out matching operation, determine the start frame and the abort frame of static captions.

If roll titles, 423 positions of second determining unit according to roll titles place in matching speed and the present frame, determine that a certain frame of this captions frame before present frame enters image frame just, and a certain frame after present frame just exceeds the Frame of the pairing frame number of image frame scope, as frame and abort frame occurring.

If static captions, image sets (the GOP:group of pictures: video flowing image sets) at 423 visits of second determining unit former frame place, and the luminance component image of every frame wherein carried out decode operation, obtain its caption area direct current (DC) image simultaneously, calculating is in this GOP, the mean absolute error MAD value of caption area DC image is determined the appearance frame and the abort frame of static captions according to the MAD value.

MADDC (i) = \frac{1}{L} Σ_{dcx = 1}^{L} | DC (dcy, dcx, IC) - DC (dcy, dcx, i) | IP \leq i \leq IC

The extraction module 430 that the real-time example of the present invention provides specifically can comprise extracting unit 431, cutting unit 432 and recognition unit 433 as shown in Figure 9.Wherein:

Extracting unit 431 is used for start frame, the abort frame according to captions and positional information occurs, extracts the caption frame that is used to cut apart in the captions.

Cutting unit 432, the caption area of the caption frame correspondence that is used for determining that extracting unit 431 extracts carries out binaryzation to described caption area and cuts apart, and obtains bianry image.

Concrete, cutting unit 432 is concrete according to the captions attribute information that has write down, and comprises start frame, the abort frame of captions and information such as position occurs, the caption frame that extraction is used to cut apart, carry out the captions that merge multiframe then and cut apart, and the result of cutting apart is discerned, specifically can comprise:

Recognition unit 433 is used to discern the bianry image that cutting unit 432 obtains, and extracts captions.

Concrete, recognition unit 433 can adopt literal identification (OCR:Optical CharacterRecognition) software, to the bianry image that splits with discerning extraction captions wherein.

Foregoing description as can be seen, the embodiment of the invention provides the caption information deriving means, undertaken detecting by luminance component image, and detected captions are mated and follow the tracks of operation based on the captions of small echo to Frame in the video flowing, thus the caption information of definite this Frame exactly.Because the captions based on small echo detect, and need not the zone at captions place is limited, therefore, the caption information deriving means that the embodiment of the invention provides can obtain the caption information in the video data under the situation that does not limit the subtitle position zone.And, owing to only obtain the luminance component image of part specific data frame, and the captions that obtain are carried out the checking of caption area authenticity, and coupling and tracking operation, thereby the caption information deriving means that the embodiment of the invention is provided can obtain caption information faster, accurately, effectively promotes the performance that captions detect.In addition, the caption information deriving means that the embodiment of the invention provides can also carry out cutting operation to the captions that obtain, thereby be more convenient for user's use.

Need to prove that related formula or numerical value among the invention described above embodiment does not play any limitations affect for the protection domain of the embodiment of the invention, when adopting other wavelet transformations, coupling tracking technique means, can carry out corresponding conversion fully.

Through the above description of the embodiments, those skilled in the art can be well understood to the present invention and can realize by the mode that software adds essential hardware platform, can certainly all implement, but the former is better embodiment under a lot of situation by hardware.Based on such understanding, all or part of can the embodying that technical scheme of the present invention contributes to background technology with the form of software product, this computer software product can be stored in the storage medium, as ROM/RAM, magnetic disc, CD etc., comprise that some instructions are with so that a computer equipment (can be a personal computer, server, the perhaps network equipment etc.) carry out the described method of some part of each embodiment of the present invention or embodiment.

The above; only for the preferable embodiment of the present invention, but protection scope of the present invention is not limited thereto, and anyly is familiar with those skilled in the art in the technical scope that the present invention discloses; the variation that can expect easily or replacement all should be encompassed within protection scope of the present invention.Therefore, protection scope of the present invention should be as the criterion with the protection domain of claim.

Claims

1. a video caption information getting method is characterized in that, comprising:

Obtain the attribute information of detected captions;

According to described attribute information, extract detected captions.

2. method according to claim 1 is characterized in that, described method also comprised before the luminance component image to Frame carries out detecting based on the captions of small echo: the luminance component image that obtains the specific data frame.

3. method according to claim 1 is characterized in that, described luminance component image to Frame carries out comprising based on the captions detection of small echo:

Luminance component image to Frame carries out wavelet transformation, obtains horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps;

According to described horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps, obtain the captions dot image of Frame;

Captions dot image by described Frame generates the caption area image;

Determine the bar number and the caption area positional information of captions in the described caption area image.

4. method according to claim 3 is characterized in that, described luminance component image to Frame carries out wavelet transformation, and the horizontal high-frequency sub-band texture maps of acquisition, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps comprise:

Luminance component image to Frame carries out wavelet transformation, generates horizontal subband, vertical subband and diagonal line subband;

The coefficient of described horizontal subband, vertical subband and diagonal line subband is asked absolute value respectively, obtain horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps.

5. method according to claim 3 is characterized in that, described according to described horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps, the captions dot image of obtaining Frame comprises:

Described horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps are carried out captions point respectively detect, the initial captions dot image of generation level, vertical and three directions of diagonal line;

Initial captions dot image to described three directions is removed noise processed respectively, obtains the final captions dot image of three directions;

The final captions dot image of described three directions is sought common ground, obtain the captions dot image of described Frame.

6. method according to claim 3 is characterized in that, described captions dot image by described Frame generates the caption area image and comprises:

The closed operation and the opening operation that described captions dot image are carried out horizontal direction respectively obtain horizontal image, and closed operation and the opening operation that described captions dot image is carried out vertical direction respectively obtained vertical image;

The horizontal image and the vertical image that obtain are asked the union operation, obtain comprising the maximum point set image of all caption areas;

Described maximum point set image is carried out closed operation, obtain the caption area image.

7. method according to claim 6 is characterized in that, the bar number and the caption area positional information of captions comprise in described definite described caption area image:

Distinguish horizontal caption area and vertical caption area in the described caption area image;

By the coordinate position of the upper and lower, left and right pixel of horizontal caption area in described horizontal image, determine upper side frame, lower frame, the left frame of described horizontal caption area posting in horizontal image, the position of left frame; By the coordinate position of the upper and lower, left and right pixel of vertical caption area in described vertical image, determine upper side frame, lower frame, the left frame of described vertical caption area posting in vertical image, the position of left frame;

Corresponding comprehensive high-frequency sub-band texture maps The corresponding area is carried out horizontal projection and vertical projection respectively in corresponding comprehensive high-frequency sub-band texture maps The corresponding area and the described vertical caption area posting in described horizontal caption area posting, determine the peak valley information of drop shadow curve, and, determine the upper side frame and the lower frame position of caption strips number in the described caption area and captions according to described peak valley information.

8. according to each described method of claim 1 to 7, it is characterized in that the attribute information of described captions comprises the start frame and the abort frame of described captions, and positional information occurs.

9. method according to claim 8 is characterized in that, described start frame and the abort frame that obtains detected captions comprises:

Judge the current I frame at detected captions place, whether mate with the last I frame of described current I frame;

If coupling, then the matching speed that is calculated according to the relative position difference of captions couplings determines that described captions are dynamic title or static captions;

If described captions are dynamic title, then according to the matching speed of described dynamic title, and the position of present frame in dynamic title, determine the start frame and the abort frame of described dynamic title;

If described captions are static captions, then extract the direct current lines in the described static captions, and described direct current lines are carried out matching operation, determine the start frame and the abort frame of described static captions.

10. method according to claim 1 is characterized in that, and is described according to described attribute information, extracts detected captions and comprises:

According to start frame, the abort frame of described captions and positional information occurs, extract the caption frame that is used to cut apart in the described captions;

Determine the caption area of the caption frame correspondence of described extraction, described caption area is carried out binaryzation cut apart, obtain bianry image;

Discern described bianry image, obtain described captions.

11. a video caption information acquisition device is characterized in that, comprising:

12. device according to claim 11 is characterized in that, described device also comprises:

Second acquisition module is used to obtain the luminance component image of specific data frame.

13. device according to claim 11 is characterized in that, described detection module comprises:

First acquiring unit carries out wavelet transformation to the luminance component image of Frame, obtains horizontal high-frequency sub-band texture maps, vertical high frequency subband texture maps and diagonal line high-frequency sub-band texture maps;

Second acquisition unit is used for the described level of obtaining according to described first acquiring unit, vertical and diagonal line high-frequency sub-band texture maps, obtains the captions dot image of Frame;

Generation unit is used for the captions dot image of the described Frame that obtains according to described second acquisition unit, generates the caption area image;

Determining unit, the bar number and the caption area positional information of the caption area image captions that are used for determining that described generation unit generates.

14. device according to claim 13 is characterized in that, described detection module also comprises:

Detecting unit, whether the caption area that is used for described determining unit is determined is the detection of true caption area.

15. device according to claim 11 is characterized in that, described first acquisition module comprises:

Whether judging unit is used to judge the current I frame at the detected captions of described detection module place, mate with the last I frame of described current I frame;

First determining unit is used for when the judged result of described judging unit is coupling, and the matching speed that the relative position difference of mating according to captions is calculated determines that described captions are dynamic title or static captions;

Second determining unit is used for when described captions are dynamic title, according to the matching speed of described dynamic title, and the position of present frame in dynamic title, determines the start frame and the abort frame of described dynamic title; When described captions are static captions, extract the direct current lines in the described static captions, and described direct current lines are carried out matching operation, determine the start frame and the abort frame of described static captions.

16. device according to claim 11 is characterized in that, described extraction module comprises:

Extracting unit is used for start frame, the abort frame according to described captions and positional information occurs, extracts the caption frame that is used to cut apart in the described captions;

Cutting unit, the caption area of the caption frame correspondence that is used for determining that described extracting unit extracts carries out binaryzation to described caption area and cuts apart, and obtains bianry image;

Recognition unit is used to discern the bianry image that described cutting unit obtains, and extracts described captions.