Embodiment
An embodiment of the invention provides a video caption information acquisition method. As shown in Figure 1, the method performs wavelet-based caption detection on the luminance component image of a data frame, obtains the attribute information of the detected captions, and extracts the detected captions according to the attribute information, thereby accurately obtaining the caption information in the data frame. Because the wavelet-based caption detection does not require the region where the captions are located to be restricted, the embodiment of the invention can accurately obtain the caption information in the video data without limiting the caption position region.
A specific embodiment of the video caption information acquisition method provided by the embodiment of the invention is shown in Figure 2. This embodiment may specifically comprise:
Step 21: obtain the luminance component image of a specified data frame from the video data stream.
In order to speed up the acquisition of caption information, the embodiment of the invention may decode only specified data frames from the video data stream and obtain the luminance component images of those specified data frames.
For example, only the intra-coded frames (I frames) with odd (or even) frame numbers are decoded (other types of video frames, such as predictive-coded frames, i.e. P frames, may also be used), the luminance component image of each such I frame is obtained, and the chrominance components of the I frame as well as the other frames are quickly skipped, thereby speeding up the acquisition of caption information.
It should be noted that the embodiment of the invention does not limit the compression format of the video data stream.
Step 22: perform wavelet-based caption detection on the luminance component image of the selected data frame.
Specifically, wavelet-based caption detection is applied in this step to the luminance component image of the selected data frame.
In a specific embodiment, a concrete implementation of this step is shown in Figure 3 and may comprise:
Step 221: perform a wavelet transform on the luminance component image of the data frame to obtain a horizontal high-frequency sub-band texture map, a vertical high-frequency sub-band texture map and a diagonal high-frequency sub-band texture map.
The wavelet transform involved in the embodiment of the invention may specifically be the Haar wavelet transform, the Mexican hat wavelet transform, the 9-7 wavelet transform, the 5-3 wavelet transform, and so on.
In this step, the wavelet transform is applied to the luminance component image of the selected data frame to obtain one low-frequency sub-band and high-frequency sub-bands in three directions, horizontal, vertical and diagonal, where the horizontal high-frequency sub-band may be denoted H, the vertical high-frequency sub-band V, and the diagonal high-frequency sub-band D.
The absolute values of the coefficients of the three high-frequency sub-bands H, V and D generated by the wavelet transform are taken respectively, yielding the horizontal high-frequency sub-band texture map (CH), the vertical high-frequency sub-band texture map (CV) and the diagonal high-frequency sub-band texture map (CD).
In this step, the three high-frequency sub-band texture maps (CH, CV, CD) may also be combined to obtain a comprehensive high-frequency sub-band texture map (CS).
The value of each point in the comprehensive high-frequency sub-band texture map can be obtained by the following formula:
CS(i,j)=CH(i,j)+CV(i,j)+CD(i,j)
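For illustration only (it does not form part of the claimed method), the following sketch shows how the texture maps described above could be computed in Python, assuming the PyWavelets library and a single-level Haar decomposition; the function name subband_texture_maps is introduced only for this example:

import numpy as np
import pywt

def subband_texture_maps(luma):
    # single-level 2-D wavelet decomposition of the luminance component image;
    # 'haar' is one of the wavelets named above; other wavelets supported by the
    # library could be substituted
    _low, (h_band, v_band, d_band) = pywt.dwt2(np.asarray(luma, dtype=np.float64), 'haar')
    ch = np.abs(h_band)   # horizontal high-frequency sub-band texture map CH
    cv = np.abs(v_band)   # vertical high-frequency sub-band texture map CV
    cd = np.abs(d_band)   # diagonal high-frequency sub-band texture map CD
    cs = ch + cv + cd     # comprehensive texture map, CS(i,j) = CH(i,j) + CV(i,j) + CD(i,j)
    return ch, cv, cd, cs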
Step 222: obtain the caption point image (TextPnt) of the data frame from the horizontal, vertical and diagonal high-frequency sub-band texture maps.
In a specific embodiment, this step may comprise the following stages:
First, an initial caption point image is generated from each high-frequency sub-band texture map.
Taking the horizontal high-frequency sub-band texture map as an example, caption point detection is performed on it to obtain the initial caption point image of the horizontal high-frequency sub-band (MAPH_ORG).
The value of this initial caption point image at coordinate (i, j) is obtained by comparing the texture value at that point with a threshold TH: a value of "0" denotes background and a value of "1" denotes an initial caption point.
The threshold TH is derived from MH, the mean texture strength of the horizontal high-frequency sub-band texture map.
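The exact expression for the threshold TH is not reproduced above; purely as an illustration, the sketch below assumes TH is proportional to the mean texture strength MH, with a hypothetical factor beta introduced only for this example:

import numpy as np

def initial_caption_points(ch, beta=1.5):
    # MH: mean texture strength of the horizontal high-frequency sub-band texture map
    mh = ch.mean()
    th = beta * mh                        # assumed form of TH (illustrative only)
    # "1" marks an initial caption point, "0" marks background
    return (ch > th).astype(np.uint8)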
Then, noise removal is applied to the initial caption point image of the horizontal high-frequency sub-band to obtain the final caption point image in the horizontal direction (MAPH).
The noise removal involved in the embodiment of the invention may adopt mature processing schemes such as overlapping sliding-block filtering; the embodiment of the invention does not limit this.
Next, similar processing is applied to the vertical and diagonal high-frequency sub-band texture maps to obtain the initial caption point image of the vertical sub-band (MAPV_ORG) and that of the diagonal sub-band (MAPD_ORG), and noise removal is applied to each of them to obtain the final caption point image in the vertical direction (MAPV) and that in the diagonal direction (MAPD).
Finally, the intersection of the final caption point images in the three directions (MAPH, MAPV, MAPD) is taken to obtain the caption point image (TextPnt) of the data frame.
It should be noted that, in the embodiment of the invention, the removal of noise points from an initial caption point image (MAP_ORG) in the course of obtaining the caption region may be implemented by the following program:
% h and w denote the height and width of the sub-band image, respectively
block = 4;                  % side length of the square block
dis = 3;                    % offset (stride) of each block shift
h_num = floor(h / dis);     % number of block shifts in the vertical direction
w_num = floor(w / dis);     % number of block shifts in the horizontal direction
MAP = MAPH_ORG;
for k = 1:h_num
    for l = 1:w_num
        if ((k-1)*dis + 1 + block > h) || ((l-1)*dis + 1 + block > w)
            continue;       % the block has moved beyond the image border, skip it
        else
            startH = (k-1)*dis + 1;
            endH   = startH + block - 1;
            startW = (l-1)*dis + 1;
            endW   = startW + block - 1;
            % count the number of caption points inside the current block
            num = sum(sum(MAPH_ORG(startH:endH, startW:endW)));
            if num < (block*block/2)
                % fewer than half of the block's pixels are caption points:
                % they are regarded as noise points and the block is cleared
                MAP(startH:endH, startW:endW) = 0;
            else
                % otherwise the caption points in this block are genuine caption points
                MAP(startH:endH, startW:endW) = MAPH_ORG(startH:endH, startW:endW);
            end
        end
    end
end
It should be understood that the above example is merely illustrative and does not limit the protection scope of the embodiment of the invention in any way.
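For reference only, the same overlapping sliding-block noise removal and the final intersection of the three directional maps can be sketched in Python/NumPy as follows; the block size, stride and majority criterion follow the program above, and this is merely one possible re-implementation:

import numpy as np

def remove_noise_points(map_org, block=4, dis=3):
    h, w = map_org.shape
    cleaned = np.zeros_like(map_org)
    for top in range(0, h - block + 1, dis):
        for left in range(0, w - block + 1, dis):
            sub = map_org[top:top + block, left:left + block]
            # keep the block only if at least half of its pixels are caption points
            if sub.sum() >= (block * block) / 2:
                cleaned[top:top + block, left:left + block] = sub
    return cleaned

def caption_point_image(maph_org, mapv_org, mapd_org):
    maph = remove_noise_points(maph_org)
    mapv = remove_noise_points(mapv_org)
    mapd = remove_noise_points(mapd_org)
    # TextPnt: intersection of the final caption point images of the three directions
    return maph & mapv & mapd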
Step 223: generate a caption region image (TextArea) from the caption point image of the data frame.
In a specific embodiment, this step may comprise the following stages:
First, a closing operation and an opening operation in the horizontal direction are applied to the obtained caption point image to obtain the horizontal image (VerImg).
The structuring element of the closing operation may be an all-"1" matrix of size 20*1, and that of the opening operation an all-"1" matrix of size 1*2; of course, the structuring elements used for the closing and opening operations may be set flexibly according to actual needs.
Then, a closing operation and an opening operation in the vertical direction are applied to the caption point image to obtain the vertical image (HorImg).
Likewise, the structuring element of the closing operation may be an all-"1" matrix of size 1*20, and that of the opening operation an all-"1" matrix of size 2*1.
Then, the union of the obtained horizontal image and vertical image is taken, i.e. a point-by-point logical OR of the two images, to obtain a maximum point set image (Img) containing all caption regions.
Next, a closing operation is applied to the maximum point set image to obtain the caption region image.
The structuring element of this closing operation may be a 6*6 all-"1" matrix, or another matrix.
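A minimal sketch of step 223, assuming SciPy's binary morphology routines and the structuring-element sizes given above (which, as noted, may be adjusted):

import numpy as np
from scipy import ndimage

def caption_area_image(text_pnt):
    pts = text_pnt.astype(bool)
    # closing then opening in the horizontal direction -> horizontal image (VerImg)
    ver_img = ndimage.binary_opening(
        ndimage.binary_closing(pts, structure=np.ones((20, 1))),
        structure=np.ones((1, 2)))
    # closing then opening in the vertical direction -> vertical image (HorImg)
    hor_img = ndimage.binary_opening(
        ndimage.binary_closing(pts, structure=np.ones((1, 20))),
        structure=np.ones((2, 1)))
    # union of the two images: the maximum point set image Img
    img = ver_img | hor_img
    # a final closing yields the caption region image TextArea
    return ndimage.binary_closing(img, structure=np.ones((6, 6)))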
Step 224: determine the number of captions and the caption region position information in the caption region image.
In a specific embodiment, this step may comprise the following stages:
First, each caption region in the caption region image is examined to determine whether its captions are arranged horizontally or vertically.
The distinction is made according to the relative size of the height and width of the caption region: if the caption region is wider than it is high, the captions in the region are arranged horizontally; if its width is less than its height, the captions are arranged vertically.
It should be noted that the caption regions in the caption region image may be identified using the labeling method in morphology or other mature methods; the embodiment of the invention does not limit this.
For a caption region whose captions are arranged horizontally, the region corresponding to this caption region in the horizontal image is determined, and the positions of the top, bottom, left and right borders of the caption region in the horizontal image are determined from the coordinates of its uppermost, lowermost, leftmost and rightmost pixels.
For a caption region whose captions are arranged vertically, the region corresponding to this caption region in the vertical image is determined, and the positions of the top, bottom, left and right borders of the caption region in the vertical image are obtained in the same way as for horizontally arranged captions.
Then, a horizontal projection is computed over the region of the comprehensive sub-band texture map (CS) corresponding to the caption region bounding box, and from the peak-valley information of the projection curve the number of captions and the top and bottom border positions of each horizontal caption line are determined.
Specifically, the number of captions in a caption region can be determined from the number of valleys in the projection curve; this process may comprise:
A threshold is obtained by dividing the mean texture value of the comprehensive sub-band texture map by a parameter (alfa); a point of the projection curve whose value is less than this threshold is a valley. Because a valley lies exactly at the middle position between two caption lines, the number of captions in the caption region is determined by counting the valleys, namely the number of valleys plus 1. It should be noted that, in the embodiment of the invention, the value range of the parameter (alfa) may be [2, 3]; after practical verification, the recommended value is alfa = 2.6.
In addition, since the top and bottom border positions of the caption lines separated by a valley correspond respectively to the starting and ending coordinates of that valley, the top and bottom borders of each horizontal caption line in the caption region can be determined by locating the valleys.
For vertically arranged captions, a vertical projection is computed over the corresponding region of the comprehensive sub-band texture map within the caption region bounding box, and the number of captions as well as the left and right border positions of each vertical caption line are determined from the peak-valley relationship of the projection curve; the concrete implementation is the same as for horizontal captions.
Through the above operations, information such as the positions at which captions appear in the video stream can be determined.
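The valley-based analysis of step 224 can be sketched as follows for a horizontally arranged caption region; here the projection curve is taken as the row mean of the comprehensive texture map, an assumption made only so that comparing it with the mean value divided by alfa is dimensionally consistent, and the helper name split_horizontal_captions is hypothetical:

import numpy as np

def split_horizontal_captions(cs, top, bottom, left, right, alfa=2.6):
    region = cs[top:bottom, left:right]
    proj = region.mean(axis=1)            # horizontal projection curve (row means)
    th = region.mean() / alfa             # threshold: mean texture value divided by alfa
    valley = proj < th                    # rows below the threshold form valleys
    valleys, start = [], None
    for y, v in enumerate(valley):
        if v and start is None:
            start = y
        elif not v and start is not None:
            valleys.append((start, y - 1))
            start = None
    if start is not None:
        valleys.append((start, len(valley) - 1))
    num_captions = len(valleys) + 1       # number of caption lines = number of valleys + 1
    # the start/end coordinates of each valley give the borders separating caption lines
    return num_captions, [(top + s, top + e) for s, e in valleys]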
Optionally, in one embodiment, in order to improve detection accuracy, the method may further comprise:
Step 225: check whether a caption region is a genuine caption region.
Since false detections may occur in caption detection, i.e. regions that are not captions may be detected as caption regions, the confirmed caption regions need to be verified for authenticity, which can effectively improve the performance of caption detection.
Specifically, whether a detected region is a genuine caption region can be determined according to the distribution of the caption texture, the gray-level distribution and the distribution of the number of edge points.
When a caption region is a genuine caption region, the valleys in the projection on the corresponding comprehensive sub-band texture map, and the valleys in the projection of the low-frequency component image after the wavelet transform, are evenly distributed. The valleys are detected in the same way as described in step 224; the uniformity measure is that the length scale of the valleys does not exceed that of the peaks and that the variance of the valleys is small.
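The authenticity criterion is stated only qualitatively above; the sketch below encodes one possible reading of it (valley runs no longer than peak runs, and valley lengths with small variance), where the variance bound var_th is a hypothetical parameter:

import numpy as np

def looks_like_real_caption(proj, th, var_th=4.0):
    flags = proj < th                     # True inside a valley
    valley_runs, peak_runs = [], []
    run, prev = 1, flags[0]
    for v in flags[1:]:
        if v == prev:
            run += 1
        else:
            (valley_runs if prev else peak_runs).append(run)
            run, prev = 1, v
    (valley_runs if prev else peak_runs).append(run)
    if not valley_runs or not peak_runs:
        return False
    # valleys should be no longer than the peaks and of roughly uniform length
    return max(valley_runs) <= max(peak_runs) and float(np.var(valley_runs)) < var_th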
Step 23: obtain the attribute information of the detected captions.
Specifically, in this step, matching and tracking operations can be performed on the detected captions to determine the caption information.
The caption matching operation judges, from the caption detection results of the previous I frame and the current I frame, whether the detected captions match; matching captions belong to the same caption, while non-matching captions belong to different captions.
Whether two adjacent I frames on which caption detection is performed need caption matching and tracking is judged according to the numbers of caption strips detected in the two frames, distinguishing the following possible cases:
1) If the numbers of caption strips in both the previous I frame and the current I frame are 0, no matching or tracking operation is needed.
2) If the number of caption strips in the previous I frame is 0 and that in the current I frame is not 0, all the captions in the current I frame can be determined to be newly appearing captions, so matching and tracking operations are needed to determine the start frames of the captions in the current I frame.
When judging the start frame, the captions determined in the current I frame and the next I frame first need to be processed according to their matching condition. If the next I frame contains no captions, or contains captions but none of them matches the captions detected in the current I frame, the captions detected in the current I frame are rejected as false detections; otherwise caption tracking is performed on the newly appearing caption strips detected in the current I frame.
3) If the number of caption strips in the previous I frame is not 0 and that in the current I frame is 0, the caption strips concerned are disappearing caption strips, so matching and tracking operations are needed to determine their end frames.
4) If the numbers of caption strips in the previous I frame and the current I frame are both not 0, the captions in the previous I frame and the current I frame need to be matched and tracked to determine which captions in the previous I frame are matched and which have disappeared, and which captions in the current I frame are matched and which are newly appearing. For the captions that disappear between the previous I frame and the current I frame, their end frames need to be determined between the previous I frame and the current I frame; for the newly appearing caption strips in the current I frame, their appearance frames need to be determined between the previous I frame and the current I frame.
It can thus be seen that, as long as the caption number of either the previous I frame or the current I frame is non-zero, matching and tracking operations are needed.
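The four cases above reduce to a simple decision; the following purely illustrative sketch returns which processing the pair of I frames requires:

def classify_caption_transition(num_prev, num_cur):
    # caption-strip counts detected in the previous and the current I frame
    if num_prev == 0 and num_cur == 0:
        return 'no matching or tracking needed'                         # case 1)
    if num_prev == 0:
        return 'all current captions are new: determine start frames'   # case 2)
    if num_cur == 0:
        return 'captions are disappearing: determine end frames'        # case 3)
    return 'match captions, then determine start/end frames of the rest'  # case 4)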
In the embodiment of the invention, the caption matching operation can be realized by sampling matching: the minimum mean absolute difference (MAD: Mean Absolute Difference) under sliding matching is calculated between each caption to be matched in the current I frame and each caption of the next I frame (1 ≤ q ≤ n) that has not yet been matched; the caption with the smallest MAD value is then chosen from the n candidate matches as the best-matching caption, and it is further judged whether this minimum MAD satisfies a minimum constraint threshold.
Specifically, for caption q of the current I frame and caption p of the next I frame, the positions of the top, bottom, left and right borders of the captions are denoted U_ICq, D_ICq, L_ICq, R_ICq and U_IPp, D_IPp, L_IPp, R_IPp, respectively.
If both are arranged horizontally, then, within their common extent in the horizontal direction, the maximum Lpq of the left borders of caption q of the current I frame and caption p of the next I frame and the minimum Rpq of their right borders are taken. If Rpq - Lpq is less than or equal to a threshold (which may specifically be 10), the captions are considered not to match; if it is greater than the threshold, the pixel line IP(cy, Lpq:Rpq) at the center row cy of caption p of the next I frame is extracted (Round[·] denotes the rounding used to obtain cy), the matching error MAD(y, q) between it and the pixel line IC(y, Lpq:Rpq) of caption q of the current I frame at height y is obtained by sliding matching, and the best match position y0 is the position with the smallest matching error. If MAD(q, y0) ≤ MAD_Th at the best match position y0, the captions are considered to match. In the embodiment of the invention, a preferred value of the threshold is MAD_Th = 20.
If both are arranged vertically, then caption q of the current I frame and caption p of the next I frame are extracted and, within their common extent in the vertical direction, the maximum Upq of the top borders and the minimum Dpq of the bottom borders are taken. If Dpq - Upq ≤ 10, the captions are considered not to match; if it is greater than the threshold, the center pixel column IP(Upq:Dpq, cx) at the center column cx of caption p of the next I frame is extracted, the matching error MAD(x, q) between it and the pixel column IC(Upq:Dpq, x) of caption q at width x is obtained by sliding matching, and the best match position x0 is determined in a way similar to that for horizontal captions; the caption with the minimum MAD value is then selected as the best match, and if MAD(q, x0) ≤ MAD_Th at the best match position x0, the captions are considered to match.
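A minimal sketch of the sliding matching for horizontally arranged captions; ic and ip are the luminance images of the current and the next I frame, the border tuples follow the notation above, the thresholds 10 and MAD_Th = 20 are the values given in the embodiment, and taking cy as the rounded vertical center of caption p is an assumption of this example:

import numpy as np

def match_horizontal_caption(ic, ip, q_borders, p_borders, mad_th=20.0):
    # borders are (U, D, L, R) of caption q in the current I frame and of caption p
    # in the next I frame
    uq, dq, lq, rq = q_borders
    up, dp, lp, rp = p_borders
    lpq, rpq = max(lq, lp), min(rq, rp)        # common extent in the horizontal direction
    if rpq - lpq <= 10:                        # too little overlap: no match
        return False, None
    cy = int(round((up + dp) / 2.0))           # assumed: rounded center row of caption p
    ref = ip[cy, lpq:rpq].astype(np.float64)   # pixel line of caption p
    best_y, best_mad = None, np.inf
    for y in range(uq, dq + 1):                # slide over the rows of caption q
        mad = np.abs(ic[y, lpq:rpq].astype(np.float64) - ref).mean()
        if mad < best_mad:
            best_y, best_mad = y, mad
    return best_mad <= mad_th, best_y          # matched if the minimum MAD is small enough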
For matched captions, a tracking operation can be performed on them to determine the positions of the start frame and the end frame of the captions.
Specifically, according to the matching velocity calculated from the relative position difference of the caption match, the captions can be divided into two types: static captions and rolling captions. If the positions of the matched captions are unchanged in the two frames on which caption detection is performed, they are judged to be static captions; otherwise they are judged to be rolling captions.
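A sketch of this classification; the positions are the border tuples of the matched caption in the two detection frames, and the jitter tolerance tol is a hypothetical allowance not taken from the embodiment:

def classify_motion(pos_prev, pos_cur, tol=0):
    # pos_*: (U, D, L, R) borders of the matched caption in the two frames
    dy = pos_cur[0] - pos_prev[0]
    dx = pos_cur[2] - pos_prev[2]
    if abs(dx) <= tol and abs(dy) <= tol:
        return 'static', (0, 0)
    return 'rolling', (dy, dx)                 # matching velocity between the two frames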
For rolling captions, according to the matching velocity and the position of the rolling caption in the current frame, the frame a certain number of frames before the current frame at which the caption just enters the picture, and the frame a certain number of frames after the current frame at which the caption just moves out of the picture, are determined as the appearance frame and the end frame respectively.
For static captions, the group of pictures (GOP: group of pictures) of the video stream containing the previous frame is accessed, a decoding operation is performed on the luminance component image of every frame in it, the direct-current (DC) image of its caption region is obtained at the same time, the mean absolute difference (MAD) values of the caption region DC images within this GOP are calculated, and the appearance frame and the end frame of the static captions are determined from the MAD values.
In the tracking of static caption strips in the above step, the mean absolute difference of the caption region DC images within a GOP is obtained by extracting and matching the DC lines in this region, specifically as follows:
First, the frames between the previous frame and the current frame are partially decoded to obtain their DC images.
Then, the corresponding coordinate positions in the DC images are derived from the caption border positions obtained for the current frame, and the DC lines at the central block of the caption region are extracted from each of these DC images.
Next, the DC line difference value between a given frame i and the current frame is calculated.
When the DC lines are extracted, the orientation of the captions is taken into account. For horizontal captions, the DC line difference value MADDC(i) between the i-th frame and the current frame is the mean absolute difference between their corresponding DC lines, where DC(y, x, i) denotes the DC image corresponding to the i-th frame and dcy denotes the vertical center of the caption region in the DC image.
The calculation for vertically arranged captions is similar to the above.
The appearance frame or the end frame is determined by searching for an abrupt-change point on the MADDC curve, using two constraint thresholds th1 and th2 for judging the abrupt-change point; the preferred constraint thresholds selected in the embodiment of the invention are th1 = 3.5 and th2 = 9.
If, with the current frame as the center, no abrupt-change point is found within a search radius of 2 GOP lengths, this caption strip is rejected as a false detection; otherwise the data frame closest to the current frame, before or after it, is taken as the appearance frame or the end frame.
The above describes the difference value calculated for horizontal captions; the calculation for vertically arranged captions is similar.
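Neither the MADDC formula nor the abrupt-change criterion is reproduced above; the sketch below uses one assumed reading of them (MADDC(i) as the mean absolute difference of the DC lines at the caption's vertical center dcy, and an abrupt change where the curve jumps from below th1 to above th2), with th1 = 3.5 and th2 = 9 as in the embodiment:

import numpy as np

def maddc(dc_frames, cur_idx, i, dcy, left, right):
    # dc_frames[k]: DC image of frame k; dcy: vertical center of the caption region
    line_i = dc_frames[i][dcy, left:right].astype(np.float64)
    line_c = dc_frames[cur_idx][dcy, left:right].astype(np.float64)
    return np.abs(line_i - line_c).mean()      # assumed definition of MADDC(i)

def find_abrupt_point(curve, th1=3.5, th2=9.0):
    # assumed criterion: a jump from a value below th1 to a value above th2
    for i in range(1, len(curve)):
        if curve[i - 1] < th1 and curve[i] > th2:
            return i
    return None                                # no abrupt change found in this range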
Step 24: extract the detected captions according to the attribute information of the captions.
It should be noted that, in the video caption information acquisition method provided by the embodiment of the invention, the obtained caption information can be recorded in real time.
The caption information may specifically include the basic information, scene information and matching information of the captions, etc.
The basic information may specifically include the basic attribute information of the caption, detection information, etc.;
The scene information may specifically include the start frame and end frame of the caption, a flag indicating whether the caption crosses a shot, etc.;
The matching information may specifically include a flag indicating whether a match exists, the position information of the match, etc.
As for the method of judging whether a caption crosses a shot, the embodiment of the invention may adopt mature methods such as shot change detection within the interval between the data frame before the recorded start frame and the data frame after the end frame; the embodiment of the invention does not limit this.
The caption information involved in the embodiment of the invention may specifically be as shown in Table 1:
Table 1
/* structure describing the attributes of the currently active caption strip */
typedef struct ActiveTextLine {
    /* basic information */
    int frameIndex;       // current frame number
    int textPos[4];       // position information, 4-element array: left, top, right, bottom boundaries
    int rollingFlag;      // rolling/static flag: 0 - static, 1 - vertical rolling, 2 - horizontal rolling
    int verVel;           // vertical velocity: positive downwards, negative upwards
    int horVel;           // horizontal velocity: positive to the right, negative to the left
    bool direction;       // arrangement direction: 0 - horizontal, 1 - vertical
    /* scene information */
    int startFrame;       // start frame
    int startGOP;         // start GOP
    int endFrame;         // end frame
    int duration;         // length (number of frames) for which the caption appears
    bool startAbrupt;     // whether an abrupt change appears at the start frame: 0 - no, 1 - yes
    bool endAbrupt;       // whether an abrupt change appears at the end frame: 0 - no, 1 - yes
    bool crossScene;      // whether the caption crosses a shot: 0 - no, 1 - yes
    int crossPos[10];     // frame numbers at which the caption crosses shots
    /* matching information */
    bool matchFlag;       // caption match flag: 1 - a match appears in the next I frame, 0 - no match
    int matchTextPos[4];  // position of the matched caption
} ATL;
In addition, the embodiment of the invention may also save the caption information obtained in real time in the form of a text record. The text saved as the record may specifically be as shown in Table 2:
Table 2
/* attribute record file format of caption strips */
TextNumIndex: #n    // the n-th caption in the video
startFrame;         // start frame
endFrame;           // end frame
rollingFlag;        // rolling/static flag
direction;          // arrangement direction: 0 - horizontal, 1 - vertical
textPos[4];         // position information, 4-element array: left, top, right, bottom boundaries
RollingMV[2];       // caption rolling velocity, 2-element array: vertical velocity, horizontal velocity
OCRString;          // recognition result after caption extraction
Thus, in this step, according to the recorded caption information, including the start frame and end frame of the captions and information such as the appearance position, the caption frames used for segmentation are extracted, caption extraction merging multiple frames is then performed, and the segmentation result is recognized; this may specifically comprise:
From the recorded caption information, it is judged whether the captions are static or rolling.
For static captions, the caption region images at the same position in all I frames and P frames between the start frame and the end frame are extracted directly;
For rolling captions, the corresponding image regions of the captions in all I frames and P frames are extracted according to the rolling velocity.
On the basis of the determined regions, adaptive-threshold binarization segmentation is first applied to the caption region parts of all the I frames within the duration of the captions, yielding binary images whose pixel values are only 0 and 255; the segmented caption region images of all the I frames are then combined by an AND operation on the pixel values at the same positions, yielding an "I-frame AND image"; next, the caption region images of all the I frames and P frames within the duration of the captions are averaged pixel by pixel at the same positions, i.e. the average image of these images is computed, and binarization segmentation is applied to this average image, yielding an "I-P-frame average image"; finally, the obtained "I-frame AND image" and "I-P-frame average image" are combined by an AND operation, and the resulting image is taken as the final segmentation result.
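A sketch of the multi-frame fusion; OpenCV's Otsu thresholding stands in here for the adaptive-threshold binarization, which the embodiment does not fix, and i_regions and p_regions are assumed to be the 8-bit grayscale caption region crops of all I frames and P frames within the caption's duration:

import cv2
import numpy as np

def fuse_caption_frames(i_regions, p_regions):
    def binarize(img):
        _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return bw
    # "I-frame AND image": AND of the binarized caption regions of all I frames
    i_and = binarize(i_regions[0])
    for region in i_regions[1:]:
        i_and = cv2.bitwise_and(i_and, binarize(region))
    # "I-P-frame average image": binarized average of all I- and P-frame regions
    stack = np.stack(list(i_regions) + list(p_regions)).astype(np.float64)
    avg_bw = binarize(stack.mean(axis=0).astype(np.uint8))
    # final segmentation result: AND of the two images
    return cv2.bitwise_and(i_and, avg_bw)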
For the segmentation result, optical character recognition (OCR: Optical Character Recognition) software may be used in the caption recognition process to recognize the segmented binary images.
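Any OCR engine may be used; as one possibility (assuming the pytesseract wrapper and an installed Chinese language pack, neither of which is required by the embodiment):

import pytesseract
from PIL import Image

def recognize_caption(binary_image):
    # binary_image: the fused segmentation result as a 0/255 array
    return pytesseract.image_to_string(Image.fromarray(binary_image), lang='chi_sim')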
From the above description it can be seen that the embodiment of the invention provides a caption information acquisition method which performs wavelet-based caption detection on the luminance component image of a data frame in the video stream, obtains the attribute information of the detected captions, and extracts the detected captions according to the attribute information, thereby accurately obtaining the caption information in the data frame. Because the wavelet-based caption detection does not require the region where the captions are located to be restricted, the caption information acquisition method provided by the embodiment of the invention can obtain the caption information in the video data without limiting the caption position region. Moreover, because only the luminance component images of specified data frames are obtained, the method can obtain the caption information more efficiently. Furthermore, the method can also verify the authenticity of the caption regions of the obtained captions and perform matching and tracking operations, so that it can obtain the caption information more accurately and effectively improve the performance of caption detection. In addition, the caption information acquisition method provided by the embodiment of the invention can also perform a segmentation operation on the obtained captions, which makes it more convenient for users to use.
The embodiment of the invention also provides a caption information acquisition apparatus. As shown in Figure 4, the apparatus comprises a detection module 410, a first acquisition module 420 and an extraction module 430, wherein:
The detection module 410 is used for performing wavelet-based caption detection on the luminance component image of a data frame in the video stream.
The first acquisition module 420 is used for obtaining the attribute information of the captions detected by the detection module 410.
The caption information obtained by the first acquisition module 420 may specifically include the basic information, scene information and matching information of the captions, etc.
The basic information may specifically include the basic attribute information of the caption, detection information, etc.;
The scene information may specifically include the start frame and end frame of the caption, a flag indicating whether the caption crosses a shot, etc.;
The matching information may specifically include a flag indicating whether a match exists, the position information of the match, etc.
As for the method of judging whether a caption crosses a shot, the embodiment of the invention may adopt mature methods such as shot change detection within the interval between the data frame before the recorded start frame and the data frame after the end frame; the embodiment of the invention does not limit this.
The caption information involved in the embodiment of the invention may specifically be as shown in Table 1.
In addition, the embodiment of the invention may also save the caption information obtained in real time in the form of a text record. The text saved as the record may specifically be as shown in Table 2.
The extraction module 430 is used for extracting the captions detected by the detection module 410 according to the caption attribute information obtained by the first acquisition module 420.
In a specific embodiment of the caption information acquisition apparatus provided by the embodiment of the invention, as shown in Figure 5, the apparatus may further comprise a second acquisition module 440 for obtaining the luminance component image of a specified data frame.
In order to speed up the acquisition of caption information, the embodiment of the invention may decode only specified data frames from the video data stream and obtain the luminance component images of those specified data frames.
For example, only the intra-coded frames (I frames) with odd (or even) frame numbers are decoded (other types of video frames, such as predictive-coded frames, i.e. P frames, may also be used), the luminance component image of each such I frame is obtained, and the chrominance components of the I frame as well as the other frames are quickly skipped, thereby speeding up the acquisition of caption information.
It should be noted that the embodiment of the invention does not limit the compression format of the video data stream.
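The way the modules cooperate can be pictured with the following skeleton; the class and method names are hypothetical and only mirror the reference numerals of Figures 4 and 5, without implementing the modules themselves:

class CaptionInfoDevice:
    def __init__(self, detection_module, first_acquisition, extraction, second_acquisition):
        self.detection_module = detection_module      # 410: wavelet-based caption detection
        self.first_acquisition = first_acquisition    # 420: attribute information of detected captions
        self.extraction = extraction                  # 430: extracts captions from their attributes
        self.second_acquisition = second_acquisition  # 440: luminance images of specified frames

    def process(self, video_stream):
        for luma in self.second_acquisition.luminance_images(video_stream):
            captions = self.detection_module.detect(luma)
            attributes = self.first_acquisition.attributes(captions)
            yield self.extraction.extract(captions, attributes)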
The detection module 410 involved in the embodiment of the invention may specifically comprise, as shown in Figure 6, a first acquiring unit 411, a second acquiring unit 412, a generating unit 413 and a determining unit 414, wherein:
The first acquiring unit 411 is used for performing a wavelet transform on the luminance component image obtained by the second acquisition module 440 to obtain the high-frequency sub-band texture maps in the horizontal, vertical and diagonal directions.
The wavelet transform involved in the embodiment of the invention may specifically be the Haar wavelet transform, the Mexican hat wavelet transform, the 9-7 wavelet transform, the 5-3 wavelet transform, and so on.
Specifically, the first acquiring unit 411 applies the wavelet transform to the luminance component image of the selected data frame to obtain one low-frequency sub-band and high-frequency sub-bands in the horizontal, vertical and diagonal directions, where the horizontal high-frequency sub-band is denoted H, the vertical high-frequency sub-band V, and the diagonal high-frequency sub-band D.
Then, the absolute values of the coefficients of the obtained horizontal, vertical and diagonal high-frequency sub-bands are taken respectively to obtain the horizontal, vertical and diagonal high-frequency sub-band texture maps.
The first acquiring unit 411 may also combine the three obtained high-frequency sub-band texture maps to obtain a comprehensive high-frequency sub-band texture map (CS).
The value of each point in the comprehensive high-frequency sub-band texture map can be obtained by the following formula:
CS(i,j)=CH(i,j)+CV(i,j)+CD(i,j)
The second acquiring unit 412 is used for obtaining the caption point image (TextPnt) of the data frame from the horizontal, vertical and diagonal high-frequency sub-band texture maps obtained by the first acquiring unit 411.
The second acquiring unit 412 obtains the caption point image of the data frame specifically through the following operations:
First, an initial caption point image is generated from each high-frequency sub-band texture map.
Taking the horizontal high-frequency sub-band texture map as an example, caption point detection is performed on it to obtain the initial caption point image of the horizontal high-frequency sub-band (MAPH_ORG).
The value of this initial caption point image at coordinate (i, j) is obtained by comparing the texture value at that point with a threshold: a value of "0" denotes background and a value of "1" denotes an initial caption point; the threshold TH is derived from MH, the mean texture strength of the horizontal high-frequency sub-band texture map.
Then, noise removal is applied to the initial caption point image of the horizontal high-frequency sub-band to obtain the final caption point image in the horizontal direction (MAPH).
The noise removal involved in the embodiment of the invention may adopt mature processing schemes such as overlapping sliding-block filtering; the embodiment of the invention does not limit this.
Next, similar processing is applied to the vertical and diagonal high-frequency sub-band texture maps to obtain the initial caption point image of the vertical sub-band (MAPV_ORG) and that of the diagonal sub-band (MAPD_ORG), and noise removal is applied to each of them to obtain the final caption point image in the vertical direction (MAPV) and that in the diagonal direction (MAPD).
Finally, the intersection of the final caption point images in the three directions (MAPH, MAPV, MAPD) is taken to obtain the caption point image (TextPnt) of the data frame.
The generating unit 413 is used for generating the caption region image from the caption point image obtained by the second acquiring unit 412.
The generating unit 413 may specifically generate the caption region image through the following operations:
First, a closing operation and an opening operation in the horizontal direction are applied to the generated caption point image to obtain the horizontal image (VerImg).
The structuring element of the closing operation may be an all-"1" matrix of size 20*1, and that of the opening operation an all-"1" matrix of size 1*2; of course, the structuring elements used for the closing and opening operations may be set flexibly according to actual needs;
Then, a closing operation and an opening operation in the vertical direction are applied to the caption point image to obtain the vertical image (HorImg).
Likewise, the structuring element of the closing operation may be an all-"1" matrix of size 1*20, and that of the opening operation an all-"1" matrix of size 2*1;
Then, the union of the obtained horizontal image and vertical image is taken, i.e. a point-by-point logical OR of the two images, to obtain a maximum point set image (Img) containing all caption regions.
Next, a closing operation is applied to the maximum point set image to obtain the caption region image.
The structuring element of this closing operation may be a 6*6 all-"1" matrix, or another matrix.
The determining unit 414 is used for determining the number of captions and the caption region position information in the caption region image generated by the generating unit 413.
The determining unit 414 may specifically determine the number of captions and the caption region position information in the caption region image through the following operations:
First, each caption region in the caption region image is examined to determine whether its captions are arranged horizontally or vertically.
The distinction is made according to the relative size of the height and width of the caption region: if the caption region is wider than it is high, the captions in the region are arranged horizontally; if its width is less than its height, the captions are arranged vertically.
It should be noted that the caption regions in the caption region image may be identified using the labeling method in morphology or other mature methods; the embodiment of the invention does not limit this.
For a caption region whose captions are arranged horizontally, the region corresponding to this caption region in the horizontal image is determined, and the positions of the top, bottom, left and right borders of the caption region in the horizontal image are determined from the coordinates of its uppermost, lowermost, leftmost and rightmost pixels.
For a caption region whose captions are arranged vertically, the region corresponding to this caption region in the vertical image is determined, and the positions of the top, bottom, left and right borders of the caption region in the vertical image are obtained in the same way as for horizontally arranged captions.
Then, a horizontal projection is computed over the region of the comprehensive sub-band texture map (CS) corresponding to the caption region bounding box, and from the peak-valley information of the projection curve the number of captions and the top and bottom border positions of each horizontal caption line are determined.
Specifically, the number of captions in a caption region can be determined from the number of valleys in the projection curve; this process may comprise:
A threshold is obtained by dividing the mean texture value of the comprehensive sub-band texture map by a parameter (alfa); a point of the projection curve whose value is less than this threshold is a valley. Because a valley lies exactly at the middle position between two caption lines, the number of captions in the caption region is determined by counting the valleys, namely the number of valleys plus 1. It should be noted that, in the embodiment of the invention, the value range of the parameter (alfa) may be [2, 3]; after practical verification, the recommended value is alfa = 2.6.
In addition, since the top and bottom border positions of the caption lines separated by a valley correspond respectively to the starting and ending coordinates of that valley, the top and bottom borders of each horizontal caption line in the caption region can be determined by locating the valleys.
For vertically arranged captions, a vertical projection is computed over the corresponding region of the comprehensive sub-band texture map within the caption region bounding box, and the number of captions as well as the left and right border positions of each vertical caption line are determined from the peak-valley relationship of the projection curve; the concrete implementation is the same as for horizontal captions.
Through the above operations, information such as the positions at which captions appear in the video stream can be determined.
In another specific embodiment of the detection module 410 provided by the embodiment of the invention, as shown in Figure 7, the detection module 410 may further comprise a detecting unit 415 for checking whether the caption regions determined by the determining unit 414 are genuine caption regions.
Since false detections may occur in caption detection, i.e. regions that are not captions may be detected as caption regions, the confirmed caption regions need to be verified for authenticity, which can effectively improve the performance of caption detection.
Specifically, whether a detected region is a genuine caption region can be determined according to the distribution of the caption texture, the gray-level distribution and the distribution of the number of edge points.
When a caption region is a genuine caption region, the valleys in the projection on the corresponding comprehensive sub-band texture map, and the valleys in the projection of the low-frequency component image after the wavelet transform, are evenly distributed; the uniformity measure is that the length scale of the valleys does not exceed that of the peaks and that the variance of the valleys is small.
The first acquisition module 420 provided by the embodiment of the invention may specifically comprise, as shown in Figure 8, a judging unit 421, a first determining unit 422 and a second determining unit 423, wherein:
The judging unit 421 is used for judging whether the captions detected by the detection module 410 in the current I frame match those in the I frame preceding the current I frame.
The condition under which the judging unit 421 performs this judgment may specifically include: whether the caption numbers in the previous I frame and the current I frame are zero.
If the caption number of either the previous I frame or the current I frame is non-zero, the judging unit 421 needs to perform the matching decision operation.
It should be noted that the judging conditions of the judging unit 421 are not limited to the above and can be supplemented and adjusted according to the needs of practical applications.
The judging unit 421 may use a sampling matching method to judge whether the captions detected by the detection module 410 in the current I frame match those in the I frame preceding the current I frame.
That is, the minimum mean absolute difference (MAD: Mean Absolute Difference) under sliding matching is calculated between each caption to be matched in the current I frame and each caption of the next I frame (1 ≤ q ≤ n) that has not yet been matched; the caption with the smallest MAD value is then chosen from the n candidate matches as the best-matching caption, and it is further judged whether this minimum MAD satisfies a minimum constraint threshold.
Specifically, for caption q of the current I frame and caption p of the next I frame, the positions of the top, bottom, left and right borders of the captions are denoted U_ICq, D_ICq, L_ICq, R_ICq and U_IPp, D_IPp, L_IPp, R_IPp, respectively.
If both are arranged horizontally, then, within their common extent in the horizontal direction, the maximum Lpq of the left borders of caption q of the current I frame and caption p of the next I frame and the minimum Rpq of their right borders are taken. If Rpq - Lpq is less than or equal to a threshold (which may specifically be 10), the captions are considered not to match; if it is greater than the threshold, the pixel line IP(cy, Lpq:Rpq) at the center row cy of caption p of the next I frame is extracted (Round[·] denotes the rounding used to obtain cy), the matching error MAD(y, q) between it and the pixel line IC(y, Lpq:Rpq) of caption q of the current I frame at height y is obtained by sliding matching, and the best match position y0 is the position with the smallest matching error. If MAD(q, y0) ≤ MAD_Th at the best match position y0, the captions are considered to match. In the embodiment of the invention, a preferred value of the threshold is MAD_Th = 20.
If both are arranged vertically, then caption q of the current I frame and caption p of the next I frame are extracted and, within their common extent in the vertical direction, the maximum Upq of the top borders and the minimum Dpq of the bottom borders are taken. If Dpq - Upq ≤ 10, the captions are considered not to match; if it is greater than the threshold, the center pixel column IP(Upq:Dpq, cx) at the center column cx of caption p of the next I frame is extracted, the matching error MAD(x, q) between it and the pixel column IC(Upq:Dpq, x) of caption q at width x is obtained by sliding matching, and the best match position x0 is determined in a way similar to that for horizontal captions; the caption with the minimum MAD value is then selected as the best match, and if MAD(q, x0) ≤ MAD_Th at the best match position x0, the captions are considered to match.
After determining a match, the judging unit triggers the first determining unit 422.
The first determining unit 422 is used for determining, when the judgment result of the judging unit 421 is a match, whether the detected captions are rolling (dynamic) captions or static captions according to the matching velocity calculated from the relative position difference of the caption match.
Specifically, the first determining unit 422 divides captions into two types, static captions and rolling captions, according to the matching velocity calculated from the relative position difference of the caption match.
If the positions of the matched captions are unchanged in the two data frames on which caption detection is performed, they are judged to be static captions; otherwise they are judged to be rolling captions.
The second determining unit 423 is used for determining, when the first determining unit 422 determines that the captions are rolling captions, the start frame and end frame of the rolling captions according to their matching velocity and the position of the current frame within them; and, when the first determining unit 422 determines that the captions are static captions, extracting the direct-current (DC) lines of the static captions and performing a matching operation on the DC lines to determine the start frame and end frame of the static captions.
For rolling captions, the second determining unit 423 determines, according to the matching velocity and the position of the rolling caption in the current frame, the frame a certain number of frames before the current frame at which the caption just enters the picture and the frame a certain number of frames after the current frame at which it just moves out of the picture, as the appearance frame and the end frame respectively.
For static captions, the second determining unit 423 accesses the group of pictures (GOP: group of pictures) of the video stream containing the previous frame, performs a decoding operation on the luminance component image of every frame in it, obtains the DC image of its caption region at the same time, calculates the mean absolute difference (MAD) values of the caption region DC images within this GOP, and determines the appearance frame and the end frame of the static captions from the MAD values.
In the tracking of static caption strips in the above step, the mean absolute difference of the caption region DC images within a GOP is obtained by extracting and matching the DC lines in this region, specifically as follows:
First, the frames between the previous frame and the current frame are partially decoded to obtain their DC images.
Then, the corresponding coordinate positions in the DC images are derived from the caption border positions obtained for the current frame, and the DC lines at the central block of the caption region are extracted from each of these DC images.
Next, the DC line difference value between a given frame i and the current frame is calculated.
When the DC lines are extracted, the orientation of the captions is taken into account. For horizontal captions, the DC line difference value MADDC(i) between the i-th frame and the current frame is the mean absolute difference between their corresponding DC lines, where DC(y, x, i) denotes the DC image corresponding to the i-th frame and dcy denotes the vertical center of the caption region in the DC image.
The calculation for vertically arranged captions is similar to the above.
The appearance frame or the end frame is determined by searching for an abrupt-change point on the MADDC curve, using two constraint thresholds th1 and th2 for judging the abrupt-change point; the preferred constraint thresholds selected in the embodiment of the invention are th1 = 3.5 and th2 = 9.
If, with the current frame as the center, no abrupt-change point is found within a search radius of 2 GOP lengths, this caption strip is rejected as a false detection; otherwise the data frame closest to the current frame, before or after it, is taken as the appearance frame or the end frame.
The above describes the difference value calculated for horizontal captions; the calculation for vertically arranged captions is similar.
The extraction module 430 provided by the embodiment of the invention may specifically comprise, as shown in Figure 9, an extracting unit 431, a segmenting unit 432 and a recognizing unit 433, wherein:
The extracting unit 431 is used for extracting, from the captions, the caption frames used for segmentation according to the start frame, the end frame and the appearance position of the captions.
The segmenting unit 432 is used for determining the caption regions corresponding to the caption frames extracted by the extracting unit 431 and performing binarization segmentation on these caption regions to obtain binary images.
Specifically, according to the recorded caption information, including the start frame and end frame of the captions and information such as the appearance position, the segmenting unit 432 extracts the caption frames used for segmentation, then performs caption extraction merging multiple frames, and recognizes the segmentation result; this may specifically comprise:
From the recorded caption information, it is judged whether the captions are static or rolling.
For static captions, the caption region images at the same position in all I frames and P frames between the start frame and the end frame are extracted directly;
For rolling captions, the corresponding image regions of the captions in all I frames and P frames are extracted according to the rolling velocity.
On the basis of the determined regions, adaptive-threshold binarization segmentation is first applied to the caption region parts of all the I frames within the duration of the captions, yielding binary images whose pixel values are only 0 and 255; the segmented caption region images of all the I frames are then combined by an AND operation on the pixel values at the same positions, yielding an "I-frame AND image"; next, the caption region images of all the I frames and P frames within the duration of the captions are averaged pixel by pixel at the same positions, i.e. the average image of these images is computed, and binarization segmentation is applied to this average image, yielding an "I-P-frame average image"; finally, the obtained "I-frame AND image" and "I-P-frame average image" are combined by an AND operation, and the resulting image is taken as the final segmentation result.
The recognizing unit 433 is used for recognizing the binary images obtained by the segmenting unit 432 and extracting the captions.
Specifically, the recognizing unit 433 may use optical character recognition (OCR: Optical Character Recognition) software to recognize the segmented binary images and extract the captions in them.
From the above description it can be seen that the embodiment of the invention provides a caption information acquisition apparatus which performs wavelet-based caption detection on the luminance component images of data frames in the video stream and performs matching and tracking operations on the detected captions, thereby accurately determining the caption information of the data frames. Because the wavelet-based caption detection does not require the region where the captions are located to be restricted, the caption information acquisition apparatus provided by the embodiment of the invention can obtain the caption information in the video data without limiting the caption position region. Moreover, because only the luminance component images of some specified data frames are obtained, and the obtained captions undergo caption region authenticity verification as well as matching and tracking operations, the apparatus can obtain the caption information more quickly and accurately and effectively improve the performance of caption detection. In addition, the caption information acquisition apparatus provided by the embodiment of the invention can also perform a segmentation operation on the obtained captions, which makes it more convenient for users to use.
It should be noted that the formulas and values involved in the above embodiments of the invention do not limit the protection scope of the embodiments of the invention in any way; when other wavelet transforms or matching and tracking technical means are adopted, corresponding conversions can be made.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be realized by software plus a necessary hardware platform, and of course can also be implemented entirely in hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, or the part of it that contributes over the background art, can be embodied in whole or in part in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment of the present invention or in some parts of an embodiment.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or replacement that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention should be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention should be determined by the protection scope of the claims.