CN101853381B - Method and device for acquiring video subtitle information - Google Patents

Method and device for acquiring video subtitle information

Info

Publication number
CN101853381B
CN101853381B · CN101853381A · CN200910081051A
Authority
CN
China
Prior art keywords
captions
frame
image
caption
caption area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200910081051
Other languages
Chinese (zh)
Other versions
CN101853381A (en)
Inventor
杨锦春
刘贵忠
钱学明
李智
郭旦萍
姜海侠
南楠
孙力
王琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Xian Jiaotong University filed Critical Huawei Technologies Co Ltd
Priority to CN 200910081051 priority Critical patent/CN101853381B/en
Publication of CN101853381A publication Critical patent/CN101853381A/en
Application granted granted Critical
Publication of CN101853381B publication Critical patent/CN101853381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Television Systems (AREA)

Abstract

The invention relates to a method and a device for acquiring video subtitle information. The method comprises the following steps: performing wavelet-based subtitle detection on the luminance component image of a data frame in a video stream; acquiring attribute information of the detected subtitles; and extracting the detected subtitles according to the attribute information, thereby accurately acquiring the subtitle information in the data frame. Because the wavelet-based subtitle detection does not require the region in which the subtitles appear to be restricted, the embodiments of the invention can accurately acquire the subtitle information in video data without limiting the subtitle position region.

Description

Method and device for acquiring video caption information
Technical field
The present invention relates to the field of electronic application technology, and in particular to a method and a device for acquiring video caption information.
Background technology
Video captions present the content of a video program in an intuitive form; they can effectively help viewers grasp the theme of a program while watching a video and thus understand its content. In addition, the detection and recognition of video caption information can enrich text-based retrieval of video content. Effectively acquiring video caption information is therefore a necessary step.
In the course of making the present invention, the inventors found that existing caption-acquisition techniques are rather sensitive to the position at which the caption information appears in the video picture: they generally assume that the caption region is static and that the caption position is fixed in the lower middle part of the image. If the caption information is not within the specified detection range, it cannot be acquired and used properly.
Summary of the invention
The embodiments of the invention provide a method and a device for acquiring video caption information, so that the caption information in video data can be acquired accurately without limiting the caption position region.
An embodiment of the invention provides a method for acquiring video caption information, comprising:
performing wavelet-based caption detection on the luminance component image of a data frame in a video stream;
acquiring attribute information of the detected captions; and
extracting the detected captions according to the attribute information.
An embodiment of the invention also provides a device for acquiring video caption information, comprising:
a detection module, configured to perform wavelet-based caption detection on the luminance component image of a data frame in a video stream;
a first acquisition module, configured to acquire attribute information of the captions detected by the detection module; and
an extraction module, configured to extract the captions detected by the detection module according to the caption attribute information acquired by the first acquisition module.
It can be seen from the technical solutions provided by the above embodiments that, in the embodiments of the invention, wavelet-based caption detection is performed on the luminance component image of a data frame in the video stream, attribute information of the detected captions is acquired, and the detected captions are extracted according to that attribute information, so that the caption information in the data frame is acquired accurately. Because the wavelet-based caption detection does not require the region in which the captions appear to be restricted, the embodiments of the invention can acquire the caption information in video data accurately without limiting the caption position region.
Brief description of the drawings
Fig. 1 is a first schematic flowchart of the method provided by an embodiment of the invention;
Fig. 2 is a second schematic flowchart of the method provided by an embodiment of the invention;
Fig. 3 is a third schematic flowchart of the method provided by an embodiment of the invention;
Fig. 4 is a first schematic structural diagram of the device provided by an embodiment of the invention;
Fig. 5 is a second schematic structural diagram of the device provided by an embodiment of the invention;
Fig. 6 is a first schematic structural diagram of the detection module provided by an embodiment of the invention;
Fig. 7 is a second schematic structural diagram of the detection module provided by an embodiment of the invention;
Fig. 8 is a schematic structural diagram of the first acquisition module provided by an embodiment of the invention;
Fig. 9 is a schematic structural diagram of the extraction module provided by an embodiment of the invention.
Detailed description of the embodiments
An embodiment of the invention provides a method for acquiring video caption information. As shown in Fig. 1, the method performs wavelet-based caption detection on the luminance component image of a data frame, acquires attribute information of the detected captions, and extracts the detected captions according to that attribute information, so that the caption information in the data frame is acquired accurately. Because the wavelet-based caption detection does not require the region in which the captions appear to be restricted, the embodiment can acquire the caption information in video data accurately without limiting the caption position region.
A specific embodiment of the method for acquiring video caption information is shown in Fig. 2, and may comprise the following steps:
Step 21: obtain the luminance component image of a specified data frame from the video data stream.
To speed up the acquisition of caption information, the embodiment of the invention may decode only specified data frames from the video data stream and obtain their luminance component images.
For example, only the intra-coded bitstream of frames with odd (or even) frame numbers, that is, of I frames (other types of video frame, such as predictive-coded P frames, may also be used), is decoded and the luminance component image of each such I frame is obtained, while the chrominance components of the I frame and all other frames are skipped quickly, which accelerates the acquisition of caption information.
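For illustration only, the following Python sketch shows the idea of keeping only the luminance component of sampled frames; it uses OpenCV and decodes whole frames, which is an assumption of this example, whereas the embodiment described above decodes only the selected I-frame bitstream.
import cv2
def sample_luminance_frames(path, step=2):
    # Yield (frame_index, luminance_image) for every `step`-th frame.
    # A stand-in for decoding only I frames: here whole frames are decoded and
    # only the Y channel is kept, whereas the embodiment skips the chrominance
    # components and the non-selected frames entirely.
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # convert BGR to YCrCb and keep the luminance (Y) plane
            y = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)[:, :, 0]
            yield idx, y
        idx += 1
    cap.release()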
It should be noted that the embodiment of the invention does not limit the compression format of the video data stream.
Step 22: perform wavelet-based caption detection on the luminance component image of the chosen data frame.
Specifically, in this step wavelet-based caption detection is applied to the luminance component image of the chosen data frame.
In a specific embodiment, this step may be implemented as shown in Fig. 3 and comprise the following steps:
Step 221: perform a wavelet transform on the luminance component image of the data frame, and obtain a horizontal high-frequency sub-band texture map, a vertical high-frequency sub-band texture map and a diagonal high-frequency sub-band texture map.
The wavelet transform involved in the embodiment of the invention may specifically be the Haar wavelet transform, the Mexican hat wavelet transform, the 9-7 wavelet transform, the 5-3 wavelet transform, and so on.
In this step, the wavelet transform is applied to the luminance component image of the chosen data frame to obtain one low-frequency sub-band and high-frequency sub-bands in the horizontal, vertical and diagonal directions, where the horizontal high-frequency sub-band may be denoted H, the vertical high-frequency sub-band V, and the diagonal high-frequency sub-band D.
The absolute values of the coefficients of the three high-frequency sub-bands H, V and D generated by the wavelet transform are taken, giving the horizontal high-frequency sub-band texture map (CH), the vertical high-frequency sub-band texture map (CV) and the diagonal high-frequency sub-band texture map (CD).
This step may also combine the three high-frequency sub-band texture maps (CH, CV, CD) into a combined high-frequency sub-band texture map (CS).
The value of each point in the combined high-frequency sub-band texture map can be obtained by the following formula:
CS(i,j)=CH(i,j)+CV(i,j)+CD(i,j)
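As an illustrative sketch of step 221 (not part of the embodiment), the following Python code uses the PyWavelets library to perform one level of Haar decomposition and to form the three high-frequency texture maps and the combined map CS:
import numpy as np
import pywt
def subband_texture_maps(luma):
    # One-level Haar DWT of the luminance image; return (CH, CV, CD, CS).
    _low, (h_band, v_band, d_band) = pywt.dwt2(luma.astype(np.float64), 'haar')
    ch = np.abs(h_band)   # horizontal high-frequency texture map CH
    cv_ = np.abs(v_band)  # vertical high-frequency texture map CV
    cd = np.abs(d_band)   # diagonal high-frequency texture map CD
    cs = ch + cv_ + cd    # combined texture map CS(i,j) = CH(i,j) + CV(i,j) + CD(i,j)
    return ch, cv_, cd, cs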
Step 222: obtain the caption-point image (TextPnt) of the data frame from the horizontal, vertical and diagonal high-frequency sub-band texture maps.
In a specific embodiment, this step may comprise the following stages.
First, an initial caption-point image is generated from each high-frequency sub-band texture map.
Taking the horizontal high-frequency sub-band texture map as an example, caption-point detection is performed on it to obtain the initial caption-point image of the horizontal high-frequency sub-band (MAPH_ORG).
The value of this initial caption-point image at coordinate (i, j) is calculated according to the following formula:
MAPH_ORG(i, j) = 1 if CH(i, j) ≥ TH, and 0 if CH(i, j) < TH
It should be noted that the value "0" represents background and the value "1" represents an initial caption point, and the threshold TH in the formula may be computed as follows:
TH = 50 if MH × 5 ≥ 50; TH = MH × 5 if 18 < MH × 5 < 50; TH = 18 if MH × 5 ≤ 18
where MH is the mean texture strength of the horizontal high-frequency sub-band texture map.
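A minimal Python/NumPy sketch of this thresholding, for illustration only, is given below; the clamping of TH to the interval [18, 50] reproduces the formula above:
import numpy as np
def initial_caption_points(ch):
    # Binarize the horizontal texture map CH into MAPH_ORG (1 = initial caption point).
    mh = ch.mean()                      # mean texture strength MH
    th = np.clip(mh * 5.0, 18.0, 50.0)  # threshold TH clamped to [18, 50]
    return (ch >= th).astype(np.uint8)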
Next, noise removal is applied to the initial caption-point image of the horizontal high-frequency sub-band to obtain the final caption-point image in the horizontal direction (MAPH).
The noise removal involved in the embodiment of the invention may adopt mature processing schemes such as filtering with overlapping sliding blocks; the embodiment of the invention does not limit this.
The vertical and diagonal high-frequency sub-band texture maps are then processed with similar steps to obtain the initial caption-point images of the vertical sub-band (MAPV_ORG) and of the diagonal sub-band (MAPD_ORG), and noise removal is applied to each of them to obtain the final caption-point images in the vertical direction (MAPV) and in the diagonal direction (MAPD).
Finally, the intersection of the final caption-point images of the three directions (MAPH, MAPV, MAPD) is taken to obtain the caption-point image (TextPnt) of the data frame.
It should be noted that, in the embodiment of the invention, the removal of caption noise points from an initial caption-point image (taking MAPH_ORG as the example) may be implemented with the following program:
% h, w: height and width of the sub-band image
block = 4;             % size of the square block
dis = 3;               % offset of the block at each step
h_num = floor(h/dis);  % number of block offsets in the vertical direction
w_num = floor(w/dis);  % number of block offsets in the horizontal direction
MAP = MAPH_ORG;
for k = 1:h_num
    for l = 1:w_num
        if ((k-1)*dis+1+block > h) || ((l-1)*dis+1+block > w)
            continue;  % the block has moved outside the image border, skip it
        else
            startH = (k-1)*dis + 1;
            endH = startH + block - 1;
            startW = (l-1)*dis + 1;
            endW = startW + block - 1;
            % count the caption points inside the block
            num = TextPntNum(MAPH_ORG(startH:endH, startW:endW));
            if num < (block*block/2)
                % fewer than block*block/2 caption points: the points in this
                % block are noise points, so the whole block region is set to 0
                MAP(startH:endH, startW:endW) = 0;
            else
                % otherwise the caption points in this block are real caption points
                MAP(startH:endH, startW:endW) = MAPH_ORG(startH:endH, startW:endW);
            end
        end
    end
end
It should be understood that the above example is only illustrative and does not limit the protection scope of the embodiments of the invention in any way.
Step 223: generate the caption-region image (TextArea) from the caption-point image of the data frame.
In a specific embodiment, this step may comprise the following stages.
First, a closing operation and an opening operation in the horizontal direction are applied to the caption-point image obtained above, giving the image VerImg.
The structuring element of the closing operation may be an all-ones matrix of size 20×1 and that of the opening operation an all-ones matrix of size 1×2; of course, the structuring elements used by the closing and opening operations may be set flexibly according to actual needs.
Then, a closing operation and an opening operation in the vertical direction are applied to the caption-point image, giving the image HorImg.
Likewise, the structuring element of this closing operation may be an all-ones matrix of size 1×20 and that of the opening operation an all-ones matrix of size 2×1.
Next, the union of the two images is taken to obtain the maximal point-set image (Img) containing all caption regions; that is, Img(i, j) = 1 if VerImg(i, j) = 1 or HorImg(i, j) = 1, and Img(i, j) = 0 otherwise.
Finally, a closing operation is applied to the maximal point-set image to obtain the caption-region image.
The structuring element of this closing operation may be a 6×6 all-ones matrix, or another matrix.
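An illustrative Python sketch of step 223 using SciPy binary morphology is given below; the (rows, columns) interpretation of the structuring-element sizes is an assumption of the example:
import numpy as np
from scipy import ndimage
def caption_region_image(text_pnt):
    # Caption-point image (0/1) -> caption-region image TextArea.
    pts = text_pnt.astype(bool)
    # horizontal direction: closing with a 1x20 element, then opening with 1x2
    ver_img = ndimage.binary_opening(
        ndimage.binary_closing(pts, structure=np.ones((1, 20), bool)),
        structure=np.ones((1, 2), bool))
    # vertical direction: closing with a 20x1 element, then opening with 2x1
    hor_img = ndimage.binary_opening(
        ndimage.binary_closing(pts, structure=np.ones((20, 1), bool)),
        structure=np.ones((2, 1), bool))
    # union of the two results gives the maximal point-set image Img
    img = ver_img | hor_img
    # final closing with a 6x6 all-ones element gives the caption-region image
    return ndimage.binary_closing(img, structure=np.ones((6, 6), bool)).astype(np.uint8)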
Step 224: determine the number of captions in the caption-region image and the position information of each caption region.
In a specific embodiment, this step may comprise the following stages.
First, each caption region in the caption-region image is examined to determine whether its captions are arranged horizontally or vertically.
The distinction is made according to the relative size of the height and width of the caption region: if the caption region is wider than it is high, the captions in it are arranged horizontally; if the width is smaller than the height, the captions in it are arranged vertically.
It should be noted that the caption regions in the caption-region image may be identified with the labelling method of morphology or with other mature methods; the embodiment of the invention does not limit this.
For a caption region whose captions are arranged horizontally, the corresponding region in the image VerImg is determined, and the positions of the top, bottom, left and right borders of the caption region in VerImg are determined from the coordinates of its topmost, bottommost, leftmost and rightmost pixels.
For a caption region whose captions are arranged vertically, the corresponding region in the image HorImg is determined, and the positions of the top, bottom, left and right borders of the caption region in HorImg are obtained with the same method as for horizontally arranged captions.
Then, a horizontal projection is computed for the region of the combined sub-band texture map (CS) corresponding to the bounding box of the caption region, and from the peak-valley information of this projection curve the number of captions in the region and the top and bottom border positions of each horizontally arranged caption are determined.
Specifically, the number of captions in the caption region can be determined from the number of valleys in the projection curve, as follows.
A threshold is obtained by dividing the mean texture value of the combined sub-band texture map by a parameter (alfa); points of the projection curve below this threshold are valleys. Because the position of a valley is exactly the gap between two captions, determining the number of valleys determines the number of captions in the caption region, namely the number of valleys plus 1. It should be noted that, in embodiments of the present invention, the parameter alfa may take values in [2, 3]; after practical verification, alfa = 2.6 is recommended.
In addition, because the top and bottom border positions of the captions separated by a valley are respectively the start and end coordinates of that valley, the positions of the top and bottom borders of every horizontally arranged caption in the region can be determined by locating the valleys.
For vertically arranged captions, a vertical projection is computed for the corresponding region of the combined sub-band texture map within the bounding box of the caption region, and the number of captions as well as the left and right border positions of each vertical caption are determined from the peak-valley relation of the projection curve; the concrete implementation is the same as for horizontal captions.
Through the above operations, information such as the positions at which captions appear in the video stream can be determined.
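For illustration, the valley counting for a horizontally arranged caption region might be sketched as follows (Python/NumPy; using the row mean of CS as the projection value is an assumption of this sketch):
import numpy as np
def count_horizontal_captions(cs_region, alfa=2.6):
    # cs_region: slice of the combined texture map CS inside the region's bounding box.
    profile = cs_region.mean(axis=1)   # horizontal projection curve (row means)
    thresh = cs_region.mean() / alfa   # valley threshold: texture mean divided by alfa
    valley = profile < thresh
    # count maximal runs of valley rows; each separating valley splits two captions,
    # and the start/end rows of a valley give the borders of the adjacent captions
    n_valleys = int(np.sum(valley[1:] & ~valley[:-1]) + (1 if valley[0] else 0))
    return n_valleys + 1               # number of captions = number of valleys + 1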
Optionally, in one embodiment, in order to improve the detection accuracy, the following step may further be included:
Step 225: check whether each caption region is a real caption region.
Because false detections may occur in caption detection, so that regions that are not captions are detected as caption regions, the confirmed caption regions need to be verified; doing so can effectively improve the performance of caption detection.
Specifically, whether a detected region is a real caption region can be determined from the distribution of the caption texture, the grey-level distribution and the number of edge points.
When a caption region is a real caption region, the valleys of its projection on the corresponding combined sub-band texture map, and the valleys of the projection of the low-frequency component image after the wavelet transform, are evenly distributed. The valleys are detected with the method described in step 224; the uniformity measure is that the length of each valley does not exceed that of the peaks and the variance of the valleys is small.
Step 23: acquire the attribute information of the detected captions.
Specifically, in this step, matching and tracking operations may be performed on the detected captions to determine the caption information.
The caption matching operation judges, from the caption detection results of the previous I frame and the current I frame, whether the detected captions match; if they match, the matched captions belong to the same caption, otherwise they belong to different captions.
Whether two adjacent I frames on which caption detection has been performed need caption matching and tracking is judged from the number of caption strips detected in the two frames, according to the following possible situations:
1) if the numbers of caption strips of the previous I frame and of the current I frame are both 0, no matching or tracking operation is needed;
2) if the number of caption strips of the previous I frame is 0 and that of the current I frame is not 0, all captions of the current I frame are newly appearing captions, so matching and tracking operations are needed to determine the start frames of the captions in the current I frame.
When judging the start frame, the captions determined in the current I frame and the next I frame first need to be processed according to the caption matching situation. If the next I frame contains no captions, or contains captions but none of them matches a caption detected in the current I frame, the captions detected in the current I frame are rejected as false detections; otherwise caption tracking is performed on the newly appearing caption strips detected in the current I frame.
3) if the number of caption strips of the previous I frame is not 0 and that of the current I frame is 0, the caption strips of the previous I frame are disappearing caption strips, so matching and tracking operations are needed to determine their end frames at the current I frame.
4) if the numbers of caption strips of the previous I frame and of the current I frame are both non-zero, the captions of the previous I frame and of the current I frame need to be matched and tracked to determine which captions of the previous I frame are matched and which have disappeared, and which captions of the current I frame are matched and which are newly appearing. For the captions that disappear between the previous I frame and the current I frame, the end frame must be determined in that interval; for the newly appearing caption strips of the current I frame, the appearance frame must be determined in that interval.
It can thus be seen that matching and tracking operations are needed whenever the caption number of either the previous I frame or the current I frame is non-zero.
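The case analysis above can be summarised in the following small sketch (Python; the function and label names are illustrative only):
def match_case(prev_count, curr_count):
    # Classify the matching situation between the previous and the current I frame.
    if prev_count == 0 and curr_count == 0:
        return "no-op"            # case 1: neither frame has captions
    if prev_count == 0:
        return "all-new"          # case 2: every caption in the current frame is new; find its start frame
    if curr_count == 0:
        return "all-vanished"     # case 3: the previous captions have disappeared; find their end frames
    return "pairwise-match"       # case 4: match captions pairwise, then track matched / new / vanished ones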
In the embodiment of the invention, the caption matching operation can be realised by sliding matching: for a caption p to be matched in the next I frame, the minimum mean absolute difference under sliding matching (MAD: Mean Absolute Difference) is computed against each caption q (1 ≤ q ≤ n) of the current I frame that has not yet been matched; among the n candidate matches the caption with the smallest MAD value is chosen as the best-matching caption, and it is further judged whether this minimum MAD satisfies the minimum constraint threshold.
Specifically, for caption q of the current I frame and caption p of the next I frame, the positions of the top, bottom, left and right borders of the captions are U_IC^q, D_IC^q, L_IC^q, R_IC^q and U_IP^p, D_IP^p, L_IP^p, R_IP^p respectively.
If both captions are arranged horizontally, the common horizontal span of caption q of the current I frame and caption p of the next I frame is extracted: the maximum of the two left borders, Lpq = max{L_IP^p, L_IC^q}, and the minimum of the two right borders, Rpq = min{R_IP^p, R_IC^q}. If Rpq − Lpq is less than or equal to a threshold (which may specifically be 10), the captions are considered not to match. If it is greater than the threshold, the row of pixels IP(cy, Lpq:Rpq) at the centre cy of caption p of the next I frame is extracted (cy = round[(U_IP^p + D_IP^p)/2], where round[] denotes rounding), and sliding matching is used to compute, for caption q of the current I frame, the matching error MAD(y, q) against the pixel row IC(y, Lpq:Rpq) at height y, as well as the best match position, according to the following formulas:
MAD(y, q) = (1/(Rpq − Lpq)) · Σ_{x = Lpq..Rpq} |IP(cy, x) − IC(y, x)|, y ∈ [U_IC^q, D_IC^q]
y_q = argmin_y {MAD(y, q)}
q0 = argmin_q {MAD(q)}
If, at the best match position y_{q0}, MAD(q0) ≤ MADth, the captions are considered to match. In the embodiment of the invention, a preferred value of the threshold MADth is MADth = 20.
If both captions are arranged vertically, the common vertical span of caption q of the current I frame and caption p of the next I frame is extracted: the maximum of the two top borders, Upq = max{U_IP^p, U_IC^q}, and the minimum of the two bottom borders, Dpq = min{D_IP^p, D_IC^q}. If Dpq − Upq ≤ 10, the captions are considered not to match; if it is greater than the threshold, the column of pixels IP(Upq:Dpq, cx) at the centre cx of caption p of the next I frame is extracted (cx = round[(L_IP^p + R_IP^p)/2]), and sliding matching is used to compute, for caption q of the current I frame, the matching error MAD(x, q) against the pixel column IC(Upq:Dpq, x) at width x, as well as the best match position x0; the concrete method is similar to that for horizontally arranged captions. The caption with the smallest MAD value is then selected as the best match, and if MAD(q0) ≤ MADth at the best match position, the captions are considered to match.
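For illustration, the sliding match for horizontally arranged captions might be sketched as follows (Python/NumPy; the border tuples and default thresholds follow the values given above):
import numpy as np
def match_horizontal_caption(ip_luma, ic_luma, box_p, box_q, gap_th=10, mad_th=20.0):
    # Sliding match of caption p (next I frame) against caption q (current I frame).
    # box_* = (top, bottom, left, right) borders; ip_luma / ic_luma are the luminance
    # images of the next and current I frame. Returns (matched, best_y, best_mad).
    up, dp, lp, rp = box_p
    uq, dq, lq, rq = box_q
    lpq = max(lp, lq)                 # left end of the common horizontal span
    rpq = min(rp, rq)                 # right end of the common horizontal span
    if rpq - lpq <= gap_th:
        return False, None, None      # too little horizontal overlap: no match
    cy = int(round((up + dp) / 2.0))  # centre row of caption p
    ref = ip_luma[cy, lpq:rpq].astype(np.float64)
    best_y, best_mad = None, np.inf
    for y in range(uq, dq + 1):       # slide the reference row over caption q
        cand = ic_luma[y, lpq:rpq].astype(np.float64)
        mad = np.mean(np.abs(ref - cand))
        if mad < best_mad:
            best_mad, best_y = mad, y
    return best_mad <= mad_th, best_y, best_mad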
For matched captions, a tracking operation can be performed to determine the positions of the start frame and the end frame of the caption.
Specifically, captions can be divided into static captions and rolling captions according to the matching velocity computed from the relative position difference of the caption match. If the position of a matched caption is unchanged in the two frames on which caption detection is performed, it is judged to be a static caption; otherwise it is judged to be a rolling caption.
For a rolling caption, according to the matching velocity and the position of the rolling caption in the current frame, the frame in which the caption has just entered the picture before the current frame, and the frame after the current frame in which it has just moved out of the picture, are determined as the appearance frame and the end frame respectively.
For a static caption, the group of pictures (GOP: group of pictures, a video-stream image group) containing the previous frame is accessed, the luminance component image of every frame in it is decoded, and at the same time the direct-current (DC) image of its caption region is obtained; the mean absolute error (MAD) values of the caption-region DC images within this GOP are computed, and the appearance frame and the end frame of the static caption are determined from these MAD values.
In the tracking of a static caption strip described above, the mean absolute error of the caption-region DC images within the GOP is obtained by extracting and matching DC lines in the region, as follows.
First, the frames between the previous frame and the current frame are partially decoded to obtain their DC images.
Then, the corresponding coordinate position in the DC image is derived from the caption border positions obtained in the current frame, and the DC line at the central block of the caption region in each of these DC images is extracted.
Next, the DC-line difference value between a given frame i and the current frame is calculated.
When the DC lines are extracted, the arrangement direction of the captions must be considered. For horizontally arranged captions, the DC-line difference value MADDC(i) between frame i and the current frame can be obtained by the following formula:
MADDC(i) = (1/L) · Σ_{dcx = 1..L} |DC(dcy, dcx, IC) − DC(dcy, dcx, i)|, IP ≤ i ≤ IC
where DC(y, x, i) denotes the DC image corresponding to frame i, and dcy denotes the vertical centre of the caption region in the DC image.
The computation for vertically arranged captions is similar to the above.
The appearance frame or end frame can be determined by searching for an abrupt-change point on the MADDC curve, according to the following decision formula:
[formula not legible in the source: it marks a point of the MADDC curve as an abrupt-change point using the constraint thresholds th1 and th2]
Here th1 and th2 are the constraint thresholds for judging an abrupt-change point; the preferred values selected in the embodiment of the invention are th1 = 3.5 and th2 = 9.
If, with the current frame as the centre, no abrupt-change point is found within a search radius of 2 GOP lengths, the caption strip is rejected as a false detection; otherwise the nearest data frame before or after the current frame is taken as the appearance frame or the end frame respectively.
The formula above computes the difference values for horizontally arranged captions; the computation for vertically arranged captions is obtained with a similar method.
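A sketch of the MADDC computation and of the abrupt-change search is given below (Python/NumPy, for illustration only); since the decision formula is not legible in the source, the simple jump test against th1 and th2 used here is an assumption:
import numpy as np
def maddc_curve(dc_lines, current_line):
    # MADDC(i) for each earlier frame i: mean absolute difference between the DC line
    # of frame i and the DC line of the current frame (horizontally arranged captions).
    cur = current_line.astype(np.float64)
    return np.array([np.mean(np.abs(line.astype(np.float64) - cur)) for line in dc_lines])
def find_abrupt_change(maddc, th1=3.5, th2=9.0):
    # Return the first index whose MADDC jump suggests an abrupt change, or None.
    # Testing the frame-to-frame jump against th1 and the absolute level against th2
    # is an assumption of this sketch.
    for i in range(1, len(maddc)):
        if (maddc[i] - maddc[i - 1]) > th1 and maddc[i] > th2:
            return i
    return None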
Step 24: extract the detected captions according to the attribute information of the captions.
It should be noted that, in the method for acquiring video caption information provided by the embodiment of the invention, the acquired caption information can be recorded in real time.
The caption information may specifically comprise the basic information, scene information and matching information of the captions, and so on.
The basic information may specifically comprise the basic attribute information of the caption, detection information, and so on;
the scene information may specifically comprise the start frame and end frame of the caption, a flag indicating whether the caption crosses a shot, and so on;
the matching information may specifically comprise a flag indicating whether a match exists, the position information of the match, and so on.
As for the method for judging whether a caption crosses a shot, the embodiment of the invention may apply mature methods such as scene-change detection in the interval between the data frame before the recorded start frame and the data frame after the end frame; the embodiment of the invention does not limit this.
The caption information involved in the embodiment of the invention may specifically be as shown in Table 1:
Table 1
/* Structure describing the attributes of the currently active caption strip */
typedef struct ActiveTextLine {
    /* basic information */
    int  frameIndex;      // current frame number
    int  textPos[4];      // position (4-element array): left, top, right, bottom
    int  rollingFlag;     // rolling/static flag: 0 - static, 1 - vertical rolling, 2 - horizontal rolling
    int  verVel;          // vertical velocity: positive downwards, negative upwards
    int  horVel;          // horizontal velocity: positive to the right, negative to the left
    bool direction;       // arrangement direction: 0 - horizontal, 1 - vertical
    /* scene information */
    int  startFrame;      // start frame
    int  startGOP;        // starting GOP
    int  endFrame;        // end frame
    int  duration;        // length (number of frames) during which the caption appears
    bool startAbrupt;     // whether an abrupt change appears at the start frame: 0 - no, 1 - yes
    bool endAbrupt;       // whether an abrupt change appears at the end frame: 0 - no, 1 - yes
    bool crossScene;      // whether the caption crosses a shot: 0 - no, 1 - yes
    int  crossPos[10];    // frame numbers at which the caption crosses shots
    /* matching information */
    bool matchFlag;       // 1 - a match appears in the next I frame, 0 - no match
    int  matchTextPos[4]; // position of the matched caption
} ATL;
In addition, the embodiment of the invention may also save the caption information acquired in real time in the form of a text record. The saved text record may specifically be as shown in Table 2:
Table 2
/* Attribute record file format for caption strips */
TextNumIndex: #n   // the n-th caption strip in the video
startFrame;        // start frame
endFrame;          // end frame
rollingFlag;       // rolling/static flag
direction;         // arrangement direction: 0 - horizontal, 1 - vertical
textPos[4];        // position (4-element array): left, top, right, bottom
RollingMV[2];      // rolling velocity (2-element array): vertical velocity, horizontal velocity
OCRString;         // recognition result after caption extraction
In this step, according to the recorded caption information, which includes the start frame and end frame of the caption and information such as its appearance position, the caption frames used for segmentation are extracted; then multi-frame caption extraction with merging is performed, and the segmentation result is recognised. Specifically, this may comprise:
judging from the recorded caption information whether the caption is static or rolling;
for a static caption, directly extracting the caption-region images at the same position from all I frames and P frames between the start frame and the end frame;
for a rolling caption, extracting the corresponding image regions of the caption from all its I frames and P frames according to the rolling velocity;
on the basis of the determined regions, first applying adaptive-threshold binarization to the caption-region parts of all I frames within the duration of the caption, obtaining binary images whose pixel values are only 0 and 255; then combining the segmented caption-region images of all I frames with a pixel-wise AND operation at the same positions, obtaining the "I-frame AND image"; then averaging, pixel by pixel, the caption-region images of all I frames and P frames within the duration of the caption, that is, computing the mean image of these images, and binarizing this mean image to obtain the "I-P-frame mean image"; and finally combining the obtained "I-frame AND image" and "I-P-frame mean image" with an AND operation, the result being the final segmentation result.
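An illustrative Python sketch of this merging segmentation is given below; replacing the adaptive threshold by Otsu's method (via OpenCV) is a simplification assumed by the example:
import cv2
import numpy as np
def segment_caption_region(i_frame_regions, ip_frame_regions):
    # i_frame_regions: list of 8-bit grayscale caption-region crops from the I frames;
    # ip_frame_regions: the same crops from all I and P frames (all crops of equal size).
    def binarize(img):
        # the adaptive threshold is simplified to Otsu here (an assumption)
        _t, b = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return b
    # "I-frame AND image": a pixel stays text only if it is text in every I frame
    i_and = binarize(i_frame_regions[0])
    for region in i_frame_regions[1:]:
        i_and = cv2.bitwise_and(i_and, binarize(region))
    # "I-P-frame mean image": binarize the per-pixel mean over all I and P frames
    mean_img = np.mean(np.stack(ip_frame_regions).astype(np.float64), axis=0)
    ip_avg = binarize(mean_img.astype(np.uint8))
    # final segmentation result: AND of the two images
    return cv2.bitwise_and(i_and, ip_avg)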
For the segmentation result, optical character recognition (OCR: Optical Character Recognition) software can be applied in the caption recognition process to recognise the segmented binary image.
It can be seen from the above description that the embodiment of the invention provides a caption information acquisition method in which wavelet-based caption detection is performed on the luminance component image of a data frame in the video stream, the attribute information of the detected captions is acquired, and the detected captions are extracted according to that attribute information, so that the caption information in the data frame is acquired accurately. Because the wavelet-based caption detection does not require the region in which the captions appear to be restricted, the caption information acquisition method provided by the embodiment of the invention can acquire the caption information in video data without limiting the caption position region. Because only the luminance component images of specified data frames are obtained, the method can acquire the caption information more efficiently. Moreover, the method can verify the authenticity of the caption regions and perform matching and tracking operations on the acquired captions, so that the caption information can be acquired more accurately and the performance of caption detection is effectively improved. In addition, the method can perform a segmentation operation on the acquired captions, which is more convenient for the user.
An embodiment of the invention also provides a caption information acquisition device. As shown in Fig. 4, the device comprises a detection module 410, a first acquisition module 420 and an extraction module 430, wherein:
the detection module 410 is configured to perform wavelet-based caption detection on the luminance component image of a data frame in the video stream;
the first acquisition module 420 is configured to acquire the attribute information of the captions detected by the detection module 410.
The caption information acquired by the first acquisition module 420 may specifically comprise the basic information, scene information and matching information of the captions, and so on.
The basic information may specifically comprise the basic attribute information of the caption, detection information, and so on;
the scene information may specifically comprise the start frame and end frame of the caption, a flag indicating whether the caption crosses a shot, and so on;
the matching information may specifically comprise a flag indicating whether a match exists, the position information of the match, and so on.
As for the method for judging whether a caption crosses a shot, the embodiment of the invention may apply mature methods such as scene-change detection in the interval between the data frame before the recorded start frame and the data frame after the end frame; the embodiment of the invention does not limit this.
The caption information involved in the embodiment of the invention may specifically be as shown in Table 1.
In addition, the embodiment of the invention may also save the caption information acquired in real time in the form of a text record; the saved text record may specifically be as shown in Table 2.
The extraction module 430 is configured to extract the captions detected by the detection module 410 according to the caption attribute information acquired by the first acquisition module 420.
In a specific embodiment of the caption information acquisition device provided by the embodiment of the invention, as shown in Fig. 5, the device may further comprise a second acquisition module 440, configured to obtain the luminance component image of a specified data frame.
To speed up the acquisition of caption information, the embodiment of the invention may decode only specified data frames from the video data stream and obtain their luminance component images.
For example, only the intra-coded bitstream of frames with odd (or even) frame numbers, that is, of I frames (other types of video frame, such as predictive-coded P frames, may also be used), is decoded and the luminance component image of each such I frame is obtained, while the chrominance components of the I frame and all other frames are skipped quickly, which accelerates the acquisition of caption information.
It should be noted that the embodiment of the invention does not limit the compression format of the video data stream.
The detection module 410 involved in the embodiment of the invention may specifically comprise, as shown in Fig. 6, a first acquiring unit 411, a second acquiring unit 412, a generating unit 413 and a determining unit 414, wherein:
the first acquiring unit 411 is configured to perform a wavelet transform on the luminance component image obtained by the second acquisition module 440, obtaining the high-frequency sub-band texture maps of the horizontal, vertical and diagonal directions.
The wavelet transform involved in the embodiment of the invention may specifically be the Haar wavelet transform, the Mexican hat wavelet transform, the 9-7 wavelet transform, the 5-3 wavelet transform, and so on.
Specifically, the first acquiring unit 411 applies the wavelet transform to the luminance component image of the chosen data frame to obtain one low-frequency sub-band and high-frequency sub-bands in the horizontal, vertical and diagonal directions, where the horizontal high-frequency sub-band is denoted H, the vertical high-frequency sub-band V and the diagonal high-frequency sub-band D.
Then, the absolute values of the coefficients of the obtained horizontal, vertical and diagonal high-frequency sub-bands are taken to obtain the horizontal, vertical and diagonal high-frequency sub-band texture maps.
The first acquiring unit 411 may also combine the three obtained high-frequency sub-band texture maps into a combined high-frequency sub-band texture map (CS).
The value of each point in the combined high-frequency sub-band texture map can be obtained by the following formula:
CS(i,j)=CH(i,j)+CV(i,j)+CD(i,j)
The second acquiring unit 412 is configured to obtain the caption-point image (TextPnt) of the data frame from the horizontal, vertical and diagonal high-frequency sub-band texture maps obtained by the first acquiring unit 411.
The second acquiring unit 412 obtains the caption-point image of the data frame specifically through the following operations.
First, an initial caption-point image is generated from each high-frequency sub-band texture map.
Taking the horizontal high-frequency sub-band texture map as an example, caption-point detection is performed on it to obtain the initial caption-point image of the horizontal high-frequency sub-band (MAPH_ORG).
The value of this initial caption-point image at coordinate (i, j) is calculated according to the following formula:
MAPH_ORG(i, j) = 1 if CH(i, j) ≥ TH, and 0 if CH(i, j) < TH
It should be noted that the value "0" represents background and the value "1" represents an initial caption point, and the threshold TH in the formula may be computed as follows:
TH = 50 if MH × 5 ≥ 50; TH = MH × 5 if 18 < MH × 5 < 50; TH = 18 if MH × 5 ≤ 18
where MH is the mean texture strength of the horizontal high-frequency sub-band texture map.
Next, noise removal is applied to the initial caption-point image of the horizontal high-frequency sub-band to obtain the final caption-point image in the horizontal direction (MAPH).
The noise removal involved in the embodiment of the invention may adopt mature processing schemes such as filtering with overlapping sliding blocks; the embodiment of the invention does not limit this.
The vertical and diagonal high-frequency sub-band texture maps are then processed with similar steps to obtain the initial caption-point images of the vertical sub-band (MAPV_ORG) and of the diagonal sub-band (MAPD_ORG), and noise removal is applied to each of them to obtain the final caption-point images in the vertical direction (MAPV) and in the diagonal direction (MAPD).
Finally, the intersection of the final caption-point images of the three directions (MAPH, MAPV, MAPD) is taken to obtain the caption-point image (TextPnt) of the data frame.
The generating unit 413 is configured to generate the caption-region image from the caption-point image obtained by the second acquiring unit 412.
The generating unit 413 may specifically generate the caption-region image through the following operations.
First, a closing operation and an opening operation in the horizontal direction are applied to the generated caption-point image, giving the image VerImg.
The structuring element of the closing operation may be an all-ones matrix of size 20×1 and that of the opening operation an all-ones matrix of size 1×2; of course, the structuring elements used by the closing and opening operations may be set flexibly according to actual needs.
Then, a closing operation and an opening operation in the vertical direction are applied to the caption-point image, giving the image HorImg.
Likewise, the structuring element of this closing operation may be an all-ones matrix of size 1×20 and that of the opening operation an all-ones matrix of size 2×1.
Next, the union of the two images is taken to obtain the maximal point-set image (Img) containing all caption regions, in the same way as described above.
Then, a closing operation is applied to the maximal point-set image to obtain the caption-region image.
The structuring element of this closing operation may be a 6×6 all-ones matrix, or another matrix.
The determining unit 414 is configured to determine the number of captions in the caption-region image generated by the generating unit 413 and the position information of each caption region.
The determining unit 414 may specifically determine the number of captions in the caption-region image and the caption-region position information through the following operations.
First, each caption region in the caption-region image is examined to determine whether its captions are arranged horizontally or vertically.
The distinction is made according to the relative size of the height and width of the caption region: if the caption region is wider than it is high, the captions in it are arranged horizontally; if the width is smaller than the height, the captions in it are arranged vertically.
It should be noted that the caption regions in the caption-region image may be identified with the labelling method of morphology or with other mature methods; the embodiment of the invention does not limit this.
For a caption region whose captions are arranged horizontally, the corresponding region in the image VerImg is determined, and the positions of the top, bottom, left and right borders of the caption region in VerImg are determined from the coordinates of its topmost, bottommost, leftmost and rightmost pixels.
For a caption region whose captions are arranged vertically, the corresponding region in the image HorImg is determined, and the positions of the top, bottom, left and right borders of the caption region in HorImg are obtained with the same method as for horizontally arranged captions.
Then, a horizontal projection is computed for the region of the combined sub-band texture map (CS) corresponding to the bounding box of the caption region, and from the peak-valley information of this projection curve the number of captions in the region and the top and bottom border positions of each horizontally arranged caption are determined.
Specifically, the number of captions in the caption region can be determined from the number of valleys in the projection curve, as follows.
A threshold is obtained by dividing the mean texture value of the combined sub-band texture map by a parameter (alfa); points of the projection curve below this threshold are valleys. Because the position of a valley is exactly the gap between two captions, determining the number of valleys determines the number of captions in the caption region, namely the number of valleys plus 1. It should be noted that, in embodiments of the present invention, the parameter alfa may take values in [2, 3]; after practical verification, alfa = 2.6 is recommended.
In addition, because the top and bottom border positions of the captions separated by a valley are respectively the start and end coordinates of that valley, the positions of the top and bottom borders of every horizontally arranged caption in the region can be determined by locating the valleys.
For vertically arranged captions, a vertical projection is computed for the corresponding region of the combined sub-band texture map within the bounding box of the caption region, and the number of captions as well as the left and right border positions of each vertical caption are determined from the peak-valley relation of the projection curve; the concrete implementation is the same as for horizontal captions.
Through the above operations, information such as the positions at which captions appear in the video stream can be determined.
In another specific embodiment of the detection module 410 provided by the embodiment of the invention, the detection module 410 may further, as shown in Fig. 7, comprise a detecting unit 415, configured to check whether the caption regions determined by the determining unit 414 are real caption regions.
Because false detections may occur in caption detection, so that regions that are not captions are detected as caption regions, the confirmed caption regions need to be verified; doing so can effectively improve the performance of caption detection.
Specifically, whether a detected region is a real caption region can be determined from the distribution of the caption texture, the grey-level distribution and the number of edge points.
When a caption region is a real caption region, the valleys of its projection on the corresponding combined sub-band texture map, and the valleys of the projection of the low-frequency component image after the wavelet transform, are evenly distributed; the uniformity measure is that the length of each valley does not exceed that of the peaks and the variance of the valleys is small.
The first acquisition module 420 provided by the embodiment of the invention may specifically comprise, as shown in Fig. 8, a judging unit 421, a first determining unit 422 and a second determining unit 423, wherein:
the judging unit 421 is configured to judge whether the captions of the current I frame, in which the detection module 410 detected captions, match those of the previous I frame.
The condition under which the judging unit 421 performs the judgement may specifically comprise: whether the numbers of captions in the previous I frame and in the current I frame are zero.
If the caption number of at least one of the previous I frame and the current I frame is non-zero, the judging unit 421 needs to perform the matching judgement.
It should be noted that the judgement conditions of the judging unit 421 are not limited to the above condition and may be supplemented and adjusted according to the needs of practical applications.
The judging unit 421 may judge, by a sliding-matching method, whether the captions of the current I frame in which the detection module 410 detected captions match those of the previous I frame.
That is, for a caption p to be matched in the next I frame, the minimum mean absolute difference under sliding matching (MAD: Mean Absolute Difference) is computed against each caption q (1 ≤ q ≤ n) of the current I frame that has not yet been matched; among the n candidate matches the caption with the smallest MAD value is chosen as the best-matching caption, and it is further judged whether this minimum MAD satisfies the minimum constraint threshold.
Specifically, for caption q of the current I frame and caption p of the next I frame, the positions of the top, bottom, left and right borders of the captions are U_IC^q, D_IC^q, L_IC^q, R_IC^q and U_IP^p, D_IP^p, L_IP^p, R_IP^p respectively.
If both captions are arranged horizontally, the common horizontal span of caption q of the current I frame and caption p of the next I frame is extracted: the maximum of the two left borders, Lpq = max{L_IP^p, L_IC^q}, and the minimum of the two right borders, Rpq = min{R_IP^p, R_IC^q}. If Rpq − Lpq is less than or equal to a threshold (which may specifically be 10), the captions are considered not to match. If it is greater than the threshold, the row of pixels IP(cy, Lpq:Rpq) at the centre cy of caption p of the next I frame is extracted (cy = round[(U_IP^p + D_IP^p)/2], where round[] denotes rounding), and sliding matching is used to compute, for caption q of the current I frame, the matching error MAD(y, q) against the pixel row IC(y, Lpq:Rpq) at height y, as well as the best match position, according to the following formulas:
MAD(y, q) = (1/(Rpq − Lpq)) · Σ_{x = Lpq..Rpq} |IP(cy, x) − IC(y, x)|, y ∈ [U_IC^q, D_IC^q]
y_q = argmin_y {MAD(y, q)}
q0 = argmin_q {MAD(q)}
If, at the best match position y_{q0}, MAD(q0) ≤ MADth, the captions are considered to match. In the embodiment of the invention, a preferred value of the threshold MADth is MADth = 20.
If both captions are arranged vertically, the common vertical span of caption q of the current I frame and caption p of the next I frame is extracted: the maximum of the two top borders, Upq = max{U_IP^p, U_IC^q}, and the minimum of the two bottom borders, Dpq = min{D_IP^p, D_IC^q}. If Dpq − Upq ≤ 10, the captions are considered not to match; if it is greater than the threshold, the column of pixels IP(Upq:Dpq, cx) at the centre cx of caption p of the next I frame is extracted (cx = round[(L_IP^p + R_IP^p)/2]), and sliding matching is used to compute, for caption q of the current I frame, the matching error MAD(x, q) against the pixel column IC(Upq:Dpq, x) at width x, as well as the best match position x0; the concrete method is similar to that for horizontally arranged captions. The caption with the smallest MAD value is then selected as the best match, and if, at the best match position, MAD(q0) ≤ MADth, the captions are considered to match.
After determining a match, the judging unit triggers the first determining unit 422.
The first determining unit 422 is configured to determine, when the judgement result of the judging unit 421 is a match, whether the detected caption is a dynamic caption or a static caption according to the matching velocity computed from the relative position difference of the caption match.
Specifically, the first determining unit 422 can divide captions into static captions and rolling captions according to the matching velocity computed from the relative position difference of the caption match.
If the position of a matched caption is unchanged in the two data frames on which caption detection is performed, it is judged to be a static caption; otherwise it is judged to be a rolling caption.
The second determining unit 423 is configured to determine, when the first determining unit 422 determines that a caption is a dynamic caption, the start frame and end frame of the dynamic caption according to the matching velocity of the dynamic caption and its position in the current frame; and, when the first determining unit 422 determines that a caption is a static caption, to extract the direct-current (DC) lines of the static caption, perform a matching operation on the DC lines, and determine the start frame and end frame of the static caption.
For a rolling caption, the second determining unit 423 determines, according to the matching velocity and the position of the rolling caption in the current frame, the frame in which the caption has just entered the picture before the current frame, and the frame after the current frame in which it has just moved out of the picture, as the appearance frame and the end frame respectively.
For a static caption, the second determining unit 423 accesses the group of pictures (GOP: group of pictures, a video-stream image group) containing the previous frame, decodes the luminance component image of every frame in it and at the same time obtains the DC image of its caption region, computes the mean absolute error (MAD) values of the caption-region DC images within this GOP, and determines the appearance frame and the end frame of the static caption from these MAD values.
When tracking a static caption strip within a GOP in the above step, the mean absolute error of the caption-area DC images is obtained by extracting and matching the DC lines in that region, as follows:
First, the frames between the previous frame and the current frame are partially decoded to obtain their DC images.
Then, the corresponding coordinate positions in the DC image are derived from the caption border positions obtained in the current frame, and the DC lines at the central blocks of the caption region are extracted from each of these DC images.
Next, the DC-line difference value between a given frame i and the current frame is calculated.
The orientation of the captions must be considered when extracting the DC lines. For horizontal captions, the DC-line difference value MADDC(i) between frame i and the current frame can be obtained by the following formula:
MADDC(i) = (1/L) · Σ_{dcx=1}^{L} | DC(dcy, dcx, IC) − DC(dcy, dcx, i) |,  IP ≤ i ≤ IC
where DC(y, x, i) denotes the DC image corresponding to frame i, dcy denotes the vertical center of the caption area in the DC image, L is the length of the extracted DC line (the width of the caption area in the DC image), and i ranges from the previous frame IP to the current frame IC.
The computation for vertically arranged captions is similar to the above.
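A minimal sketch of the horizontal-caption case of the above formula is given below; it assumes that the partially decoded DC images are available as a mapping from frame index to a 2-D array of block DC coefficients, which is an assumption of the sketch rather than a requirement of the embodiment.

```python
import numpy as np

def maddc(dc_images, frame_i, frame_cur, dcy, dcx_range):
    """MADDC(i): mean absolute difference between the DC line of frame i and
    that of the current frame, for a horizontal caption.

    dc_images : dict {frame index -> 2-D DC image (one DC value per 8x8 block)}
    dcy       : row of the caption area's vertical center in the DC image
    dcx_range : (start, end) columns of the caption area in the DC image
    """
    a, b = dcx_range
    line_cur = dc_images[frame_cur][dcy, a:b].astype(np.float64)
    line_i = dc_images[frame_i][dcy, a:b].astype(np.float64)
    return float(np.mean(np.abs(line_cur - line_i)))
```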
The appearance frame or the end frame can be determined by searching for an abrupt-change (catastrophe) point on the MADDC curve, judged against two constraint thresholds.
Here th1 and th2 are the constraint thresholds for judging the catastrophe point; the preferred constraint thresholds selected in the embodiment of the present invention are th1 = 3.5 and th2 = 9.
If, taking the current frame as the center, no catastrophe point is found within a search radius of 2 GOP lengths, the caption strip is rejected as a false detection; otherwise the nearest such data frame before or after the current frame is taken as the appearance frame or the end frame respectively.
The above formula computes the difference value for horizontal captions; for vertically arranged captions the value is obtained by a similar computation.
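One possible realization of this search is sketched below; the jump test used here (the previous MADDC value at most th1 and the current value at least th2) is only an assumed stand-in for the exact catastrophe-point criterion, and all function and variable names are illustrative.

```python
def find_catastrophe_frame(maddc_values, cur_frame, gop_len, th1=3.5, th2=9.0):
    """Search for an abrupt change of the MADDC curve near the current frame.

    maddc_values : dict {frame index -> MADDC value}
    Returns the nearest frame at which an abrupt change is detected,
    or None (in which case the caption strip is rejected as a false detection).
    """
    def is_catastrophe(i):
        prev, cur = maddc_values.get(i - 1), maddc_values.get(i)
        if prev is None or cur is None:
            return False
        # assumed jump test: small error before the point, large error at it
        return prev <= th1 and cur >= th2

    for offset in range(1, 2 * gop_len + 1):       # nearest frames first
        for i in (cur_frame - offset, cur_frame + offset):
            if is_catastrophe(i):
                return i
    return None
```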
The extraction module 430 provided by the embodiment of the present invention may specifically comprise an extracting unit 431, a cutting unit 432 and a recognition unit 433, as shown in Figure 9, where:
The extracting unit 431 is configured to extract, according to the start frame, the end frame and the appearance position information of the captions, the caption frames in the captions to be used for segmentation.
The cutting unit 432 is configured to determine the caption area corresponding to the caption frames extracted by the extracting unit 431, and to perform binarization segmentation on the caption area to obtain a binary image.
Specifically, the cutting unit 432 extracts the caption frames for segmentation according to the recorded caption information, including the start frame, the end frame and the appearance position of the captions, then performs caption extraction that fuses multiple frames, and has the segmentation result recognized; this may specifically comprise:
From the recorded caption information, judging whether the captions are static or rolling.
For static captions, directly extracting the caption-area images at the same position from all I frames and P frames between the start frame and the end frame;
For rolling captions, extracting the corresponding image regions of these captions from all I frames and P frames according to the rolling speed.
On the basis of the determined region, first performing adaptive-threshold binarization segmentation on the caption-area parts of all I frames within the caption's duration, obtaining binary images whose pixel values are only 0 and 255; then performing an AND operation, at each pixel position, on the segmented caption-area images of all the I frames, obtaining the "I-frame AND image"; then averaging, at each pixel position, the caption-area images of all I frames and P frames within the caption's duration, i.e. computing the average image of these images, and performing binarization segmentation on this average image to obtain the "I-P-frame average image"; finally performing an AND operation on the obtained "I-frame AND image" and "I-P-frame average image", and taking the resulting image as the final segmentation result.
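The multi-frame fusion just described can be sketched as follows; Otsu's method stands in for the adaptive-threshold binarization and caption strokes are assumed to be white (255), both of which are assumptions of the sketch rather than choices fixed by the embodiment.

```python
import numpy as np

def binarize(img):
    """Adaptive-threshold binarization (Otsu's method used here as a stand-in).
    The input is assumed to be an 8-bit grayscale caption-area image."""
    img = np.asarray(img, dtype=np.uint8)
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    total, best_t, best_var = img.size, 0, -1.0
    sum_all = float(np.dot(np.arange(256), hist))
    w0 = sum0 = 0.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0, m1 = sum0 / w0, (sum_all - sum0) / (total - w0)
        var = w0 * (total - w0) * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return np.where(img > best_t, 255, 0).astype(np.uint8)

def fuse_caption_masks(i_frame_regions, ip_frame_regions):
    """i_frame_regions : caption-area images from the I frames only
       ip_frame_regions: caption-area images from all I and P frames"""
    # "I-frame AND image": AND of the binarized I-frame caption areas
    i_and = np.full_like(i_frame_regions[0], 255, dtype=np.uint8)
    for region in i_frame_regions:
        i_and &= binarize(region)
    # "I-P-frame average image": binarized pixel-wise average of all frames
    avg = np.mean(np.stack(ip_frame_regions).astype(np.float64), axis=0)
    ip_avg = binarize(np.round(avg).astype(np.uint8))
    return i_and & ip_avg          # final segmentation result
```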
The recognition unit 433 is configured to recognize the binary image obtained by the cutting unit 432 and extract the captions.
Specifically, the recognition unit 433 may use optical character recognition (OCR: Optical Character Recognition) software to recognize the segmented binary image and extract the caption text from it.
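For illustration, the recognition step might be invoked as follows; pytesseract is shown only as an example OCR engine and is not named by the embodiment.

```python
from PIL import Image
import pytesseract  # any OCR engine could be substituted here

def recognize_caption(binary_image_path, lang='chi_sim+eng'):
    """Run OCR on a segmented binary caption image and return the text."""
    return pytesseract.image_to_string(Image.open(binary_image_path), lang=lang).strip()
```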
It can be seen from the foregoing description that the embodiment of the present invention provides a caption information acquiring device which performs wavelet-based caption detection on the luminance component images of the data frames in a video stream and performs matching and tracking operations on the detected captions, thereby accurately determining the caption information of the data frames. Because the wavelet-based caption detection does not require the region in which the captions appear to be restricted, the caption information acquiring device provided by the embodiment of the present invention can obtain the caption information in video data without restricting the caption position region. Moreover, since only the luminance component images of some specific data frames are obtained, and the obtained captions undergo caption-area authenticity verification as well as matching and tracking operations, the device can obtain the caption information faster and more accurately, effectively improving caption detection performance. In addition, the caption information acquiring device provided by the embodiment of the present invention can also perform segmentation on the obtained captions, making them more convenient for users to use.
It should be noted that the formulas and numerical values involved in the above embodiments of the present invention do not limit the protection scope of the embodiments of the present invention in any way; when other wavelet transforms or matching and tracking techniques are adopted, corresponding adaptations can be made.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary hardware platform, or entirely by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in whole or in the part that contributes over the background art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in parts of the embodiments.
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that can readily occur to those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (11)

1. A video caption information acquiring method, characterized by comprising:
performing wavelet transformation on the luminance component image of a data frame to generate a horizontal subband, a vertical subband and a diagonal subband, and taking the absolute values of the coefficients of the horizontal subband, the vertical subband and the diagonal subband respectively, to obtain a horizontal high-frequency subband texture map, a vertical high-frequency subband texture map and a diagonal high-frequency subband texture map;
performing caption point detection on the horizontal, vertical and diagonal high-frequency subband texture maps respectively to generate initial caption point images in the horizontal, vertical and diagonal directions, performing noise-removal processing on the initial caption point images of the three directions respectively to obtain final caption point images of the three directions, and taking the intersection of the final caption point images of the three directions to obtain the caption point image of the data frame;
generating a caption area image from the caption point image of the data frame;
determining the number of captions and the caption area position information in the caption area image;
obtaining attribute information of the detected captions, the attribute information comprising: basic information, scene information and matching information of the captions;
extracting the detected captions according to the attribute information.
2. The method according to claim 1, characterized in that, before the wavelet-based caption detection is performed on the luminance component image of the data frame, the method further comprises: obtaining the luminance component image of a specific data frame.
3. The method according to claim 1, characterized in that generating the caption area image from the caption point image of the data frame comprises:
performing a closing operation and an opening operation in the horizontal direction on the caption point image to obtain a horizontal image, and performing a closing operation and an opening operation in the vertical direction on the caption point image to obtain a vertical image;
taking the union of the obtained horizontal image and vertical image to obtain a maximum point-set image containing all caption areas;
performing a closing operation on the maximum point-set image to obtain the caption area image.
4. The method according to claim 3, characterized in that determining the number of captions and the caption area position information in the caption area image comprises:
distinguishing horizontal caption areas and vertical caption areas in the caption area image;
determining the positions of the top border, bottom border, left border and right border of the bounding box of the horizontal caption area in the horizontal image from the coordinate positions of the topmost, bottommost, leftmost and rightmost pixels of the horizontal caption area in the horizontal image; and determining the positions of the top border, bottom border, left border and right border of the bounding box of the vertical caption area in the vertical image from the coordinate positions of the topmost, bottommost, leftmost and rightmost pixels of the vertical caption area in the vertical image;
performing horizontal projection and vertical projection respectively on the regions of the combined high-frequency subband texture map corresponding to the bounding box of the horizontal caption area and to the bounding box of the vertical caption area, determining the peak-valley information of the projection curves, and determining, according to the peak-valley information, the number of captions in the caption area and the positions of the top border and bottom border of each caption.
5. The method according to any one of claims 1 to 4, characterized in that the attribute information of the captions comprises the start frame, the end frame and the appearance position information of the captions.
6. The method according to claim 5, characterized in that obtaining the start frame and the end frame of the detected captions comprises:
judging whether the current I frame in which the detected captions are located matches the previous I frame of the current I frame;
if they match, determining whether the captions are dynamic captions or static captions according to the matching speed computed from the relative position difference of the matched captions;
if the captions are dynamic captions, determining the start frame and the end frame of the dynamic captions according to the matching speed of the dynamic captions and the position of the captions in the current frame;
if the captions are static captions, extracting the direct-current (DC) lines of the static captions, and performing a matching operation on the DC lines to determine the start frame and the end frame of the static captions.
7. The method according to claim 1, characterized in that extracting the detected captions according to the attribute information comprises:
extracting, according to the start frame, the end frame and the appearance position information of the captions, the caption frames in the captions to be used for segmentation;
determining the caption area corresponding to the extracted caption frames, and performing binarization segmentation on the caption area to obtain a binary image;
recognizing the binary image to obtain the captions.
8. A video caption information acquiring device, characterized by comprising:
a detection module, configured to perform wavelet-based caption detection on the luminance component images of the data frames in a video stream;
a first acquisition module, configured to obtain attribute information of the captions detected by the detection module;
an extraction module, configured to extract the captions detected by the detection module according to the caption attribute information obtained by the first acquisition module;
wherein the detection module comprises:
a first acquiring unit, configured to perform wavelet transformation on the luminance component image of a data frame to obtain a horizontal high-frequency subband texture map, a vertical high-frequency subband texture map and a diagonal high-frequency subband texture map;
a second acquiring unit, configured to obtain the caption point image of the data frame according to the horizontal, vertical and diagonal high-frequency subband texture maps obtained by the first acquiring unit;
a generation unit, configured to generate a caption area image according to the caption point image of the data frame obtained by the second acquiring unit;
a determining unit, configured to determine the number of captions and the caption area position information in the caption area image generated by the generation unit;
and the first acquisition module comprises:
a judging unit, configured to judge whether the current I frame in which the captions detected by the detection module are located matches the previous I frame of the current I frame;
a first determining unit, configured to, when the judgment result of the judging unit is a match, determine whether the captions are dynamic captions or static captions according to the matching speed computed from the relative position difference of the matched captions;
a second determining unit, configured to, when the captions are dynamic captions, determine the start frame and the end frame of the dynamic captions according to the matching speed of the dynamic captions and the position of the captions in the current frame; and, when the captions are static captions, extract the direct-current (DC) lines of the static captions and perform a matching operation on the DC lines to determine the start frame and the end frame of the static captions.
9. The device according to claim 8, characterized in that the device further comprises:
a second acquisition module, configured to obtain the luminance component image of a specific data frame.
10. The device according to claim 9, characterized in that the detection module further comprises:
a detecting unit, configured to detect whether the caption area determined by the determining unit is a true caption area.
11. The device according to claim 8, characterized in that the extraction module comprises:
an extracting unit, configured to extract, according to the start frame, the end frame and the appearance position information of the captions, the caption frames in the captions to be used for segmentation;
a cutting unit, configured to determine the caption area corresponding to the caption frames extracted by the extracting unit, and to perform binarization segmentation on the caption area to obtain a binary image;
a recognition unit, configured to recognize the binary image obtained by the cutting unit and extract the captions.
CN 200910081051 2009-03-31 2009-03-31 Method and device for acquiring video subtitle information Active CN101853381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910081051 CN101853381B (en) 2009-03-31 2009-03-31 Method and device for acquiring video subtitle information


Publications (2)

Publication Number Publication Date
CN101853381A CN101853381A (en) 2010-10-06
CN101853381B true CN101853381B (en) 2013-04-24

Family

ID=42804861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910081051 Active CN101853381B (en) 2009-03-31 2009-03-31 Method and device for acquiring video subtitle information

Country Status (1)

Country Link
CN (1) CN101853381B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102625029B (en) * 2012-03-23 2015-07-01 无锡引速得信息科技有限公司 Self-adaptive threshold caption detection method
CN103475831A (en) * 2012-06-06 2013-12-25 晨星软件研发(深圳)有限公司 Caption control method applied to display device and component
CN102902954A (en) * 2012-08-29 2013-01-30 无锡泛太科技有限公司 Pedestrian detection system and method based on internet-of-things intelligent household security gateway
JP6519329B2 (en) * 2015-06-09 2019-05-29 ソニー株式会社 Receiving apparatus, receiving method, transmitting apparatus and transmitting method
CN105828165B (en) * 2016-04-29 2019-05-17 维沃移动通信有限公司 A kind of method and terminal obtaining subtitle
CN106454151A (en) * 2016-10-18 2017-02-22 珠海市魅族科技有限公司 Video image stitching method and device
CN107454479A (en) * 2017-08-22 2017-12-08 无锡天脉聚源传媒科技有限公司 A kind of processing method and processing device of multi-medium data
CN111860262B (en) * 2020-07-10 2022-10-25 燕山大学 Video subtitle extraction method and device
CN112954455B (en) * 2021-02-22 2023-01-20 北京奇艺世纪科技有限公司 Subtitle tracking method and device and electronic equipment
CN113920507B (en) * 2021-12-13 2022-04-12 成都索贝数码科技股份有限公司 Rolling caption extraction method for news scene

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1461146A (en) * 2002-05-16 2003-12-10 精工爱普生株式会社 Caption pickup device
CN101448100A (en) * 2008-12-26 2009-06-03 西安交通大学 Method for extracting video captions quickly and accurately


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Junyong You et al. A Multiple Visual Models Based Perceptive Analysis Framework for Multilevel Video Summarization. IEEE Transactions on Circuits and Systems for Video Technology, 2007, Vol. 17, No. 3. *
Qian Xueming et al. GA-based global motion estimation in the compressed domain and its application to recovering caption-occluded regions. Acta Electronica Sinica, 2006, Vol. 34, No. 10. *

Also Published As

Publication number Publication date
CN101853381A (en) 2010-10-06

Similar Documents

Publication Publication Date Title
CN101853381B (en) Method and device for acquiring video subtitle information
US6366699B1 (en) Scheme for extractions and recognitions of telop characters from video data
CN103020650B (en) Station caption identifying method and device
Kumar et al. Profile view lip reading
KR100636910B1 (en) Video Search System
US7787705B2 (en) Video text processing apparatus
CN101448100B (en) Method for extracting video captions quickly and accurately
CN101102419B (en) A method for caption area of positioning video
CN102222104B (en) Method for intelligently extracting video abstract based on time-space fusion
Shivakumara et al. Efficient video text detection using edge features
Shivakumara et al. Video text detection based on filters and edge features
JP2011118498A (en) Video identifier extraction device and method, video identifier collation device and method, and program
Ishino et al. Detection system of damaged cables using video obtained from an aerial inspection of transmission lines
CN102301697B (en) Video identifier creation device
Özay et al. Automatic TV logo detection and classification in broadcast videos
Kuwano et al. Telop-on-demand: Video structuring and retrieval based on text recognition
US8311269B2 (en) Blocker image identification apparatus and method
JP2011203790A (en) Image verification device
CN101827224A (en) Detection method of anchor shot in news video
JP3655110B2 (en) Video processing method and apparatus, and recording medium recording video processing procedure
KR101323369B1 (en) Apparatus and method for clustering video frames
KR100683501B1 (en) An image extraction device of anchor frame in the news video using neural network and method thereof
Heng et al. The implementation of object-based shot boundary detection using edge tracing and tracking
JP2000182028A (en) Superimposed dialogue detecting method, its device, moving picture retrieving method and its device
CN110602444B (en) Video summarization method based on Weber-Fisher&#39;s law and time domain masking effect

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220628

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technologies Co.,Ltd.

Patentee after: Xi'an Jiao Tong University

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

Patentee before: Xi'an Jiao Tong University