CN101777124A - Method for extracting video text message and device thereof - Google Patents

Method for extracting video text message and device thereof

Info

Publication number
CN101777124A
CN101777124A (application CN201010104243A)
Authority
CN
China
Prior art keywords
character
text
english
chinese
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010104243A
Other languages
Chinese (zh)
Inventor
周景超
苗广义
鲍东山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd
Original Assignee
BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd filed Critical BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd
Priority to CN201010104243A priority Critical patent/CN101777124A/en
Publication of CN101777124A publication Critical patent/CN101777124A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a method and device for extracting text information from video. The method comprises: determining the position of a text block in a video image; segmenting and recognizing characters in the text block image according to Chinese and English character features to obtain Chinese and English character strings; correcting the recognition confidences; and merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information. The invention can segment and recognize characters in mixed Chinese/English text in video images, solves the problem that video text of different styles is difficult to handle in a unified pipeline, and can organize and classify different types of text information in a video. The architecture handles various types of video effectively and can also be conveniently customized, modified and extended.

Description

Method and device for extracting video text information
Technical field
The present invention relates to the fields of image processing and information technology, and in particular to a method and device for extracting text information from video.
Background art
In existing schemes for extracting text information from video, a system usually has good processing capability for one particular class of text but cannot handle the large variety of video text styles, and video text of different styles is difficult to process within a unified pipeline.
In the prior art, the sum of squared differences is a commonly used measure in video text tracking (described in "Automatic Text Detection and Tracking in Digital Video", IEEE Transactions on Image Processing, Vol. 9, No. 1, Pages 147-156, 2000). However, this measure does not distinguish the characters inside the text region from the background; when the background changes, the sum of squared differences increases markedly, which easily leads to misjudgment.
At present, there are two main approaches to character segmentation of mixed Chinese/English text:
1) A unified recognition engine. Chinese and English character samples are pooled to train a single OCR engine (described in "Improving Chinese/English OCR Performance by Using MCE-based Character-pair Modeling and Negative Training", The Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003), so that the mixed-language problem is solved at the recognition stage. However, during character segmentation the radical of a Chinese character may be recognized as an English character, and a combination of adjacent English characters, or of a radical and an English character, may be recognized as a Chinese character; this poses a great challenge to the coverage of the OCR training samples and to the classification strategy.
2) Separation of Chinese and English regions. The character string is divided into Chinese regions and English regions according to the geometric features of the characters; the Chinese regions are recognized with a Chinese OCR engine and the English regions with an English OCR engine, and the two groups of results are finally merged to obtain the final recognition result (described in Journal of Software, Vol. 16, No. 5, 2005, in an article on the recognition of mixed Chinese/English text). In many cases the difference between Chinese and English characters is not significant, so it is difficult to separate the regions correctly, and once the regions are misjudged a correct recognition result can no longer be obtained.
In the prior art, the confidence correction used in classifier fusion is normally carried out on the same sample set. This suits dedicated classifier design, because an identical sample set provides a natural unified standard; but for users who need to fuse classifiers trained on different sample sets, a unified recognition confidence standard cannot be established.
In the industry, research on video text extraction has concentrated on stages such as text localization, segmentation, enhancement and recognition, striving to extract text information from video comprehensively and accurately; in practical applications, however, undifferentiated text information is difficult to use.
In view of the above shortcomings and defects of the prior art, a better solution is required.
Summary of the invention
In view of this, the invention provides a method and device for extracting video text information, which can extract text information from different types of video.
A method for extracting video text information provided by an embodiment of the invention comprises:
determining the position of a text block in a video image;
segmenting and recognizing characters in the text block image according to Chinese character features to obtain a Chinese character string;
determining English regions according to the geometric features and position information of connected domains in the text block image, and segmenting and recognizing characters in the English regions to obtain an English character string;
calculating the recognition confidences of the obtained Chinese and English characters respectively, and correcting the recognition confidences;
merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information.
Preferably, the method further comprises:
monitoring and tracking text blocks in consecutive video frames, and judging whether two blocks are the same text block according to the positional relationship and image content of the text blocks in adjacent video frames;
when the text block disappears, determining the position of the text block and performing the subsequent segmentation and character recognition on it.
Preferably, the method further comprises:
preprocessing the text block region image before the text block is segmented and recognized.
An embodiment of the invention also provides a device for extracting video text information, comprising:
a position determination unit for determining the position of a text block in a video image;
a first processing unit for segmenting and recognizing the text block according to Chinese character features to obtain a Chinese character string;
a second processing unit for determining English regions according to the geometric features and position information of connected domains in the text block, and segmenting and recognizing the English regions to obtain an English character string;
a computing unit for calculating the recognition confidences of the obtained Chinese and English characters respectively and correcting the recognition confidences;
a merging unit for merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information.
Preferably, the device further comprises:
a monitoring and tracking unit for monitoring and tracking text blocks in consecutive video frames;
a judging unit for judging whether two blocks are the same text block according to the position information and image content of the text blocks in adjacent video frames provided by the monitoring and tracking unit;
if the video frames contain different text blocks, the judging unit determines the regions of these different text blocks, and the first processing unit and the second processing unit then segment and recognize these text blocks respectively.
In summary, the method and device for extracting video text information provided by the invention determine the position of a text block in a video image; segment and recognize the text block image according to Chinese and English character features respectively to obtain Chinese and English character strings; correct the recognition confidences; and merge the Chinese and English character strings based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information. According to the invention, character segmentation and recognition can be performed on mixed Chinese/English text in video images, the problem that video text of different styles is difficult to handle in a unified pipeline can be solved, and different types of text information in a video can be organized and classified. The architecture handles various types of video effectively and can also be conveniently customized, modified and extended.
Brief description of the drawings
Fig. 1 is a flowchart of the method for extracting video text information provided by an embodiment of the invention;
Fig. 2 is a flowchart of text block localization provided by an embodiment of the invention;
Fig. 3 is a flowchart of character string segmentation and recognition of the text block image provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of correcting the recognition confidences of Chinese and English characters provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of extracting mixed Chinese/English/digit text from a video image provided by an embodiment of the invention;
Fig. 6 is a schematic diagram of a video image with multiple types of text provided by an embodiment of the invention;
Fig. 7 is a flowchart of layout analysis provided by an embodiment of the invention;
Fig. 8 is a schematic structural diagram of the device for extracting video text information provided by an embodiment of the invention.
Detailed description of the embodiments
In view of the deficiencies and defects of the prior art, the invention proposes a method for extracting text information from video images that can perform character segmentation and recognition more effectively on mixed Chinese/English text, solves the problem that video text of different styles is difficult to handle in a unified pipeline, and can organize and classify different types of text information in a video. The architecture handles various types of video effectively and can also be conveniently customized, modified and extended.
In the character segmentation method for mixed Chinese/English text proposed by the invention, the recognition confidences of the Chinese and English character OCR engines are corrected so that the confidences of the two engines become comparable; the character string is then segmented and recognized as Chinese characters, candidate English regions are found in the string according to character features, and the English characters in those regions are segmented and recognized; where the recognition results of the two kinds of characters complement or overlap each other, a choice is made according to character position and recognition confidence. This avoids training a complicated OCR engine, and the segmentation result does not depend heavily on a region-separation decision, which guarantees efficiency and stability.
In the technical scheme provided by the invention, classifier recognition confidences can be corrected across different sample sets; according to the actual situation, an effective method of correcting confidences on different sample sets is proposed from a statistical point of view.
In addition, character features are used for layout analysis. The invention proposes a layout analysis method that collects character features from a system and application point of view, so that the system outputs structured text information that is convenient for post-processing.
To make the principles, characteristics and advantages of the invention clear, specific implementations of the invention are described in detail below.
Embodiment one
With reference to Fig. 1, a method for extracting structured video text information provided by an embodiment of the invention comprises the following steps:
S101: determining the position of a text block in a video image;
As shown in Fig. 2, the text block is first localized in four stages: preprocessing, coarse localization, projection cutting and screening. Specifically:
(1) Preprocessing comprises computing the stroke response (described in "Stroke Filter for Text Localization in Video Images", The Proceedings of the IEEE International Conference on Image Processing, October 2006) and color clustering; color clustering uses the K-means method (described in "Constrained K-means Clustering with Background Knowledge", The Proceedings of the Eighteenth International Conference on Machine Learning, 2001). The former highlights characters according to the uniform width of character strokes, the latter according to the color features of characters; one of the two processing flows is selected according to a configuration item.
Text can be enhanced and the background suppressed by computing the stroke response. The steps of computing the stroke response are: determining the stroke-response spacing according to the configuration file; computing the stroke response; binarizing, and applying a dilation operation to the resulting binary image to connect broken strokes.
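The stroke filter cited above responds where a strip of the configured stroke width differs from its surroundings. Below is a minimal sketch of this preprocessing step, assuming OpenCV and NumPy; the simple center-minus-flank response, the stroke width and the threshold are illustrative stand-ins for the configured values, not the exact filter of the cited paper.
    import cv2
    import numpy as np

    def stroke_response(gray, stroke_width=5):
        # bright strokes respond where a strip of width `stroke_width`
        # is brighter than a wider neighborhood around it
        w = stroke_width
        center = cv2.blur(gray.astype(np.float32), (w, w))
        flank = cv2.blur(gray.astype(np.float32), (3 * w, 3 * w))
        return np.clip(center - flank, 0, None)

    def stroke_binary(gray, stroke_width=5, thresh=10.0, dilate_iter=1):
        resp = stroke_response(gray, stroke_width)
        binary = (resp > thresh).astype(np.uint8) * 255
        # dilation connects strokes broken by thresholding
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        return cv2.dilate(binary, kernel, iterations=dilate_iter)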
(2) Coarse localization
Text regions are detected according to the dense arrangement of characters, and their approximate positions are obtained. Projection cutting splits detected multi-line text into single-line text and obtains relatively accurate boundaries of the text regions, which facilitates subsequent segmentation. In the verification stage, features of the text regions are extracted and false alarms are screened out.
On the binary image, the approximate positions of the text regions are first obtained by coarse localization, and the regions are then located accurately. The coarse localization steps are: labeling the connected domains; determining text regions according to the geometric constraints of real text blocks, such as size and arrangement position; and merging text regions in the horizontal or vertical direction (described in "A Robust System for Text Extraction in Video", The Proceedings of International Conference on Machine Vision, December 2007).
(3) Projection cutting
Multi-line text often appears in video images and is frequently detected as a single text block during coarse detection. The subsequent segmentation stage requires single-line text, so potential multi-line text in this region needs to be cut into several single-line texts. Taking connected domains as the unit, a projection cutting method is adopted (described in "Character location in scene images from digital camera", Pattern Recognition, Volume 36, Issue 10, Pages 2287-2299, 2003); it effectively resolves the adhesion between text lines and between text and its surrounding background that occurs in some cases, and guarantees that each candidate region after cutting is a single line of text.
(4) Screening
The candidate text blocks obtained by the above processing contain false alarms and need to be verified: according to the geometric features of the text region, according to the stroke response, and according to gradient features. The verification stage screens out most of the false alarms in the localization results; the tracking and segmentation stages can still screen false alarms according to the information acquired there.
Step S102: judging whether two blocks are the same text block according to the positional relationship and image content of the text blocks in adjacent video frames;
When a tracked text block in the video frames disappears, i.e. no longer continues or is replaced by other text, the position of the text block is determined and the subsequent segmentation and character recognition are performed on it.
During text block localization, a text block usually persists in the video for some time, so the same text block is located in tens or even hundreds of consecutive frames. If every localization result were segmented and recognized, a large amount of processing time would be wasted. With tracking, the same text block is segmented and recognized only once during the period from its appearance to its disappearance, avoiding repeated processing. Moreover, the start/end time and the disappearance mode of a text block are important evidence for the layout analysis stage. Text blocks therefore need to be tracked.
The tracking stage comprises three parts: position judgment, temporal judgment and maintenance of the tracking array. Position judgment and temporal judgment analyze the localization results from two aspects, whether the positions overlap and whether the content continues; maintenance of the tracking array provides independent text blocks according to the processing logic. Specifically:
I) Position judgment
The position of the same text block remains fixed on successive frames, so the text block positions obtained during localization coincide, while different text blocks appear at different positions on successive frames and do not coincide; positional coincidence is therefore a necessary condition for two text blocks located on successive frames to be the same text block. There are four positional relations: separate, touching without overlap, overlapping, and containing; the judgment is made according to the proportion of the overlapping area in each text block. If the blocks are separate or merely touching, they are unrelated in position and are judged to be different text blocks; if they overlap or one contains the other, they may come from the same text block and a further judgment is needed. The boundary of the text block to be tracked is determined according to its positions on the preceding and following frames.
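A minimal sketch of the position judgment follows, with each text block given as a (left, top, right, bottom) box; the 0.8 containment ratio is an assumed threshold, not a value from the patent.
    def position_relation(a, b):
        # classify the positional relation of two located text blocks
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        iw = min(ax1, bx1) - max(ax0, bx0)
        ih = min(ay1, by1) - max(ay0, by0)
        if iw <= 0 or ih <= 0:
            return "separate"        # independent, or touching without overlap
        inter = iw * ih
        smaller = min((ax1 - ax0) * (ay1 - ay0), (bx1 - bx0) * (by1 - by0))
        return "contain" if inter > 0.8 * smaller else "overlap"

    def may_be_same_block(a, b):
        # only overlapping or containing blocks may come from the same text block
        return position_relation(a, b) in ("overlap", "contain")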
II) Temporal judgment
Temporal judgment decides from the image content whether two text blocks located in adjacent frames come from the same text. There are four temporal relations: a) keep, the text in the two frames does not change; b) replace, the text in the previous frame is replaced by new text in the following frame and the text content differs; c) disappear, the text in the previous frame disappears; d) false alarm, the text region located in the previous frame is noise.
When the text position is fixed, the sum of squared differences between the gray images of the previous and following frames is an effective criterion for judging whether the text content has changed. If the pixels of character strokes and of the background within the text region are not distinguished and the sum of squared differences is computed over the whole region, the judgment is easily disturbed by background changes and becomes unstable; here only pixels with a large stroke response are compared, and since these points lie on character strokes the algorithm is more stable. The temporal judgment is made according to the gray difference and the stroke-response difference between the two text blocks.
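A minimal sketch of this temporal comparison, assuming two gray crops of the same tracked region and the stroke response of the earlier one; the response threshold and the change threshold are assumed values.
    import numpy as np

    def content_changed(prev_gray, curr_gray, prev_resp,
                        resp_thresh=10.0, change_thresh=200.0):
        # compare only pixels whose stroke response is large, i.e. stroke pixels
        mask = prev_resp > resp_thresh
        if not mask.any():
            return True              # no reliable stroke pixels: treat as changed
        diff = prev_gray.astype(np.float32) - curr_gray.astype(np.float32)
        msd = float(np.mean(diff[mask] ** 2))   # mean squared difference on strokes
        return msd > change_thresh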
III) Maintenance of the tracking array
In order to track the text blocks appearing in the video, a tracking array needs to be maintained. Specifically, a text block newly appearing in the current frame is added to the array together with its localization result; for a text block that continues to appear, its element is kept in the array; for a text block that disappears, its start/end time and disappearance mode are determined, the image of the best quality within its lifetime is selected and submitted to the segmentation stage, and the element is then deleted from the array.
Another task of maintaining the tracking array is to pick, from the frames in which a text block appears continuously, the frame of the best quality and submit it to the segmentation stage; this reduces the difficulty of segmentation and improves the final recognition accuracy.
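A minimal sketch of this bookkeeping follows; `may_be_same_block` is the position test sketched earlier, each detection is assumed to carry a quality score and an image crop, and the field names are illustrative.
    def update_tracks(tracks, detections, frame_no):
        # tracks: list of dicts {box, first, last, best_quality, best_frame}
        for det in detections:
            for trk in tracks:
                if trk["last"] == frame_no - 1 and may_be_same_block(trk["box"], det["box"]):
                    trk["box"], trk["last"] = det["box"], frame_no   # block continues
                    if det["quality"] > trk["best_quality"]:
                        trk["best_quality"], trk["best_frame"] = det["quality"], det["image"]
                    break
            else:                    # no match: a newly appearing text block
                tracks.append({"box": det["box"], "first": frame_no, "last": frame_no,
                               "best_quality": det["quality"], "best_frame": det["image"]})
        finished = [t for t in tracks if t["last"] < frame_no]   # disappeared blocks
        alive = [t for t in tracks if t["last"] >= frame_no]
        return alive, finished       # `finished` blocks go on to segmentation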
Step S103: obtaining the text block image and preprocessing it;
With reference to Fig. 3, before segmentation and recognition, when the video image is a color image it is converted into a gray image; the Chinese and English characters are then segmented and recognized respectively, and the resulting Chinese and English character strings are merged to obtain the text information. A gray image needs no such preprocessing, and the Chinese and English characters can be segmented and recognized directly.
The text block image is binarized to separate the characters from the background and determine the character boundaries;
connected domain analysis is performed on the resulting binary image to obtain the position and size information of the character strokes.
Preprocessing comprises conversion to a gray image, binarization and connected domain analysis. The candidate text region image obtained in the localization stage is a color image, while binarization and character recognition use a gray image, so a conversion is needed, specifically:
i) extracting the luminance component;
ii) extracting the color channel (R, G or B) of the color image on which the intensity contrast between characters and background is most obvious;
iii) converting the color space and changing the distance metric between colors (described in "Color text extraction from camera-based images: the impact of the choice of the clustering distance", The Proceedings of International Conference on Document Analysis and Recognition, 2005) to obtain a gray image with obvious intensity contrast between characters and background;
iv) color enhancement: one or more representative colors are specified for the characters and for the background respectively, the pixels of the color image are clustered with the K-means method, and the luminance component of the pixels is extracted as a gray image at the same time; character pixels are enhanced and background pixels suppressed on the gray image, increasing the intensity contrast between characters and background.
In practical applications, an appropriate conversion method should be configured according to the characteristics of the video image, especially the color contrast between characters and background, to improve the effect of the subsequent binarization.
Binarization separates the characters from the background in the image and lays the foundation for determining the character boundaries. Binarization is an important and widely studied direction in the OCR field, and many algorithms have been proposed, for example:
Global binarization algorithms: Otsu (described in "A threshold selection method from gray-scale histogram", IEEE Transactions on Systems, Man and Cybernetics, Vol. 9, Pages 62-66, 1979) and Kittler (described in "Minimum Error Thresholding", Pattern Recognition, Vol. 19, Issue 1, Pages 41-47, 1986).
Local binarization algorithms: Niblack (described in An Introduction to Digital Image Processing, Prentice Hall, 1986) and Sauvola (described in "Adaptive document image binarization", Pattern Recognition, Vol. 33, Issue 2, Pages 225-236, 2000, and "Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images", The Proceedings of SPIE, 2008).
In application, different algorithms need to be selected according to the quality of the video images to be processed.
Connected domain analysis is performed on the resulting binary image to obtain the position and size information of the character strokes. It comprises three parts: labeling, screening and merging of connected domains. Connected domain labeling reflects the connectivity between pixels in the binary image (described in "Linear-time connected-component labeling based on sequential local operations", Computer Vision and Image Understanding, Vol. 89, Issue 1, Pages 1-23, 2003); after labeling, the position, size and pixel count of each connected region in the binary image are available. In connected domain screening, rules on features such as position, size, shape and fill ratio are designed to remove unreasonable connected domains, reducing interference for the subsequent processing. Since a Chinese character generally consists of several disjoint strokes, the choice of segmentation points will be affected if its connected domains are not merged reasonably (described in "Lexicon-Driven Segmentation and Recognition of Handwritten Character Strings for Japanese Address Reading", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 11, November 2002).
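A minimal sketch of this preprocessing chain with OpenCV, using the luminance component and Otsu's global threshold; the height limits used to screen connected domains are assumed values.
    import cv2

    def preprocess_text_block(bgr, dark_text_on_light=True, min_h=4, max_h=100):
        # i) luminance component as the gray image
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
        # binarization (Otsu); invert so that character pixels become foreground
        flag = cv2.THRESH_BINARY_INV if dark_text_on_light else cv2.THRESH_BINARY
        _, binary = cv2.threshold(gray, 0, 255, flag + cv2.THRESH_OTSU)
        # connected domain labeling: position, size and pixel count of each component
        n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
        boxes = []
        for i in range(1, n):                        # label 0 is the background
            x, y, w, h, area = stats[i]
            if min_h <= h <= max_h:                  # screen unreasonable components
                boxes.append((int(x), int(y), int(w), int(h), int(area)))
        return gray, binary, boxes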
Step S104: segmenting and recognizing the text block image according to Chinese character features to obtain a Chinese character string;
The Chinese character segmentation flow comprises four parts: determining the segmentation points, pre-segmentation, character recognition and character string filtering.
According to the actual situation, the strategies for determining segmentation points are:
a. Connected domain features of the characters (described in "A Survey of Methods and Strategies in Character Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 7, July 1996). In the simple, ideal case there is a certain spacing between characters, the strokes of different characters do not touch, and the height and width of the characters allow the segmentation points to be determined accurately by combining the connected domain analysis results with the configuration items.
b. Vertical projection of the gray image of the character region. In some programs the character spacing is small and the strokes of adjacent characters easily stick together, so connected domain analysis is unsuitable; instead, the segmentation points should be determined from the local minima of the vertical projection of the gray image of the character region, combined with the constraint on character width in the configuration items (a minimal sketch of this strategy is given after this list).
c. Background skeleton model (described in "A Background Thinning Based Approach for Separating and Recognizing Connected Handwritten Digit Strings", Pattern Recognition, Vol. 32, Pages 921-933, 1999). When the strokes of adjacent characters touch closely, the position and width of the stroke adhesion are judged from the vertical projection of the background pixels, and the segmentation points are determined in combination with the constraint on character width in the configuration items.
d. Contour model (described in "Lexicon-Driven Segmentation and Recognition of Handwritten Character Strings for Japanese Address Reading", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 11, November 2002). When strokes touch, some segmentation points can be determined according to the shape features of the outer contour of the connected domain.
In practical applications, an appropriate segmentation strategy should be selected according to the character features, or different strategies should be combined so that they complement each other, in order to determine the segmentation points comprehensively and accurately.
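A minimal sketch of strategy b follows: the vertical projection of the gray image, with local minima kept as candidate segmentation points only when they respect a minimum character width; both the polarity handling and the width value are assumed configuration items.
    import numpy as np

    def projection_cut_points(gray, dark_text_on_light=True, min_char_width=8):
        # candidate segmentation columns from local minima of the vertical projection
        img = 255 - gray if dark_text_on_light else gray      # make strokes bright
        profile = img.astype(np.float32).sum(axis=0)          # column-wise projection
        cuts, last = [0], 0
        for x in range(1, profile.size - 1):
            local_min = profile[x] <= profile[x - 1] and profile[x] <= profile[x + 1]
            if local_min and x - last >= min_char_width:      # character width constraint
                cuts.append(x)
                last = x
        cuts.append(profile.size - 1)
        return cuts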
During pre-segmentation, the boundaries of candidate characters are determined from the segmentation points. If the character width is fixed, the character width in the configuration item is used directly as the constraint to determine the candidate character boundaries from the segmentation points; if the character width varies within a certain range with the typesetting, histogram statistics combined with the variation range of the character width are used to estimate the character width in the current case, and this estimate is then used as the constraint to determine the candidate character boundaries from the segmentation points (described in the application No. 200810246654.7 entitled "character extraction method and device").
During character recognition, the image of a single character is cropped from the image according to the position of the candidate character and recognized. Character recognition uses the Tsinghua Wentong OCR engine; the best recognition result for the current image is taken as the final recognition result, and the confidence of the current recognition result is calculated from the number of candidate results returned and the distances to the prototypes (described in "Adaptive Confidence Transform Based Classifier Combination for Chinese Character Recognition", Pattern Recognition Letters, Vol. 19, No. 10, 1998), which serves as the basis for character string filtering.
Character string segmentation adopts an over-segmentation strategy: the number of candidate characters is larger than the true number of characters and the recognition results contain misrecognized characters, so filtering is needed to obtain the correct character string. During filtering, candidates are accepted or rejected according to the positional overlap between adjacent candidate characters and their recognition confidences; the character string obtained after filtering is output as the final result.
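A minimal sketch of this filtering step over the over-segmented candidates, each given as (left, right, character, confidence); greedy selection by confidence with an assumed 0.3 overlap tolerance stands in for the acceptance rule, which the patent does not spell out in detail.
    def filter_candidates(cands, max_overlap=0.3):
        # cands: list of (left, right, char, conf); keep high-confidence candidates
        # that do not overlap the already accepted ones too much
        chosen = []
        for cand in sorted(cands, key=lambda c: c[3], reverse=True):
            left, right = cand[0], cand[1]
            clash = False
            for kl, kr, _, _ in chosen:
                inter = min(right, kr) - max(left, kl)
                if inter > max_overlap * min(right - left, kr - kl):
                    clash = True
                    break
            if not clash:
                chosen.append(cand)
        chosen.sort(key=lambda c: c[0])              # back to reading order
        return "".join(c[2] for c in chosen)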
Step S105: determining English regions according to the geometric features and position information of connected domains in the text block image, and segmenting and recognizing the English regions to obtain an English character string;
In mixed Chinese/English text, a single English character or a combination of adjacent English characters is often misrecognized as a Chinese character, while a radical of a Chinese character or a simple Chinese character with few strokes may be misrecognized as an English character; English segmentation therefore cannot rely on the recognition results alone.
In the embodiment of the invention, the English regions are determined first according to appearance features, and segmentation and recognition are then performed with this bias; this comprises English region judgment and English character recognition, and the recognition result is output in the form of an English character string.
In the candidate English region judgment stage, the candidate English regions in the image are found according to the geometric features and adjacency of the connected domains. In mixed Chinese/English text, English characters differ from Chinese characters in two respects: the widths of Chinese and English characters differ, the English character width being smaller; and the center distance between English characters is smaller while that between Chinese characters is larger, the center distance changing at the junction of Chinese and English characters.
The size and position information of the connected domains is obtained from the preprocessing result. English characters are single connected units; ignoring adhesion, the width of an English character's connected domain is its character width, while the width of Chinese characters is obtained in the Chinese character segmentation stage. The center distance of characters is the distance between the centers of the connected domains of adjacent characters. By computing the character widths and center positions and combining the two characteristics above, the candidate English regions can be determined.
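A minimal sketch of the candidate English region test follows, using connected-domain widths and center distances; `cjk_width` stands for the character width estimated in the Chinese segmentation stage, and the 0.6 ratios are assumed thresholds.
    def candidate_english_runs(components, cjk_width, width_ratio=0.6, gap_ratio=0.6):
        # components: list of (x, y, w, h) sorted by x
        # returns index ranges [start, end) of runs that look like English/digits
        runs, start = [], None
        for i, (x, _, w, _) in enumerate(components):
            narrow = w < width_ratio * cjk_width          # English characters are narrower
            if narrow and start is None:
                start = i
            close = False
            if narrow and i + 1 < len(components):
                nx, _, nw, _ = components[i + 1]
                center_gap = (nx + nw / 2.0) - (x + w / 2.0)
                close = center_gap < gap_ratio * cjk_width    # small center distance
            if start is not None and not (narrow and close):
                runs.append((start, i + 1 if narrow else i))
                start = None
        if start is not None:
            runs.append((start, len(components)))
        return runs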
The candidate English regions determined in this way often contain non-English content, such as punctuation marks or strokes of Chinese characters, which can be removed in the Chinese/English character merging stage.
A self-developed OCR engine is used for English character recognition: (1) the recognition engine focuses only on English letters and digits, and since the number of classes to distinguish is very small a high recognition accuracy can be obtained; (2) the samples can be extended and the training set customized according to the actual situation, making the recognition results closer to the real application.
The recognition engine extracts a combined feature consisting of the directional line element feature of the character (described in "A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 3, March 1999) and the gradient feature (described in "Normalization-Cooperated Gradient Feature Extraction for Handwritten Character Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 8, 2007); the feature is reduced in dimension with LDA (described in Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990); the classifier is trained with the DLQDF algorithm (described in "Discriminative Learning Quadratic Discriminant Function for Handwriting Recognition", IEEE Transactions on Neural Networks, Vol. 15, No. 2, March 2004). The classifier outputs the recognition result and its confidence, and the confidence calculation method is the same as for Chinese characters.
Step S106: calculating the recognition confidences of the obtained Chinese and English characters respectively, and correcting the recognition confidences;
Because Chinese and English recognition use different engines, the scales of the prototype spaces of the two engines differ greatly and the sample distance metrics also differ, so the computed recognition confidences are not comparable and the two kinds of recognition confidences need to be corrected before merging. Confidence correction is generally carried out on the same sample space, but here the Chinese and English characters are recognized separately and the sample spaces of the two engines do not overlap, so direct correction is impossible.
With reference to Fig. 4, for example, assuming that the recognition confidences of Chinese and English characters follow Gaussian distributions (described in "Classifier Combination Based on Confidence Transformation", Pattern Recognition, Vol. 38, Pages 11-28, 2005), the recognition confidence of the Chinese characters is taken as the reference and the recognition confidence of the English characters is corrected as follows:
(1) on a sample set (news headlines), the recognition confidences of the Chinese characters are divided into 5 levels according to their statistics, and the mean confidence of each level, a1, a2, a3, a4, a5, is computed;
(2) the English characters in the same headline are assigned the same level as the Chinese characters;
(3) the mean confidence of the English characters of each level, b1, b2, b3, b4, b5, is computed;
(4) a linear fit is performed between the mean confidences of the five levels of the Chinese and English characters (described in Statistical Inference, China Machine Press, 2005);
(5) the recognition confidence of the English characters is redefined according to the fitting parameters.
In this way the corrected English characters have confidences consistent with those of the Chinese characters.
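A minimal sketch of this five-level correction with NumPy follows. The grouping of Chinese confidences by quantile and the assignment of each headline's majority level to its English characters are assumptions about details the patent leaves open; the linear fit and the final mapping follow steps (4) and (5).
    import numpy as np

    def fit_english_correction(zh_conf, zh_headline, en_conf, en_headline, levels=5):
        # zh_conf/en_conf: raw confidences; zh_headline/en_headline: headline id per character
        zh_conf = np.asarray(zh_conf, dtype=float)
        en_conf = np.asarray(en_conf, dtype=float)
        zh_headline = np.asarray(zh_headline)
        edges = np.quantile(zh_conf, np.linspace(0.0, 1.0, levels + 1))
        zh_level = np.clip(np.searchsorted(edges, zh_conf, side="right") - 1, 0, levels - 1)
        a = np.array([zh_conf[zh_level == k].mean() for k in range(levels)])   # a1..a5
        # an English character takes the level of the Chinese characters in its headline
        head_level = {h: np.bincount(zh_level[zh_headline == h], minlength=levels).argmax()
                      for h in np.unique(zh_headline)}
        en_level = np.array([head_level[h] for h in en_headline])
        b = np.array([en_conf[en_level == k].mean() for k in range(levels)])   # b1..b5
        slope, intercept = np.polyfit(b, a, 1)        # linear fit of b onto a
        return slope, intercept

    def correct_english_confidence(conf, slope, intercept):
        # redefine the English confidence on the Chinese scale
        return slope * conf + intercept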
Step S107: merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information.
In the merging stage, the two character strings are merged by comparing the Chinese and English character strings on position and recognition confidence, and the merged result is output as the final result. The embodiment of the invention adopts a "plug-in" strategy, specifically:
English characters that were omitted are inserted at the appropriate positions in the Chinese character string; the omission occurs because during Chinese character pre-segmentation the width of the English character did not meet the requirement and it was screened out;
where Chinese and English characters overlap, the recognition confidences of the two kinds of characters are compared and results misrecognized as Chinese characters are replaced by English recognition results of higher confidence; the misrecognition occurs because two adjacent English characters were treated as one Chinese character during pre-segmentation.
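A minimal sketch of the "plug-in" merge follows, with each recognized character given as (left, right, character, confidence); treating any horizontal intersection as a conflict and requiring the English confidence to beat all conflicting Chinese results are assumptions about the exact rule.
    def merge_strings(zh_chars, en_chars):
        # zh_chars / en_chars: lists of (left, right, char, conf) in reading order
        merged = list(zh_chars)
        for el, er, ech, econf in en_chars:
            clash = [i for i, (zl, zr, _, _) in enumerate(merged) if min(er, zr) > max(el, zl)]
            if not clash:
                merged.append((el, er, ech, econf))          # omitted character: insert it
            elif all(merged[i][3] < econf for i in clash):
                # misrecognized Chinese result(s): replace by the higher-confidence English one
                merged = [m for i, m in enumerate(merged) if i not in clash]
                merged.append((el, er, ech, econf))
        merged.sort(key=lambda c: c[0])
        return "".join(c[2] for c in merged)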
For example, as shown in Fig. 5, a text image with mixed Chinese, English and digits is captured from the screen, its content being a headline about London escorting the G20 summit with 7,200,000 pounds. In the result obtained by Chinese character segmentation and recognition alone, the combinations containing '7' and '72' are removed in the character string filtering stage and '20' is misrecognized as a Chinese character; by comparing the Chinese and English character strings on position and recognition confidence, the correct headline is obtained after merging.
Step S108: analyzing the layout of the video image to obtain the text features in the video image; organizing and classifying the text information obtained after merging.
The text contained in video is of many kinds, and different kinds of text carry different meanings; as shown in Fig. 6, the text in the image includes types such as title, subtitle, station logo, caption and scrolling bar. In video search and automatic video cataloguing, structured text information needs to be extracted from the video, and the type of the text is a feature as important as its content.
The text is organized and classified carefully and accurately according to the text features, and structured text information is output to meet the needs of different applications; as shown in Fig. 7, this comprises feature collection, text organization and text classification. Layout analysis uses the temporal features of text blocks, which can only be determined after a program segment has been processed, so an offline mode is adopted, i.e. layout analysis is carried out after a program segment has been processed.
Layout analysis comprises feature collection, text organization and text classification.
The text features used in layout analysis include:
Polarity, which reflects the relative brightness of the characters and the background in the text region; polarity 0 denotes dark characters on a light background and polarity 1 denotes light characters on a dark background. The segmentation stage can judge the text polarity automatically with an algorithm, or the polarity can be given in the configuration file to guide segmentation.
Color, including the character color and the background color. In some cases polarity is not enough to distinguish different types of text; for example, white and yellow characters on a red background both have polarity 1, and color information then needs to be considered.
Character size, including the mean width and height of a single character in the text line. The width and height of single characters are available after pre-segmentation in the segmentation stage, and the mean width and height of single characters in the text line are computed from them.
Text block position, including the upper, lower, left and right boundaries of the text block.
Recognition result: the character string obtained after the text block image is segmented and recognized, provided by the segmentation stage.
Start/end time of the text block: the moments at which the text block appears and disappears.
Temporal relation of the text block: the tracking stage provides four relations during temporal judgment, namely keep, disappear, replace and false alarm; the text blocks submitted belong to two kinds, disappear and replace.
These features are the basis of layout analysis; in the subsequent processing, the features should be combined flexibly and rules designed according to the characteristics of the video being processed, and there is no unified processing flow.
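A minimal sketch of a record collecting the features listed above for one text block; the field names and types are illustrative, not the patent's.
    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class TextBlockFeatures:
        polarity: int                      # 0: dark text on light background, 1: the reverse
        char_color: Tuple[int, int, int]   # representative character color
        bg_color: Tuple[int, int, int]     # representative background color
        char_width: float                  # mean single-character width in the line
        char_height: float                 # mean single-character height in the line
        box: Tuple[int, int, int, int]     # upper, lower, left, right boundaries
        text: str                          # recognition result of the text block
        start_frame: int                   # frame at which the block appears
        end_frame: int                     # frame at which the block disappears
        temporal_relation: str             # "disappear" or "replace"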
Text organization includes: merging multi-line text on the same frame; merging the same text block across consecutive frames.
After projection cutting, the text blocks being processed are all single-line text, and these single lines may need to be combined to express a complete meaning, such as a multi-line news headline. Within the same frame, spatially scattered single-line text groups are combined into a complete logic unit according to information such as the position, character size and color of the text blocks, in combination with the characteristics of the video being processed.
In some cases, text that appears consecutively may need to be combined to express a complete meaning, or the same text may appear intermittently several times, as with a news headline; temporally scattered text groups then need to be combined into a complete logic unit according to information such as the recognition results, character size and color of the text blocks.
Text classification: the form in which text appears differs between video programs. For one class of program some classification rules can be summarized by observation, but in another class of program these rules may no longer hold. Text classification therefore has no concrete unified processing flow, and classification can be done by combining text features with templates.
Embodiment two
With reference to Fig. 8, an embodiment of the invention also provides a device 200 for extracting video text information, comprising:
a position determination unit 210 for determining the position of the text block region in a video image;
a Chinese character processing unit 220 for segmenting and recognizing the text block according to Chinese character features to obtain a Chinese character string;
an English character processing unit 230 for determining English regions according to the geometric features and position information of connected domains in the text block, and segmenting and recognizing the English regions to obtain an English character string;
a computing unit 240 for calculating the recognition confidences of the obtained Chinese and English characters respectively and correcting the recognition confidences;
a merging unit 250 for merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information.
The device 200 further comprises:
a monitoring and tracking unit 260 for monitoring and tracking text blocks in consecutive video frames;
a judging unit 270 for judging whether two blocks are the same text block according to the position information and image content of the text blocks in adjacent video frames provided by the monitoring and tracking unit;
if the video frames contain different text blocks, the judging unit 270 determines the regions of these different text blocks, and the Chinese character processing unit 220 and the English character processing unit 230 then segment and recognize these text blocks respectively.
The computing unit 240 includes a correction subunit 241 for correcting the recognition confidence of the English characters with the recognition confidence of the Chinese characters as the reference; the correction subunit 241 comprises:
a division module 241a for dividing the recognition confidences of the Chinese characters into several levels and calculating the mean confidence of each level, the English characters of the same text line having the same level as the Chinese characters;
a computing module 241b for calculating the mean confidence of the English characters of each level;
an adjusting module 241c for performing a linear fit between the mean confidences of each level of the Chinese and English characters, and redefining the recognition confidence of the English characters according to the fitting parameters.
The device 200 is also provided with a preprocessing unit 270 for preprocessing the text block before it is segmented and recognized; the preprocessing unit 270 specifically comprises:
an image processing module 270a for binarizing the text block region image and separating the characters from the background in the image to determine the character boundaries;
an image analysis module 270b for performing connected domain analysis on the resulting binary image to obtain the position and size information of the character strokes.
In summary, the method and device for extracting structured video text information provided by the invention determine the position of a text block in a video image by localization; track the text block; segment and recognize the text block image according to Chinese and English character features respectively to obtain Chinese and English character strings; correct the recognition confidences of the Chinese and English characters; and merge the Chinese and English character strings based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information. According to the invention, character segmentation and recognition can be performed on mixed Chinese/English text in video images, the problem that video text of different styles is difficult to handle in a unified pipeline can be solved, and different types of text information in a video can be organized and classified. The architecture handles various types of video effectively and can also be conveniently customized, modified and extended.
The disclosed embodiments enable those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined here may also be applied to other embodiments without departing from the scope and purport of the invention. The embodiments described above are only preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included within its protection scope.

Claims (10)

1. A method for extracting video text information, characterized by comprising:
determining the position of a text block in a video image;
segmenting and recognizing characters in the text block image according to Chinese character features to obtain a Chinese character string;
determining English regions according to the geometric features and position information of connected domains in the text block image, and segmenting and recognizing characters in the English regions to obtain an English character string;
calculating the recognition confidences of the obtained Chinese and English characters respectively, and correcting the recognition confidences;
merging the Chinese character string and the English character string based on the corrected character recognition confidences and the positional relationship between the Chinese and English characters to obtain the text information.
2. The method of claim 1, characterized by further comprising:
monitoring and tracking text blocks in consecutive video frames, and judging whether two blocks are the same text block according to the positional relationship and image content of the text blocks in adjacent video frames;
when the text block disappears, determining the position of the text block and performing the subsequent segmentation and character recognition on it.
3. The method of claim 2, characterized in that judging whether two blocks are the same text block according to the positional relationship and image content of the text blocks in adjacent video frames is specifically:
if the regions of the text blocks in adjacent video frames are separate or do not overlap, judging that the text blocks in the adjacent video frames are different text blocks;
if the regions of the text blocks in adjacent video frames overlap or one contains the other, judging that the text blocks in the adjacent video frames are the same text block.
4. The method of claim 1, characterized in that the step of correcting the recognition confidences comprises, taking the recognition confidence of the Chinese characters as the reference, correcting the recognition confidence of the English characters:
dividing the recognition confidences of the Chinese characters into several levels and calculating the mean confidence of each level, the English characters of the same text line having the same level as the Chinese characters;
calculating the mean confidence of the English characters of each level;
taking the mean confidence of each level of the Chinese characters as the benchmark, performing a linear fit between the mean confidences of the Chinese and English characters of the same level;
redefining the recognition confidence of the English characters according to the fitting parameters.
5. The method of claim 1, characterized by further comprising, before the text block is segmented and recognized, a step of preprocessing the text block region image:
when the video image is a color image, converting the video image into a gray image;
binarizing the text block region image, and separating the characters from the background in the image to determine the character boundaries;
performing connected domain analysis on the resulting binary image to obtain the position and size information of the character strokes.
6. The method of claim 1, characterized by further comprising:
performing layout analysis on the video image to obtain the text features in the video image;
organizing and classifying the text information according to the text features.
7. a device that extracts video text message is characterized in that, comprising:
Position determination unit is used for determining the position of video image Chinese version piece;
First processing unit is cut apart and character recognition described text block according to the Chinese character feature, obtains the Chinese character string;
Second processing unit is determined English zone according to the geometric properties and the positional information of connected domain in the described text block, and described English zone is cut apart and character recognition, obtains the English character string;
Computing unit is used for calculating respectively the recognition confidence of resulting Chinese character, English character, and recognition confidence is proofreaied and correct;
Merge cells is used for based on character recognition degree of confidence after proofreading and correct and the relation of the position between Chinese character and the English character described Chinese character string and Chinese character string being merged, and obtains text message.
8. device as claimed in claim 7 is characterized in that, also comprises:
The monitoring tracking cell, the text block that is used for monitoring and following the tracks of the continuous videos picture frame;
Judging unit, positional information and picture material that the adjacent video picture frame Chinese version piece that provides according to described monitoring tracking cell is provided judge whether to be the one text piece;
If in the described video frame image is different text block, described judging unit is determined the zone of this difference text block, and then described first processing unit is cut apart and character recognition these different text block respectively with second processing unit.
9. device as claimed in claim 7 is characterized in that, has the syndrome unit in the described computing unit, is used for being as the criterion with the recognition confidence of Chinese character, and the recognition confidence of English character is proofreaied and correct, and this syndrome unit comprises:
Diversity module is used for the recognition confidence of Chinese character is divided into some grades, and calculates the degree of confidence average of each grade, and the English character of same capable text block has identical grade with Chinese character;
Computing module is used to calculate the degree of confidence average of the English character of each grade;
Adjusting module, the degree of confidence average that is used for each grade of Chinese character is a target, the degree of confidence average of centering, each grade of English character is carried out linear fit; And, redefine the recognition confidence of English character according to fitting parameter.
10. The device as claimed in claim 7, characterized in that a preprocessing unit is further provided, configured to preprocess the text block before the text block is segmented and recognized, the preprocessing unit specifically comprising:
an image processing module, configured to convert the text block image into a grayscale image, binarize the grayscale image, and separate the characters from the background in the image to determine character boundaries;
an image analysis module, configured to perform connected component analysis on the generated binary image to obtain the position and size information of character strokes.
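For the preprocessing unit of claim 10, an OpenCV sketch of one reasonable realization: grayscale conversion, Otsu binarization to separate characters from background, and connected component analysis to recover stroke positions and sizes. The Otsu threshold and the polarity assumption (dark text on a light background) are illustrative choices, not details from the patent:

    import cv2

    def preprocess_text_block(block_bgr):
        """Grayscale + binarize a text block and return stroke bounding boxes.

        Returns the binary image and a list of (x, y, w, h) boxes, one per
        connected component (a stroke or group of touching strokes).
        """
        gray = cv2.cvtColor(block_bgr, cv2.COLOR_BGR2GRAY)

        # Otsu's threshold separates characters from background; THRESH_BINARY_INV
        # assumes dark text on a lighter background (swap for the opposite polarity).
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

        # Connected component analysis; each stats row is [x, y, w, h, area].
        n_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        boxes = []
        for label in range(1, n_labels):                 # label 0 is the background
            x, y, w, h, area = stats[label]
            if area >= 3:                                # drop isolated noise pixels
                boxes.append((int(x), int(y), int(w), int(h)))
        return binary, boxes

cv2.connectedComponentsWithStats is used here because it returns bounding boxes and areas in a single call; contour extraction would serve equally well.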
CN201010104243A 2010-01-29 2010-01-29 Method for extracting video text message and device thereof Pending CN101777124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010104243A CN101777124A (en) 2010-01-29 2010-01-29 Method for extracting video text message and device thereof

Publications (1)

Publication Number Publication Date
CN101777124A true CN101777124A (en) 2010-07-14

Family

ID=42513582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010104243A Pending CN101777124A (en) 2010-01-29 2010-01-29 Method for extracting video text message and device thereof

Country Status (1)

Country Link
CN (1) CN101777124A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101002198A (en) * 2004-06-23 2007-07-18 Google公司 Systems and methods for spell correction of non-roman characters and words
CN1808468A (en) * 2005-01-17 2006-07-26 佳能信息技术(北京)有限公司 Optical character recognition method and system
CN101097600A (en) * 2006-06-29 2008-01-02 北大方正集团有限公司 Character recognizing method and system
CN101251892A (en) * 2008-03-07 2008-08-27 北大方正集团有限公司 Method and apparatus for cutting character

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169413A (en) * 2011-03-30 2011-08-31 黄冬明 Device and method for obtaining character stroke lines based on video stream image
CN102411775A (en) * 2011-08-10 2012-04-11 Tcl集团股份有限公司 Method and system for correcting brightness of text image
CN102411775B (en) * 2011-08-10 2015-02-11 Tcl集团股份有限公司 Method and system for correcting brightness of text image
CN104106078A (en) * 2012-01-09 2014-10-15 高通股份有限公司 Ocr cache update
CN105247509A (en) * 2013-03-11 2016-01-13 微软技术许可有限责任公司 Detection and reconstruction of east asian layout features in a fixed format document
US10127221B2 (en) 2013-03-11 2018-11-13 Microsoft Technology Licensing, Llc Detection and reconstruction of East Asian layout features in a fixed format document
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device
CN104751153B (en) * 2013-12-31 2018-08-14 中国科学院深圳先进技术研究院 A kind of method and device of identification scene word
CN103744971B (en) * 2014-01-10 2017-09-15 广东小天才科技有限公司 Method and equipment for actively pushing information
CN103744971A (en) * 2014-01-10 2014-04-23 广东小天才科技有限公司 Method and equipment for actively pushing information
CN104933429A (en) * 2015-06-01 2015-09-23 深圳市诺比邻科技有限公司 Method and device for extracting information from image
CN106845473B (en) * 2015-12-03 2020-06-02 富士通株式会社 Method and device for determining whether image is image with address information
CN106845473A (en) * 2015-12-03 2017-06-13 富士通株式会社 For determine image whether be the image with address information method and apparatus
CN106127118A (en) * 2016-06-15 2016-11-16 珠海迈科智能科技股份有限公司 A kind of English word recognition methods and device
CN107784301B (en) * 2016-08-31 2021-06-11 百度在线网络技术(北京)有限公司 Method and device for recognizing character area in image
CN107784301A (en) * 2016-08-31 2018-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for identifying character area in image
CN106650714A (en) * 2016-10-08 2017-05-10 迪堡金融设备有限公司 Paper note serial number identification method and apparatus
CN107301414A (en) * 2017-06-23 2017-10-27 厦门商集企业咨询有限责任公司 Chinese positioning, segmentation and recognition methods in a kind of natural scene image
CN107301414B (en) * 2017-06-23 2020-07-07 厦门商集网络科技有限责任公司 Chinese positioning, segmenting and identifying method in natural scene image
CN109389115A (en) * 2017-08-11 2019-02-26 腾讯科技(上海)有限公司 Text recognition method, device, storage medium and computer equipment
CN109389139A (en) * 2017-08-11 2019-02-26 中国农业大学 A kind of locust method of counting and device
CN109389115B (en) * 2017-08-11 2023-05-23 腾讯科技(上海)有限公司 Text recognition method, device, storage medium and computer equipment
WO2019085971A1 (en) * 2017-11-03 2019-05-09 腾讯科技(深圳)有限公司 Method and apparatus for positioning text over image, electronic device, and storage medium
US11087168B2 (en) 2017-11-03 2021-08-10 Tencent Technology (Shenzhen) Company Ltd Method and apparatus for positioning text over image, electronic apparatus, and storage medium
CN107920272A (en) * 2017-11-14 2018-04-17 维沃移动通信有限公司 A kind of barrage screening technique, device and mobile terminal
CN110532833A (en) * 2018-05-23 2019-12-03 北京国双科技有限公司 A kind of video analysis method and device
CN109508406A (en) * 2018-12-12 2019-03-22 北京奇艺世纪科技有限公司 A kind of information processing method, device and computer readable storage medium
CN109508406B (en) * 2018-12-12 2020-11-13 北京奇艺世纪科技有限公司 Information processing method and device and computer readable storage medium
CN110032348B (en) * 2019-03-21 2022-05-24 北京空间飞行器总体设计部 Character display method, device and medium
CN110032348A (en) * 2019-03-21 2019-07-19 北京空间飞行器总体设计部 A kind of character display method, device, medium
CN110147724A (en) * 2019-04-11 2019-08-20 北京百度网讯科技有限公司 For detecting text filed method, apparatus, equipment and medium in video
CN110147724B (en) * 2019-04-11 2022-07-01 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in video
CN110188762A (en) * 2019-04-23 2019-08-30 山东大学 Chinese and English mixing merchant store fronts title recognition methods, system, equipment and medium
CN110188762B (en) * 2019-04-23 2021-02-05 山东大学 Chinese-English mixed merchant store name identification method, system, equipment and medium
CN110717492A (en) * 2019-10-16 2020-01-21 电子科技大学 Method for correcting direction of character string in drawing based on joint features
CN110717492B (en) * 2019-10-16 2022-06-21 电子科技大学 Method for correcting direction of character string in drawing based on joint features
CN112749599A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Image enhancement method and device and server
CN111310441A (en) * 2020-01-20 2020-06-19 上海眼控科技股份有限公司 Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition
CN111291794A (en) * 2020-01-21 2020-06-16 上海眼控科技股份有限公司 Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112101317A (en) * 2020-11-17 2020-12-18 深圳壹账通智能科技有限公司 Page direction identification method, device, equipment and computer readable storage medium
CN112418215A (en) * 2020-11-17 2021-02-26 峰米(北京)科技有限公司 Video classification identification method and device, storage medium and equipment
CN112101324A (en) * 2020-11-18 2020-12-18 鹏城实验室 Multi-view image coexisting character detection method, equipment and computer storage medium
CN112633343A (en) * 2020-12-16 2021-04-09 国网江苏省电力有限公司检修分公司 Power equipment terminal strip wiring checking method and device
CN112633343B (en) * 2020-12-16 2024-04-19 国网江苏省电力有限公司检修分公司 Method and device for checking wiring of power equipment terminal strip
WO2024139300A1 (en) * 2022-12-30 2024-07-04 成都云天励飞技术有限公司 Video text processing method and apparatus, and electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN101777124A (en) Method for extracting video text message and device thereof
CN107748888B (en) A kind of image text row detection method and device
CN102332096B (en) Video caption text extraction and identification method
CN104951784B (en) A kind of vehicle is unlicensed and license plate shading real-time detection method
CN101510258B (en) Certificate verification method, system and certificate verification terminal
CN103761531B (en) The sparse coding license plate character recognition method of Shape-based interpolation contour feature
CN109255350B (en) New energy license plate detection method based on video monitoring
Kumar et al. Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition
CN103034848B (en) A kind of recognition methods of form types
CN108596166A (en) A kind of container number identification method based on convolutional neural networks classification
CN102629322B (en) Character feature extraction method based on stroke shape of boundary point and application thereof
Sheikh et al. Traffic sign detection and classification using colour feature and neural network
CN103679678B (en) A kind of semi-automatic splicing restored method of rectangle character features a scrap of paper
CN106875546A (en) A kind of recognition methods of VAT invoice
CN103903018A (en) Method and system for positioning license plate in complex scene
CN102663377A (en) Character recognition method based on template matching
CN103902981A (en) Method and system for identifying license plate characters based on character fusion features
CN103914680A (en) Character image jet-printing, recognition and calibration system and method
AU2009281901A1 (en) Segmenting printed media pages into articles
Roy et al. Wavelet-gradient-fusion for video text binarization
CN103336961A (en) Interactive natural scene text detection method
Garz et al. A binarization-free clustering approach to segment curved text lines in historical manuscripts
CN104834891A (en) Method and system for filtering Chinese character image type spam
Garlapati et al. A system for handwritten and printed text classification
CN103049749A (en) Method for re-recognizing human body under grid shielding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Zhou Jingchao

Inventor after: Miao Guangyi

Inventor after: Bao Dongshan

Inventor before: Zhou Jingchao

Inventor before: Miao Guangyi

Inventor before: Bao Dongshan

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100714