CN110532537A

CN110532537A - A method for multi-stage text segmentation based on binary thresholding and projection

Info

Publication number: CN110532537A
Application number: CN201910763993.0A
Authority: CN
Inventors: 罗胜
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-12-03

Abstract

The invention discloses a method for multi-stage text segmentation based on binary thresholding and projection. Text fragments are first detected by binary thresholding, the text is then located precisely by projection, and finally the residual image is searched for less conspicuous text that is easily missed. Large characters are processed first and small characters afterwards; in each iteration, once part of the text has been processed, the detected text is erased from the image, which simplifies subsequent processing. The invention combines the advantages of binary thresholding and projection and can accurately segment layouts in which large and small characters are mixed across multiple rows.

Description

A method for multi-stage text segmentation based on binary thresholding and projection

Technical field

The present invention relates to the technical field of character recognition, and in particular to a method for multi-stage text segmentation based on binary thresholding and projection.

Background technique

In typesetting, large and small characters are often mixed across multiple rows, especially when mathematical formulas are laid out. Text segmentation commonly uses binary thresholding and projection. Binary thresholding decomposes the image into discrete foreground and background, but the threshold is hard to choose, and errors such as several characters sticking together or a single character breaking into several pieces are common; projection cannot separate text arranged in multiple rows.

Summary of the invention

To solve the above problems, the present invention provides a method for multi-stage text segmentation based on binary thresholding and projection. Text is first detected by thresholding, large characters are then processed, and small characters are processed last. The method combines the advantages of binary thresholding and projection and can accurately segment layouts in which large and small characters are mixed across multiple rows.

To achieve the above object, the technical scheme adopted by the invention is as follows:

A method for multi-stage text segmentation based on binary thresholding and projection, comprising the following steps:

S1. Compute the binarization threshold of the image using Otsu's method (the maximum between-class variance method) and convert the image into a binary (black-and-white) image, in which white is the foreground text and black is the background;
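A minimal sketch of step S1 in plain Python, assuming an 8-bit grayscale image stored as a list of rows. The function names and the `dark_text` flag are illustrative and do not come from the patent; the flag covers the common case of dark ink on light paper, so that text ends up white (1) as S1 requires.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0
    for t in range(256):
        w_low += hist[t]                 # pixels with gray level <= t
        sum_low += t * hist[t]
        w_high = total - w_low           # pixels with gray level > t
        if w_low == 0:
            continue
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, dark_text=True):
    """Map gray levels to 1 (white, foreground text) and 0 (black, background)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if (p <= t) == dark_text else 0 for p in row] for row in image]
```

In practice an image library would normally supply this step; the loop is written out here only to make the between-class variance criterion visible.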

S2. Take the foreground regions whose size, aspect ratio, and fill ratio fall within plausible ranges as candidate characters and add them to character set T;
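The filter in step S2 could look roughly like the sketch below. The `Region` type and every numeric limit are hypothetical placeholders, since the patent does not state the ranges; fill ratio is taken here to mean foreground pixels divided by bounding-box area.

```python
from dataclasses import dataclass


@dataclass
class Region:
    left: int
    top: int
    right: int        # inclusive
    bottom: int       # inclusive
    pixel_count: int  # foreground pixels inside the bounding box

    @property
    def width(self):
        return self.right - self.left + 1

    @property
    def height(self):
        return self.bottom - self.top + 1


def is_candidate(r, size=(4, 200), aspect=(0.2, 5.0), fill=(0.1, 0.95)):
    """True if size, aspect ratio and fill ratio all lie in the assumed ranges."""
    longest_side = max(r.width, r.height)
    aspect_ratio = r.width / r.height
    fill_ratio = r.pixel_count / (r.width * r.height)
    return (size[0] <= longest_side <= size[1]
            and aspect[0] <= aspect_ratio <= aspect[1]
            and fill[0] <= fill_ratio <= fill[1])
```

Character set T would then be the list of regions for which `is_candidate` returns true.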

S3. Sort all candidate characters by height in descending order and group them into K classes using a density-based clustering algorithm;
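The patent does not name the density-based algorithm used in S3. As one possible stand-in, the sketch below chains sorted heights whose neighbouring gap is at most `eps`; in one dimension this gives the same grouping as DBSCAN with a minimum cluster size of one. The function name and the `eps` value are assumptions.

```python
def cluster_by_height(heights, eps=2):
    """Sort indices by height (descending) and label height classes.

    Returns (order, labels): order lists character indices from tallest to
    shortest; labels[i] is the class of character i, with class 0 the tallest.
    """
    order = sorted(range(len(heights)), key=lambda i: heights[i], reverse=True)
    labels = [0] * len(heights)
    label = 0
    for prev, cur in zip(order, order[1:]):
        if heights[prev] - heights[cur] > eps:   # gap too large: start a new class
            label += 1
        labels[cur] = label
    return order, labels
```

The number of classes K then falls out of the data rather than being fixed in advance.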

S4. Process all candidate characters, in that descending order, according to the following steps:

S4.1. For the current character T_i, take its top position U_p and bottom position D_own as the upper and lower bounds of the current temporary row. Within the same class K_j, find all characters whose centroid is close to that of T_i and whose top and bottom positions both lie between U_p and D_own, and add them to character set N.

Here N_p is any character in character set N, K_j is the j-th character class, U_Np and D_Np are the top and bottom positions of character N_p, M_Np and M_Ti are the centroids of characters N_p and T_i, and Th₀ and Th₁ are the tolerances for the top and bottom positions and for the centroid, respectively. Record the minimum character width W_min in character set N;

S4.2. Among the characters of the other classes, find all characters lying between U_Np and D_Np; these form character set M.

Here M_q is any character in character set M, and U_Mq and D_Mq are the top and bottom positions of character M_q;

S4.3. Accumulate all pixels of the image between U_p and D_own along the vertical direction to obtain a horizontal projection;
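The projection in S4.3 is a column-wise sum over the rows of the temporary row. A minimal sketch, assuming the binary image is a list of rows of 0/1 values and that both bounds are inclusive row indices (the function name is illustrative):

```python
def horizontal_projection(binary, top, bottom):
    """Sum the foreground pixels of each column over rows top..bottom (inclusive)."""
    rows = binary[top:bottom + 1]
    return [sum(column) for column in zip(*rows)]
```

Each entry of the result counts the foreground pixels in one image column, so gaps between characters show up as low values.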

S4.4. After excluding the zero-valued data at the left and right ends of the horizontal projection, find the maximum value S_max and the minimum value S_min of the projection over the middle part that contains characters;

S4.5. Shrink the position (L_eft, R_ight) of every character in character set N by one pixel on each side to (L_eft+1, R_ight-1), and set all values of the horizontal projection within (L_eft+1, R_ight-1) to S_max; at the same time, set the positions of all characters in character set M to S_min;
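Steps S4.4 and S4.5 can be sketched as follows. The function names are illustrative, and two details are assumptions: character extents are passed as inclusive (left, right) column pairs, and characters of set M are overwritten across their full extent.

```python
def trim_zeros(proj):
    """Return (start, end) of the span left after dropping zero columns at both ends."""
    start, end = 0, len(proj)
    while start < end and proj[start] == 0:
        start += 1
    while end > start and proj[end - 1] == 0:
        end -= 1
    return start, end  # end is exclusive


def mark_known_characters(proj, set_n, set_m):
    """Overwrite the projection under known characters (steps S4.4 and S4.5)."""
    start, end = trim_zeros(proj)
    if start == end:               # projection is all zeros, nothing to mark
        return list(proj)
    s_max = max(proj[start:end])
    s_min = min(proj[start:end])
    out = list(proj)
    for left, right in set_n:      # same-class characters, shrunk by one pixel per side
        for x in range(left + 1, right):
            out[x] = s_max
    for left, right in set_m:      # characters of other classes
        for x in range(left, right + 1):
            out[x] = s_min
    return out
```

Pinning known characters to S_max keeps later minima searches from splitting them, while pinning other-class characters to S_min makes them read as gaps.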

S4.6. In the horizontal projection, after excluding the zero-valued data at the left and right ends, find all minima at positions within the part that contains characters;

S4.7. Set the region around each minimum as a valley region and find the left and right boundaries of each valley region; the region between two valley regions is a peak region. Estimate the likelihood that each valley region is an inter-character gap and that each peak region is a character unit. Peak regions whose likelihood exceeds an empirical threshold are stored in the character-unit array, and valley regions whose likelihood exceeds the threshold are stored in the inter-character-gap array; peak regions whose likelihood falls below the empirical threshold are merged into the valley regions to their left and right;

S4.8. Divide the width of each character unit by the mean character width from step S4.1. Character units wider than a preset multiple of the mean character width, and units whose inter-character gap is wider than the maximum character width, are accepted directly as detected characters; all other units are treated as doubtful units. Runs of consecutive detected characters with no doubtful unit between them form a text region, and the average character width W_c and the average inter-character gap W_b of each text region are computed;

S4.9. Treat consecutive doubtful units as a doubtful region. The L doubtful units U_i, together with the detected character immediately before the doubtful region and the detected character immediately after it, form a doubtful unit set U of L+2 units in total. Construct an (L+2) × (L+2) matrix from these L+2 units. A point (U_h, U_e) in the matrix (U_h ≤ U_e, e-h ≤ 4) indicates that the range starting at the left of unit U_h and ending at the right of unit U_e forms one character, and the value P_he at point (U_h, U_e) is the character-forming cost of that range;

P_he = λ₁(W_he - W_c)/(W_he + W_c) + λ₂(W_hb - W_b)/(W_hb + W_b) + λ₃(W_eb - W_b)/(W_he + W_b);

In the formula, λ₁-λ₃ are weighting coefficients; W_he is the width spanned by units U_h to U_e, i.e., the distance from the left edge of unit U_h to the right edge of unit U_e; W_hb is the width of the gap to the left of unit U_h; and W_eb is the width of the gap to the right of unit U_e;
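The character-forming cost can be transcribed directly into code. The sketch below follows the formula exactly as printed, including the W_he + W_b denominator of the third term; the function name and the default weights of 1.0 are assumptions.

```python
def char_forming_cost(w_he, w_hb, w_eb, w_c, w_b, lambdas=(1.0, 1.0, 1.0)):
    """P_he for the span U_h..U_e, given region averages W_c and W_b."""
    l1, l2, l3 = lambdas
    return (l1 * (w_he - w_c) / (w_he + w_c)
            + l2 * (w_hb - w_b) / (w_hb + w_b)
            + l3 * (w_eb - w_b) / (w_he + w_b))  # denominator as printed in the formula
```

A span whose width and neighbouring gaps equal the region averages has cost 0, and each term moves away from 0 as the span departs from those averages.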

Normalize the character-forming costs in the matrix, i.e., divide them by the maximum character-forming cost in the matrix, then perform dynamic programming within the band of width 4 of the upper-right triangular matrix to find the optimal scheme. The optimal scheme minimizes the average character-forming cost together with the variance of the character widths and the variance of the inter-character gap widths, as in the following formula:

Cost = λ₄ mean(P_he) + λ₅ δ_Wt + λ₆ δ_Wb

In the formula, λ₄-λ₆ are weighting coefficients, mean(P_he) is the average character-forming cost of all points in the scheme, δ_Wt is the variance of all character widths in the scheme, and δ_Wb is the variance of all inter-character gap widths in the scheme;
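A small sketch of the scheme cost, assuming population variance (the patent does not say which variance is meant) and unit weights by default. The function and parameter names are illustrative.

```python
from statistics import mean, pvariance


def scheme_cost(point_costs, char_widths, gap_widths, lambdas=(1.0, 1.0, 1.0)):
    """Cost = l4 * mean(P_he) + l5 * var(char widths) + l6 * var(gap widths)."""
    l4, l5, l6 = lambdas
    var_widths = pvariance(char_widths) if char_widths else 0.0
    var_gaps = pvariance(gap_widths) if gap_widths else 0.0  # a one-character scheme has no gaps
    return l4 * mean(point_costs) + l5 * var_widths + l6 * var_gaps
```

Because the variances are in squared pixels while the normalized P_he lies at most at 1, the weights would in practice need to balance the two scales.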

S5. Check whether any other text remains in the image; if remaining text L exists, process it as follows:

S5.1. Take a character T_l from character set L and determine from its height whether T_l belongs to an existing character class. If it does, place T_l in the corresponding class; if it belongs to no existing class, increase the number of classes by 1 and place T_l in the new class;

S5.2. Apply step S5.1 iteratively to all characters in character set L until all are processed.
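Step S5 can be sketched as a single pass over the leftover characters. Representing each class by one height and matching within a tolerance `tol` are assumptions, since the patent only says membership is judged by character height.

```python
def assign_to_classes(heights, class_heights, tol=2):
    """Assign each leftover character to a height class, creating classes as needed.

    Returns (labels, classes): labels[i] is the class index of character i, and
    classes is the possibly extended list of representative class heights.
    """
    classes = list(class_heights)
    labels = []
    for h in heights:
        match = next((k for k, ch in enumerate(classes) if abs(h - ch) <= tol), None)
        if match is None:          # no existing class fits: class count grows by 1
            classes.append(h)
            match = len(classes) - 1
        labels.append(match)
    return labels, classes
```

A class created for one leftover character is immediately available to the characters processed after it.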

Further, in step S4.7, a peak region has a width greater than W_min and a mean height that exceeds the mean height of the valley regions by H_tmin.

Further, the dynamic programming proceeds as follows:

(1) Generate seeds: take the 4 points of the first row as 4 seeds, i.e., 4 schemes;

(2) Grow schemes: each scheme grows downward. When growing downward from point (U_h, U_e), one of the 4 points in row U_(e+1) is selected and added to the scheme. With n schemes, each having 4 possible choices when growing downward, one growth step turns n schemes into 4n schemes;

(3) Prune schemes: compute the cost of the 4n schemes and keep the m schemes with the lowest cost as seed schemes. Pruning starts only once the number of points in a scheme exceeds 3, which improves the accuracy of the algorithm;

(4) Repeat steps (2) and (3) until every scheme reaches the last unit in the doubtful unit set U;

(5) Select the scheme with the lowest cost as the optimal scheme, and merge units into characters according to the strategy given by the optimal scheme;

(6) For the characters found, set all corresponding character pixels in the image to background, add the characters found to character set N, and then add the updated character set N to character class K_j.
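Steps (1) to (5) amount to a beam search over the band of the cost matrix. The sketch below is one reading of them, under several assumptions: units are given as inclusive (left, right) column pairs, the normalized costs `P[h][e]` are precomputed, a character spans at most `max_span` = 4 units, population variance is used, and all names are hypothetical. Step (6), erasing the found characters from the image, is left out.

```python
from statistics import mean, pvariance


def beam_search(units, P, m=8, max_span=4, lambdas=(1.0, 1.0, 1.0)):
    """Return the lowest-cost list of (h, e) spans covering all units (at least one)."""
    n = len(units)
    l4, l5, l6 = lambdas

    def cost(scheme):
        widths = [units[e][1] - units[h][0] + 1 for h, e in scheme]
        gaps = [units[b[0]][0] - units[a[1]][1] - 1 for a, b in zip(scheme, scheme[1:])]
        return (l4 * mean(P[h][e] for h, e in scheme)
                + l5 * pvariance(widths)
                + l6 * (pvariance(gaps) if gaps else 0.0))

    def row_points(h):
        # Points of row h inside the band: spans that start at unit h.
        return [(h, e) for e in range(h, min(h + max_span, n))]

    beams = [[p] for p in row_points(0)]            # (1) seeds from the first row
    finished = []
    while beams:
        finished += [s for s in beams if s[-1][1] == n - 1]
        active = [s for s in beams if s[-1][1] < n - 1]
        grown = [s + [p] for s in active            # (2) grow into row e + 1
                 for p in row_points(s[-1][1] + 1)]
        if grown and len(grown[0]) > 3:             # (3) prune once schemes exceed 3 points
            grown = sorted(grown, key=cost)[:m]
        beams = grown                               # (4) repeat until all schemes finish
    return min(finished, key=cost)                  # (5) lowest-cost scheme
```

For example, with four half-character units and low costs only for the pairs (0, 1) and (2, 3), the search merges the units into two characters:

```python
units = [(0, 4), (6, 10), (14, 18), (20, 24)]
P = [[1.0] * 4 for _ in range(4)]
P[0][1] = P[2][3] = 0.1
best = beam_search(units, P)   # [(0, 1), (2, 3)]
```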

The invention has the following advantages:

The invention combines the advantages of binary thresholding and projection and can accurately segment layouts in which large and small characters are mixed across multiple rows.

Brief description of the drawings

Fig. 1 is a flowchart of a method for multi-stage text segmentation based on binary thresholding and projection according to an embodiment of the present invention.

Fig. 2 shows the character-forming cost matrix constructed in an embodiment of the present invention.

Fig. 3 shows the dynamic programming process in the character-forming cost matrix in an embodiment of the present invention.

Specific embodiment

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention.

As shown in Fig. 1, an embodiment of the present invention provides a method for multi-stage text segmentation based on binary thresholding and projection, comprising the following steps:

S2. Take the foreground regions whose aspect ratio and size fall within the plausible range for text as candidate characters and add them to character set T;

S4.1. For the current character T_i, take its top position U_p and bottom position D_own as the upper and lower bounds of the current temporary row. Within the same class K_j, find all characters N whose centroid is close to that of T_i and whose top and bottom positions both lie between U_p and D_own.

Here N_p is any character in character set N, K_j is the j-th character class, U_Np and D_Np are the top and bottom positions of character N_p, M_Np and M_Ti are the centroids of characters N_p and T_i, and Th₀ and Th₁ are the tolerances for the top and bottom positions and for the centroid, respectively. Record the minimum character width W_min in character set N;

S4.2. Among the characters of the other classes, find all characters M lying between U_Np and D_Np;

S4.7. Set the region around each minimum as a valley region and find the left and right boundaries of each valley region; the region between two valley regions is a peak region. Estimate the likelihood that each valley region is an inter-character gap and that each peak region is a character unit. Peak regions whose likelihood exceeds an empirical threshold are stored in the character-unit array, and valley regions whose likelihood exceeds the threshold are stored in the inter-character-gap array; peak regions whose likelihood falls below the empirical threshold are merged into the valley regions to their left and right. A peak region has a width greater than W_min and a mean height that exceeds the mean height of the valley regions by H_tmin.

P_he=λ₁(W_he-W_c)/(W_he+W_c)+λ₂(W_hb-W_b)/(W_hb+W_b)+λ₃(W_eb-W_b)/(W_he+W_b)

In the formula, λ₁-λ₃ are weighting coefficients; W_he is the width spanned by units U_h to U_e, i.e., the distance from the left edge of unit U_h to the right edge of unit U_e; W_hb is the width of the gap to the left of unit U_h; and W_eb is the width of the gap to the right of unit U_e. The formula expresses how similar the formed character is to the characters on its left and right.

Normalize the character-forming costs in the matrix (divide them by the maximum character-forming cost in the matrix). For example, the point marked Δ in Fig. 2 represents the likelihood that U₄ and U₅ merge into one character. If U_h = U_e, unit U_h cannot be merged with the unit to its right and forms a character on its own.

Since rows are character start units and columns are character end units, only the upper-right triangular part of the matrix is used; and since a character is divided into at most four units in the horizontal direction (e-h ≤ 4), only a band of width 4 along the diagonal of this upper-right triangular matrix is used.

Dynamic programming is performed within the band of width 4 of the upper-right triangular matrix to find the optimal scheme. The optimal scheme minimizes the average character-forming cost together with the variance of the character widths and the variance of the inter-character gap widths, as in the following formula:

Cost=λ₄mean(P_he)+λ₅δ_Wt+λ₆δ_Wb

In the formula, λ₄-λ₆ are weighting coefficients, mean(P_he) is the average character-forming cost of all points in the scheme, δ_Wt is the variance of all character widths in the scheme, and δ_Wb is the variance of all inter-character gap widths in the scheme. The dynamic programming proceeds as follows:

(1) Generate seeds: take the 4 points of the first row as 4 seeds, i.e., 4 schemes (see Fig. 3(a));

(2) Grow schemes: each scheme grows downward. For example, when growing downward from point (U_h, U_e), one of the 4 points in row U_(e+1) is selected and added to the scheme. With n schemes, each having 4 possible choices when growing downward, one growth step turns n schemes into 4n schemes;

(6) For the characters found, set all corresponding character pixels in the image to background, add the characters found to character set N, and then add the updated character set N to character class K_j;

S5.2. Apply step S5.1 iteratively to all characters in character set L until all are processed.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of this application and the features in the embodiments can be combined with one another in any way.

Claims

1. A method for multi-stage text segmentation based on binary thresholding and projection, characterized in that:

S1. Compute the binarization threshold of the image using Otsu's method and convert the image into a binary image, in which white is the foreground text and black is the background;

S2. Take the foreground regions whose size, aspect ratio, and fill ratio fall within plausible ranges as candidate characters and add them to character set T;

S4.1. For the current character T_i, take its top position U_p and bottom position D_own as the upper and lower bounds of the current temporary row. Within the same class K_j, find all characters whose centroid is close to that of T_i and whose top and bottom positions both lie between U_p and D_own, and add them to character set N.

Here N_p is any character in character set N, K_j is the j-th character class, U_Np and D_Np are the top and bottom positions of character N_p, M_Np and M_Ti are the centroids of characters N_p and T_i, and Th₀ and Th₁ are the tolerances for the top and bottom positions and for the centroid, respectively. Record the minimum character width W_min in character set N;

S4.2. Among the characters of the other classes, find all characters M lying between U_Np and D_Np;

S4.4. After excluding the zero-valued data at the left and right ends of the horizontal projection, find the maximum value S_max and the minimum value S_min of the projection over the middle part that contains characters;

S4.5. Shrink the position (L_eft, R_ight) of every character in character set N by one pixel on each side to (L_eft+1, R_ight-1), and set all values of the horizontal projection within (L_eft+1, R_ight-1) to S_max; at the same time, set the positions of all characters in character set M to S_min;

S4.6. In the horizontal projection, after excluding the zero-valued data at the left and right ends, find all minima at positions within the part that contains characters;

S4.7. Set the region around each minimum as a valley region and find the left and right boundaries of each valley region; the region between two valley regions is a peak region. Estimate the likelihood that each valley region is an inter-character gap and that each peak region is a character unit. Peak regions whose likelihood exceeds an empirical threshold are stored in the character-unit array, and valley regions whose likelihood exceeds the threshold are stored in the inter-character-gap array; peak regions whose likelihood falls below the empirical threshold are merged into the valley regions to their left and right;

S4.8. Divide the width of each character unit by the mean character width from step S4.1. Character units wider than a preset multiple of the mean character width, and units whose inter-character gap is wider than the maximum character width, are accepted directly as detected characters; all other units are treated as doubtful units. Runs of consecutive detected characters with no doubtful unit between them form a text region, and the average character width W_c and the average inter-character gap W_b of each text region are computed;

S4.9. Treat consecutive doubtful units as a doubtful region. The L doubtful units U_i, together with the detected character immediately before the doubtful region and the detected character immediately after it, form a doubtful unit set U of L+2 units in total. Construct an (L+2) × (L+2) matrix from these L+2 units. A point (U_h, U_e) in the matrix (U_h ≤ U_e, e-h ≤ 4) indicates that the range starting at the left of unit U_h and ending at the right of unit U_e forms one character, and the value P_he at point (U_h, U_e) is the character-forming cost of that range;

In the formula, λ₁-λ₃ are weighting coefficients; W_he is the width spanned by units U_h to U_e, i.e., the distance from the left edge of unit U_h to the right edge of unit U_e; W_hb is the width of the gap to the left of unit U_h; and W_eb is the width of the gap to the right of unit U_e;

Normalize the character-forming costs in the matrix, i.e., divide them by the maximum character-forming cost in the matrix, then perform dynamic programming within the band of width 4 of the upper-right triangular matrix to find the optimal scheme. The optimal scheme minimizes the average character-forming cost together with the variance of the character widths and the variance of the inter-character gap widths, as in the following formula:

Cost=λ₄mean(P_he)+λ₅δ_Wt+λ₆δ_Wb

In the formula, λ₄-λ₆ are weighting coefficients, mean(P_he) is the average character-forming cost of all points in the scheme, δ_Wt is the variance of all character widths in the scheme, and δ_Wb is the variance of all inter-character gap widths in the scheme;

S5.1. Take a character T_l from character set L and determine from its height whether T_l belongs to an existing character class. If it does, place T_l in the corresponding class; if it belongs to no existing class, increase the number of classes by 1 and place T_l in the new class;

Further, in step S4.7, a peak region has a width greater than W_min and a mean height that exceeds the mean height of the valley regions by H_tmin.

Further, the dynamic programming proceeds as follows:

(2) Grow schemes: each scheme grows downward. When growing downward from point (U_h, U_e), one of the 4 points in row U_(e+1) is selected and added to the scheme. With n schemes, each having 4 possible choices when growing downward, one growth step turns n schemes into 4n schemes;

(3) Prune schemes: compute the cost of the 4n schemes and keep the m schemes with the lowest cost as seed schemes. Pruning starts only once the number of points in a scheme exceeds 3, which improves the accuracy of the algorithm;

(6) For the characters found, set all corresponding character pixels in the image to background, add the characters found to character set N, and then add the updated character set N to character class K_j.

2. The method for multi-stage text segmentation based on binary thresholding and projection according to claim 1, characterized in that: in step S4.7, a peak region has a width greater than W_min and a mean height that exceeds the mean height of the valley regions by H_tmin.

3. The method for multi-stage text segmentation based on binary thresholding and projection according to claim 1, characterized in that: the dynamic programming proceeds as follows: