CN103839060B - Method and device for merging individual character regions - Google Patents

Info

Publication number
CN103839060B
Authority
CN
China
Prior art keywords
combined region
region
text line
connected component
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210486972.7A
Other languages
Chinese (zh)
Other versions
CN103839060A (en
Inventor
郑琪
王永攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710112654.7A (CN107122778B)
Priority to CN201210486972.7A (CN103839060B)
Publication of CN103839060A
Application granted
Publication of CN103839060B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/158: Segmentation of character regions using character size, text spacings or pitch estimation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Input (AREA)

Abstract

Embodiments of the invention disclose a method and device for merging individual character regions. The method includes: extracting the connected components in an image and merging the connected components to obtain the multiple combined regions produced by the merging process; arranging the combined regions to obtain at least one text line; and counting the number of combined regions that each text line contains, retaining the maximum text line, that is, the text line containing the most combined regions, and deleting the other text lines that overlap it, wherein the combined regions contained in the maximum text line are the individual character regions. The embodiments address the inaccurate merging of the prior art.

Description

Method and device for merging individual character regions
Technical field
The present invention relates to the field of image processing, and in particular to a method and device for merging individual character regions.
Background
Character recognition in images has wide practical application, for example content recognition of scanned documents and automatic postal code recognition. With the spread of digital cameras and the development of Internet technology, more and more images are produced by manually editing captured photographs. As shown in Fig. 1, such human-edited images usually have complex background pictures and varied foreground colors and textures. To recognize the text in these complex human-edited images, the text area must first be located and segmented. The text area is the set of all individual character regions in the human-edited image, and an "individual character" here means a single character, including Arabic numerals and the characters of various languages, such as Chinese characters or Latin letters.
The key step in locating and segmenting the text area is determining each individual character region in the human-edited image. Among the types of individual character, a Chinese character, unlike a Latin letter, is composed of several radicals (in graph-theoretic terms, the radicals of one Chinese character are several mutually disconnected connected components) and so has a more complex structure. Determining the region of a Chinese character therefore requires combining the mutually disconnected connected components that belong to that character, that is, merging them. Korean and Japanese character regions need the same merging process.
Existing methods for merging individual character regions generally analyze the spacing and positional relationships between connected components, treat all connected components that satisfy a specific distance threshold and a specific positional relationship as belonging to one individual character region, and merge them. Merging stops when the number of merged connected components reaches a specific quantity threshold.
In the course of making the invention, however, the inventors found that existing methods have at least the following technical problem. The number of connected components differs from one individual character region to another, and the spacing between different individual character regions also varies widely. Whatever spacing threshold or quantity threshold is chosen, merging is therefore prone either to over-segmentation, in which connected components that belong to one individual character region are merged into several individual character regions, or to over-merging, in which connected components that do not belong to an individual character region are merged into it.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a method and device for merging individual character regions that address the inaccurate merging of the prior art.
The embodiments of the present invention disclose the following technical solutions:
A method for merging individual character regions, including:
extracting the connected components in an image and merging the connected components to obtain the multiple combined regions produced by the merging process;
arranging the combined regions to obtain at least one text line;
counting the number of combined regions that each text line contains, retaining the maximum text line, that is, the text line containing the most combined regions, and deleting the other text lines that overlap it, wherein the combined regions contained in the maximum text line are the individual character regions.
A device for merging individual character regions, including:
a merging module, configured to extract the connected components in an image and merge the connected components to obtain the multiple combined regions produced by the merging process;
a text line arrangement analysis module, configured to arrange the combined regions to obtain at least one text line;
a first selection module, configured to count the number of combined regions that each text line contains, retain the maximum text line, that is, the text line containing the most combined regions, and delete the other text lines that overlap it, wherein the combined regions contained in the maximum text line are the individual character regions.
As the above embodiments show, individual characters in a human-edited image are usually arranged regularly in rows. If an individual character region is merged correctly, it should therefore be similar in size to the surrounding individual character regions, be neatly aligned with them, and be able to form a long text line with them. Conversely, if an individual character region is merged incorrectly, through over-segmentation or over-merging, it is very unlikely to form a long text line with the surrounding individual character regions. The embodiments of the present invention therefore perform text line arrangement analysis on all combined regions to obtain text lines and select from them the text line with the largest number of connected components, that is, the longest text line. The combined regions in this longest text line are the correctly merged individual character regions, which solves the inaccurate merging of the prior art.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a human-edited image;
Fig. 2 is a flowchart of a method for merging individual character regions disclosed in Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of connected components in graph theory;
Fig. 4 is a flowchart of a method for merging individual character regions disclosed in Embodiment 2 of the present invention;
Fig. 5 is a schematic diagram of combined regions in an over-segmented state produced at intermediate stages of merging;
Fig. 6 is a flowchart of a method for merging individual character regions disclosed in Embodiment 3 of the present invention;
Fig. 7 is a structural diagram of a device for merging individual character regions disclosed in Embodiment 4 of the present invention;
Fig. 8 is a schematic structural diagram of the text line arrangement analysis module of the present invention;
Fig. 9 is a schematic structural diagram of the merging module of the present invention.
Detailed description
Embodiments of the present invention provide a method and device for merging individual character regions. Individual characters in a human-edited image are usually arranged regularly in rows. If an individual character region is merged correctly, it should therefore be similar in size to the surrounding individual character regions, be neatly aligned with them, and be able to form a long text line with them. Conversely, if an individual character region is merged incorrectly, through over-segmentation or over-merging, it is very unlikely to form a long text line with the surrounding individual character regions. The embodiments of the present invention therefore perform text line arrangement analysis on all combined regions to obtain text lines and select from them the text line with the largest number of connected components, that is, the longest text line. The combined regions in this longest text line are the correctly merged individual character regions.
To make the above objects, features and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the drawings.
Embodiment 1
Refer to Fig. 2, which is a flowchart of a method for merging individual character regions disclosed in Embodiment 1 of the present invention. The method includes the following steps:
Step 201: Extract the connected components in an image and merge the connected components to obtain the multiple combined regions produced by the merging process;
Fig. 3 is a schematic diagram of connected components in graph theory. In graph theory, a subgraph is called a connected component if a path exists between any two of its points and none of its points is connected to a point outside the subgraph. For example, in the human-edited image shown in Fig. 1, "促" and "销" in "促销" (sales promotion) are two individual character regions. The individual character region "促" contains two connected components, "亻" and "足", and the individual character region "销" contains two connected components, "钅" and "肖".
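The patent does not say how connected components are extracted from the pixels. As a minimal sketch, assuming a binarized image stored as a list of rows and 8-connectivity between foreground pixels (both assumptions, as is the function name), a breadth-first labeling could look like this:

```python
from collections import deque


def connected_components(grid):
    """Group 8-connected foreground pixels (value 1) of a binary grid.

    Returns a list of components, each a sorted list of (row, col) pixels.
    The binary grid and 8-connectivity are assumptions of this sketch.
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(sorted(pixels))
    return components
```

Each returned component corresponds to one stroke group such as "亻" or "足"; its fitted rectangle can be taken as the bounding box of its pixels.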
The embodiments of the present invention do not limit the method used to merge connected components into combined regions; any existing method in the prior art may be used.
A preferred implementation is as follows. The connected components in the human-edited image are compared pairwise, and any two connected components that satisfy the enclosure structure relation or the adjacent structure relation are merged to obtain a combined region. The connected components and the combined regions produced by each round of merging are then taken as merging objects, and the pairwise comparison is repeated, merging any two merging objects that satisfy the enclosure structure relation or the adjacent structure relation, until no further merging is possible.
For ease of description, take a human-edited image that contains 5 connected components (connected components 1 to 5) as an example. The 5 connected components are compared pairwise. Suppose connected components 1 and 2 satisfy the enclosure structure relation, as the connected components in the Chinese characters for "side", "area" and "figure" do, and connected components 3 and 4 satisfy the adjacent structure relation, as the connected components in the Chinese characters for "product", "word" and "OK" do. In the first round of merging, connected components 1 and 2 are merged into combined region 1, and connected components 3 and 4 are merged into combined region 2. Connected components 1 to 5 and combined regions 1 and 2 are then compared pairwise again. Suppose combined region 1 and connected component 5 satisfy the enclosure structure relation. In the second round of merging, connected component 5 and combined region 1 are merged into combined region 3. The process continues in this way until no further merging is possible. Finally, the combined regions produced by every round of merging are obtained: combined regions 1, 2 and 3.
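The repeated pairwise merging can be sketched as a fixed-point loop. This is an illustrative sketch, not the patent's implementation: merging objects are modeled as frozensets of connected-component ids, and `can_merge` is a hypothetical caller-supplied predicate standing in for the enclosure and adjacency tests.

```python
from itertools import combinations


def merge_until_stable(component_ids, can_merge):
    """Return every combined region produced by repeated pairwise merging.

    Each merging object is a frozenset of component ids. `can_merge(a, b)`
    is a hypothetical predicate for the enclosure/adjacency relations.
    """
    objects = {frozenset([c]) for c in component_ids}
    combined = set()
    changed = True
    while changed:
        changed = False
        # Iterate over a sorted snapshot so the set can grow during the pass.
        for a, b in combinations(sorted(objects, key=sorted), 2):
            merged = a | b
            if merged in objects or not can_merge(a, b):
                continue
            objects.add(merged)
            combined.add(merged)
            changed = True
    return combined
```

With the five-component example above, the loop yields the three combined regions {1, 2}, {3, 4} and {1, 2, 5}, including the intermediate ones, which is what the later text line analysis operates on.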
Preferably, but without limitation, the present invention may determine as follows whether two connected components, a connected component and a combined region, or two combined regions satisfy the enclosure structure relation and are to be merged:
For two connected components, determine whether the ratio of the overlapping area of their fitted rectangles to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two connected components are similar. If so, the enclosure structure relation is satisfied; otherwise it is not.
For a connected component and a combined region (both merging objects), or two combined regions (both merging objects), determine whether the ratio of the overlapping area of the fitted rectangles of the two merging objects to the area of the smaller of the two fitted rectangles is greater than the first preset multiple, and whether the colors and stroke widths of the two merging objects are similar. If so, the enclosure structure relation is satisfied; otherwise it is not.
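The enclosure test can be sketched as follows. The `Region` type, the tolerance values and the default `first_multiple` are all assumptions made for illustration; the patent leaves the threshold and the similarity measure unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    rect: tuple    # fitted rectangle as (x0, y0, x1, y1)
    color: tuple   # mean (r, g, b) color
    stroke: float  # estimated stroke width in pixels


def area(rect):
    return max(0, rect[2] - rect[0]) * max(0, rect[3] - rect[1])


def overlap_area(a, b):
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def similar_style(a, b, color_tol=30, stroke_tol=0.5):
    # Hypothetical similarity rule: per-channel color gap and relative stroke gap.
    color_close = all(abs(p - q) <= color_tol for p, q in zip(a.color, b.color))
    stroke_close = abs(a.stroke - b.stroke) <= stroke_tol * max(a.stroke, b.stroke)
    return color_close and stroke_close


def meets_enclosure(a, b, first_multiple=0.8):
    """Overlap area / smaller rectangle area must exceed first_multiple."""
    smaller = min(area(a.rect), area(b.rect))
    if smaller == 0:
        return False
    ratio = overlap_area(a.rect, b.rect) / smaller
    return ratio > first_multiple and similar_style(a, b)
```

Dividing by the smaller rectangle makes the ratio reach 1 whenever one rectangle lies fully inside the other, regardless of how large the enclosing one is.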
Preferably, but without limitation, the present invention may determine as follows whether two connected components, a connected component and a combined region, or two combined regions satisfy the adjacent structure relation and are to be merged:
For two connected components, determine whether the ratio of the sum of the widths of their fitted rectangles to the distance between their centers is greater than a second preset multiple, whether the colors and stroke widths of the two connected components are similar, and whether the ratio of the length to the width of the fitted rectangle of the merged region is less than a third preset multiple. If so, the adjacent structure relation is satisfied; otherwise it is not.
For a connected component and a combined region (both merging objects), or two combined regions (both merging objects), determine whether the ratio of the sum of the widths of the fitted rectangles of the two merging objects to the distance between their centers is greater than the second preset multiple, whether the colors and stroke widths of the two merging objects are similar, and whether the ratio of the length to the width of the fitted rectangle of the merged region is less than the third preset multiple. If so, the adjacent structure relation is satisfied; otherwise it is not.
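A sketch of the adjacency test follows, under stated assumptions: the width term is read as the sum of the two rectangle widths, the length-to-width ratio is taken as longer side over shorter side of the merged rectangle, the threshold defaults are invented, and the color and stroke comparison is reduced to a `style_close` flag supplied by the caller.

```python
import math


def center(rect):
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def meets_adjacency(a, b, style_close=True,
                    second_multiple=1.5, third_multiple=1.5):
    """a and b are fitted rectangles (x0, y0, x1, y1).

    Assumed reading: (width_a + width_b) / center distance > second_multiple,
    and the merged rectangle's long/short side ratio < third_multiple.
    """
    distance = math.dist(center(a), center(b))
    if distance == 0:
        return False  # concentric rectangles are an enclosure case instead
    width_sum = (a[2] - a[0]) + (b[2] - b[0])
    merged_w = max(a[2], b[2]) - min(a[0], b[0])
    merged_h = max(a[3], b[3]) - min(a[1], b[1])
    short_side = min(merged_w, merged_h)
    if short_side <= 0:
        return False
    aspect = max(merged_w, merged_h) / short_side
    return (style_close
            and width_sum / distance > second_multiple
            and aspect < third_multiple)
```

The aspect-ratio condition is what keeps two neighboring full characters from being joined: their merged rectangle would be roughly twice as wide as it is tall.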
It should be noted that the embodiments of the present invention do not limit the specific values of the first, second and third preset multiples. They can be determined in advance by experiment: for individual characters with an enclosure structure, compute the ratio of the overlapping area of the fitted rectangles of the connected components to the area of the smallest of those fitted rectangles, determine a mean ratio by sample statistics, and use this mean ratio as the first preset multiple. The second and third preset multiples can be determined in the same way.
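The sample-statistics step amounts to averaging measured ratios. A minimal sketch, assuming the measurements are available as (overlap area, smaller rectangle area) pairs taken from labeled enclosure-structure characters (the input format and function name are assumptions):

```python
from statistics import mean


def estimate_first_multiple(samples):
    """Average overlap/smaller-area ratio over labeled sample characters.

    `samples` is an iterable of (overlap_area, smaller_rect_area) pairs;
    this input format is an assumption of the sketch.
    """
    ratios = [overlap / smaller for overlap, smaller in samples if smaller > 0]
    if not ratios:
        raise ValueError("no usable samples")
    return mean(ratios)
```

Because the mean sits in the middle of the observed ratios, a deployed threshold might be set somewhat lower to accept most true enclosure pairs; the patent itself only states that the mean ratio is used.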
Step 202: Arrange the combined regions to obtain at least one text line;
The embodiments of the present invention may use any text line arrangement analysis method in the prior art to perform text line arrangement analysis on the combined regions obtained in step 201.
For example, the prior art includes text line arrangement analysis methods based on projection and on the Hough transform, both of which obtain text line arrangement information from statistical information about the regions. The prior art also includes text line arrangement analysis methods based on region clustering. Such methods usually define a similarity relation between regions in the same line and then use an aggregation method to cluster the regions that have the similarity relation into a group; each group forms a text line.
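As a simplified stand-in for these methods (it is not the projection, Hough-transform or clustering method itself), regions can be chained into lines by vertical overlap. The overlap rule, the `min_overlap` default and the function name are assumptions of this sketch.

```python
def group_into_lines(rects, min_overlap=0.5):
    """Chain fitted rectangles (x0, y0, x1, y1) into horizontal text lines.

    A rectangle joins a line when its vertical overlap with the line's last
    rectangle is at least `min_overlap` of the shorter of the two heights.
    """
    lines = []
    for rect in sorted(rects, key=lambda r: r[0]):  # left to right
        for line in lines:
            last = line[-1]
            top = max(rect[1], last[1])
            bottom = min(rect[3], last[3])
            shorter = min(rect[3] - rect[1], last[3] - last[1])
            if shorter > 0 and (bottom - top) / shorter >= min_overlap:
                line.append(rect)
                break
        else:
            lines.append([rect])  # no existing line fits: start a new one
    return lines
```

A real system would also compare sizes and spacing, since the patent relies on correctly merged regions being similar in size to their neighbors.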
Step 203: Count the number of combined regions that each text line contains, retain the maximum text line, that is, the text line containing the most combined regions, and delete the other text lines that overlap it, wherein the combined regions contained in the maximum text line are the individual character regions.
After all text lines have been obtained in step 202, the number of combined regions in each text line is counted and the text line with the largest number, that is, the longest text line, is found. The combined regions in this longest text line are the correctly merged individual character regions. At the same time, the text lines that overlap the longest text line are deleted; the combined regions in those overlapping text lines are individual character regions that were merged incorrectly through over-segmentation or over-merging.
Preferably, the method may further include:
Step 204: If text lines remain other than the maximum text line and the text lines that overlap it, continue by retaining the next maximum text line from the remaining text lines and deleting the other text lines that overlap it, and so on, until no maximum text line remains to be retained;
wherein the combined regions contained in each retained maximum text line are the individual character regions.
In the manner described above, the text line with the largest number of combined regions is again found among all the text lines that remain after the previously selected text line has been set aside, and so on, until no selectable text line remains.
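Steps 203 and 204 together form a greedy selection loop, sketched below. Text lines are modeled as lists of combined regions, and `overlaps` is a hypothetical predicate; in the usage shown, two lines are assumed to overlap when they share a connected component, which is one possible reading since the patent does not define overlap precisely.

```python
def select_text_lines(lines, overlaps):
    """Repeatedly keep the line with the most combined regions and drop
    every remaining line that overlaps it.

    `lines` is a list of text lines, each a list of combined regions.
    `overlaps(kept, other)` is a hypothetical caller-supplied predicate.
    """
    remaining = list(lines)
    kept = []
    while remaining:
        best = max(remaining, key=len)  # ties: first line encountered wins
        kept.append(best)
        remaining = [ln for ln in remaining
                     if ln is not best and not overlaps(best, ln)]
    return kept


def share_component(line_a, line_b):
    """Assumed overlap rule: the two lines share a connected-component id."""
    ids_a = set().union(*line_a)
    ids_b = set().union(*line_b)
    return bool(ids_a & ids_b)
```

For instance, a correctly merged line of three regions beats an over-merged line of two regions built from the same components, and an unrelated line elsewhere in the image is kept in a later iteration.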
As the above embodiment shows, individual characters in a human-edited image are usually arranged regularly in rows. If an individual character region is merged correctly, it should therefore be similar in size to the surrounding individual character regions, be neatly aligned with them, and be able to form a long text line with them. Conversely, if an individual character region is merged incorrectly, through over-segmentation or over-merging, it is very unlikely to form a long text line with the surrounding individual character regions. The embodiments of the present invention therefore perform text line arrangement analysis on all combined regions to obtain text lines and select from them the text line with the largest number of connected components, that is, the longest text line. The combined regions in this longest text line are the correctly merged individual character regions, which solves the inaccurate merging of the prior art.
In addition, a common case of cross-line over-merging deserves particular mention: when the line spacing is very small, characters in several adjacent lines are over-merged across lines into one character region. Although the over-merged regions can also form a fairly long line, the over-merging means that the line necessarily contains fewer combined regions than the correct line does, so the strategy of the present invention still selects the correctly merged line. The present invention therefore also handles cross-line over-merging well.
Embodiment 2
When text line arrangement analysis is performed to obtain text lines, the objects processed are the combined regions produced by every round of merging. These inevitably include a large number of combined regions in an over-segmented state (combined regions produced at intermediate stages, before the final merge is completed). Performing text line arrangement analysis on these over-segmented combined regions necessarily reduces the accuracy and execution efficiency of the analysis. To solve this problem, Embodiment 2 differs from Embodiment 1 in that, when text line arrangement analysis is performed on the combined regions, the combined regions in an over-segmented state are excluded from the analysis. Refer to Fig. 4, which is a flowchart of a method for merging individual character regions disclosed in Embodiment 2 of the present invention. The method includes the following steps:
Step 401: Extract the connected components in an image and merge the connected components to obtain the multiple combined regions produced by the merging process;
For the specific implementation of this step, see step 201 in Embodiment 1. It was described in detail there and is not repeated here.
Step 402: Obtain a first combined region set, which includes at least two combined regions that share a connected component, and extract text lines based on the combined region in the first combined region set that contains the most connected components; obtain a second combined region set, which includes at least one combined region that shares no connected component with another, and extract text lines based on the combined regions in the second combined region set;
For example, merging the three connected components "口" of the character "品" produces three combined regions (1, 2 and 3), as shown in Fig. 5. Combined regions 1 and 2 are in an over-segmented state, produced at intermediate stages of merging, and each contains two connected components. Combined region 3 is the correct combined region produced at the final stage of merging and contains three connected components. All three combined regions are similar in size and arrangement to the surrounding characters, so in text line arrangement analysis all three would be extracted onto the same text line. This affects not only the accuracy and execution efficiency of the analysis but also the number of combined regions the text line contains: the extracted text line would contain 2 more combined regions than it actually should. Because the number of combined regions in a text line is the basis for finally deciding whether the text line is retained, distorting that number ultimately affects the accuracy of merging individual character regions.
Fig. 5 shows that all three combined regions share connected components and that the combined region with the most connected components is the correct one, produced at the final stage of merging. Therefore, if several of the combined regions share connected components, the one among them with the most connected components is the correct combined region produced at the final stage of merging, and the rest are combined regions in an over-segmented state. Text lines are extracted based on the combined region with the most connected components, so that the over-segmented combined regions are excluded from text line arrangement analysis.
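One way to sketch this filtering, assuming combined regions are represented as frozensets of connected-component ids (an assumption, as is the greedy largest-first order used here):

```python
def drop_over_segmented(regions):
    """Keep, among regions that share connected components, the largest one.

    `regions` is an iterable of frozensets of component ids. Regions are
    visited from largest to smallest, and a region is kept only if it shares
    no component with a region already kept.
    """
    kept = []
    for region in sorted(set(regions), key=len, reverse=True):
        if not any(region & other for other in kept):
            kept.append(region)
    return kept
```

For the "品" example, the two-component intermediates are discarded in favor of the three-component region, while a region that shares nothing with the others passes through unchanged. Note that this sketch always prefers the larger region, so it does nothing about over-merged regions; those are left to the text line selection step.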
It should be noted that the way text lines are extracted based on the combined region with the most connected components in the first combined region set depends on the text line arrangement analysis method used.
Preferably, when the text line arrangement analysis method based on the Hough transform is used, the straight-line relationship between the combined regions in the first combined region set is set to be calculated; the first combined region set is then searched for in the text lines obtained by performing text line arrangement analysis on the combined regions; and, within the first combined region set that is found, the combined region with the most connected components is retained and the other combined regions are removed.
Alternatively, and preferably, when the text line arrangement analysis method based on region clustering is used, the number of connected components contained in a combined region is added to the weight factors of that combined region's weight, and the weight between the combined regions in the first combined region set is set to 0.
Step 403: Count the number of combined regions that each text line contains, retain the maximum text line, that is, the text line containing the most combined regions, and delete the other text lines that overlap it, wherein the combined regions contained in the maximum text line are the individual character regions.
For the specific implementation of this step, see step 203 in Embodiment 1. It was described in detail there and is not repeated here.
As the above embodiment shows, in addition to the technical effect of Embodiment 1, this embodiment excludes the combined regions in an over-segmented state from text line arrangement analysis and therefore further improves the accuracy of that analysis.
Embodiment 3
A method for merging individual character regions is described in detail below, taking text line arrangement analysis by region clustering as an example. Refer to Fig. 6, which is a flowchart of a method for merging individual character regions disclosed in Embodiment 3 of the present invention. The method includes the following steps:
Step 601: Compare all connected components in the human-edited image pairwise, and merge any two connected components that satisfy the enclosure structure relation or the adjacent structure relation to obtain combined regions;
Step 602: Take all connected components and the combined regions produced by every round of merging as merging objects, and repeat the pairwise comparison, merging any two merging objects that satisfy the enclosure structure relation or the adjacent structure relation, until no further merging is possible;
Step 603: Using the text line arrangement analysis method based on region clustering, add the number of connected components contained in a combined region to the weight factors of that combined region's weight, set the weight between combined regions that share a connected component to 0, and obtain text lines;
For example, suppose that under the existing region-clustering text line arrangement analysis method the old weight of combined region R is W. After the number of connected components is added to the weight factors, the new weight of combined region R is W+kn, where k is a constant and n is the number of connected components that combined region R contains. Similarly, if the old weight between combined regions R1 and R2 (which generally represents the probability that they belong to the same line) is W, the new weight is W+kn1+kn2, where k is a constant, n1 is the number of connected components that combined region R1 contains, and n2 is the number that combined region R2 contains. For instance, in a clique extraction method based on a greedy algorithm, the weight of a selected matching pair (a vertex in the graph) is the number N of edges connected to that vertex, and the new weight can be set to N+kn1+kn2, where n1 and n2 are the numbers of connected components contained in the two combined regions of the matching pair.
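The adjusted pairwise weight from step 603 can be written as a small function. Combined regions are modeled as frozensets of component ids, and the default value of `k` is an arbitrary placeholder; the patent only says that k is a constant.

```python
def pair_weight(old_weight, region_1, region_2, k=0.1):
    """New weight W + k*n1 + k*n2 between two combined regions.

    Regions that share a connected component get weight 0, so that
    over-segmented variants of the same character are never clustered
    into one line. The default k is a placeholder value.
    """
    if region_1 & region_2:
        return 0.0
    return old_weight + k * len(region_1) + k * len(region_2)
```

The added term favors pairs of regions that contain more connected components, which biases clustering toward fully merged characters over their intermediate fragments.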
Step 604:From all literal lines in addition to the literal line selected, circulation selection comprises combined region The most literal line of number, deleting has overlapping literal line with described literal line, wherein, comprises in the literal line selected Combined region is the individual character region merging.
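The selection loop of step 604 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: a merged region is modeled as the set of ids of its connected components, a text line as a list of merged regions, and two text lines are assumed to "overlap" when they share at least one connected component. The names `select_lines` and `lines_overlap` are hypothetical.

```python
from typing import FrozenSet, List

Region = FrozenSet[int]   # a merged region: ids of the connected components it contains
TextLine = List[Region]   # a text line: the merged regions arranged in it


def components_of(line: TextLine) -> set:
    """Return every connected-component id covered by a text line."""
    return set().union(*line)


def lines_overlap(a: TextLine, b: TextLine) -> bool:
    """Assumed overlap test: the two lines share at least one connected component."""
    return bool(components_of(a) & components_of(b))


def select_lines(candidates: List[TextLine]) -> List[TextLine]:
    """Repeatedly keep the line with the most merged regions and drop lines overlapping it."""
    remaining = list(candidates)
    kept: List[TextLine] = []
    while remaining:
        best = max(remaining, key=len)  # line containing the most merged regions
        kept.append(best)
        remaining = [ln for ln in remaining
                     if ln is not best and not lines_overlap(ln, best)]
    return kept


# Hypothetical data: a correctly merged line, an over-merged alternative, an unrelated line.
correct = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5}), frozenset({6})]
over_merged = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
other = [frozenset({7}), frozenset({8})]
selected = select_lines([over_merged, correct, other])
```

In this toy input the over-merged line covers the same connected components as the correct line but contains fewer merged regions, so it is the one discarded.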
As can be seen from the above embodiment, single characters in manually edited images are usually arranged regularly in lines. Therefore, if a merged single-character region is correct, it should be comparable in size to the surrounding single-character regions and neatly aligned with them, and together they can form a longer text line. Conversely, if a single-character region is merged incorrectly, producing over-segmentation or over-merging, the probability that the incorrectly merged region forms a longer text line with the surrounding single-character regions is very small. The embodiment of the present invention therefore performs text-line arrangement analysis on all merged regions to obtain text lines, selects from them the text line with the largest number of connected components, i.e. the longest text line, and takes the merged regions in that longest text line as the correctly merged single-character regions, thereby solving the problem of inaccurate merging in the prior art.
In addition, one common case of cross-line over-merging deserves special emphasis: because the line spacing is very small, several adjacent lines of text are over-merged across lines into a single text region. In this case, although the over-merged regions can also form a fairly long line, the over-merging means that the number of merged regions the line contains is necessarily smaller than the number of merged regions in the correct line, so the strategy of the present invention still selects the correctly merged line. The present invention therefore also handles this kind of cross-line over-merging well.
Embodiment 4
Corresponding to the above method for merging single-character regions, an embodiment of the present invention further provides a device for merging single-character regions. Refer to Fig. 7, which is a structural diagram of the device for merging single-character regions disclosed in Embodiment 4 of the present invention. The device includes a merging module 701, a text-line arrangement analysis module 702 and a selection module 703. The internal structure and connections of the device are described further below in conjunction with its working principle.
The merging module 701 is configured to extract the connected components in an image and merge the connected components, obtaining the multiple merged regions produced by the merging process;
The text-line arrangement analysis module 702 is configured to arrange the merged regions to obtain at least one text line;
The first selection module 703 is configured to count the number of merged regions contained in each text line, retain the largest text line, i.e. the one containing the most merged regions, and delete the other text lines that overlap it, wherein the merged regions contained in the largest text line are the single-character regions.
Preferably, the device shown in Fig. 7 may further include a loop selection module, configured to, if text lines remain besides the largest text line and the text lines overlapping it, continue to retain the next largest text line from the remaining text lines and delete the other text lines that overlap it, and so on, until no largest text line remains to be retained;
wherein the merged regions contained in each retained largest text line are the single-character regions.
Preferably, as shown in Fig. 8, the text-line arrangement analysis module 702 further includes a first line extraction submodule 7021 and a second line extraction submodule 7022, wherein:
The first line extraction submodule 7021 is configured to obtain a first merged-region set, which includes at least two merged regions that have a connected component in common, and to extract text lines based on the merged region in the first merged-region set that contains the most connected components;
The second line extraction submodule 7022 is configured to obtain a second merged-region set, which includes at least one merged region that has no connected component in common with other merged regions, and to extract text lines based on the merged regions in the second merged-region set.
Further preferably, the first line extraction submodule 7021 includes:
A first mutual-exclusion condition setting submodule, configured to, when the text-line arrangement analysis method based on the Hough transform is used, set the line relationship between the merged regions in the first merged-region set to be calculated, and search for the first merged-region set in the text lines obtained by performing text-line arrangement analysis on the merged regions;
A line selection submodule, configured to retain the merged region that contains the most connected components in the first merged-region set that was found, and remove the other merged regions.
Alternatively, further preferably, the first line extraction submodule 7021 includes:
A weight factor setting subunit, configured to, when the text-line arrangement analysis method based on region clustering is used, add the number of connected components contained in each merged region to the weight factors of that merged region's weight;
A second mutual-exclusion condition setting submodule, configured to, when the text-line arrangement analysis method based on region clustering is used, set the weight between the merged regions that contain the same connected component to 0.
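The two weight rules for the region-clustering variant (the W+kn1+kn2 update and the zero weight for mutually exclusive regions) might be sketched as below. This is an assumption-laden illustration: regions are modeled as sets of connected-component ids, the function names are hypothetical, and the value of the constant k is arbitrary because the text only states that k is a constant.

```python
from typing import FrozenSet

Region = FrozenSet[int]  # a merged region: ids of the connected components it contains


def adjusted_region_weight(old_weight: float, region: Region, k: float = 1.0) -> float:
    """New weight of a single merged region: W + k*n."""
    return old_weight + k * len(region)


def adjusted_pair_weight(old_weight: float, r1: Region, r2: Region, k: float = 1.0) -> float:
    """New weight between two merged regions.

    Regions that share a connected component are mutually exclusive, so their
    pairwise weight is forced to 0; otherwise the weight becomes W + k*n1 + k*n2.
    """
    if r1 & r2:
        return 0.0
    return old_weight + k * len(r1) + k * len(r2)


# Hypothetical values: W = 1.0, k = 2.0, n1 = 2, n2 = 1.
w_pair = adjusted_pair_weight(1.0, frozenset({1, 2}), frozenset({3}), k=2.0)
# Regions {1, 2} and {2, 3} both contain component 2, so they cannot share a line.
w_exclusive = adjusted_pair_weight(1.0, frozenset({1, 2}), frozenset({2, 3}), k=2.0)
```

Adding k*n favours regions built from more connected components, while the zero weight keeps two alternative merges of the same component from being clustered into one text line.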
Preferably, as shown in Fig. 9, the merging module 701 includes a connected-component merging submodule 7011 and a merge-object merging submodule 7012, wherein:
The connected-component merging submodule 7011 is configured to compare the connected components in the manually edited image pairwise and merge any two connected components that satisfy the enclosing-structure relation or the adjacent-structure relation, to obtain merged regions;
Further preferably, the connected-component merging submodule includes: a first judging submodule, configured to judge whether the ratio of the overlapping area between the fitted rectangles of two connected components to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two connected components are close; if so, the enclosing-structure relation is satisfied, otherwise it is not; and a second judging submodule, configured to judge whether the ratio of the sum of the widths of the fitted rectangles of two connected components to the distance between their centers is greater than a second preset multiple, whether the colors and strokes of the two connected components are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than a third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
The merge-object merging submodule 7012 is configured to take the connected components and the merged regions produced by each round of merging as merge objects, repeatedly compare the merge objects pairwise, and merge any two merge objects that satisfy the enclosing-structure relation or the adjacent-structure relation, until no further merging is possible.
Further preferably, the merge-object merging submodule 7012 includes: a third judging submodule, configured to judge whether the ratio of the overlapping area between the fitted rectangles of two merge objects to the area of the smaller of the two fitted rectangles is greater than the first preset multiple, and whether the colors and stroke widths of the two merge objects are close; if so, the enclosing-structure relation is satisfied, otherwise it is not; and a fourth judging submodule, configured to judge whether the ratio of the sum of the widths of the fitted rectangles of two merge objects to the distance between their centers is greater than the second preset multiple, whether the colors and strokes of the two merge objects are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than the third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
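The two structural tests performed by the judging submodules could look roughly like the sketch below. Several points are assumptions rather than details given in the text: fitted rectangles are taken to be axis-aligned, the "length-to-width ratio" is read as long side over short side, the color and stroke-width comparison is reduced to a caller-supplied boolean `similar`, and the threshold values `t1`, `t2`, `t3` are placeholders for the unspecified preset multiples.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned fitted rectangle (assumed representation)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> tuple:
        return (self.x + self.w / 2, self.y + self.h / 2)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)


def union_rect(a: Rect, b: Rect) -> Rect:
    """Fitted rectangle of the region obtained by merging a and b."""
    x0, y0 = min(a.x, b.x), min(a.y, b.y)
    x1, y1 = max(a.x + a.w, b.x + b.w), max(a.y + a.h, b.y + b.h)
    return Rect(x0, y0, x1 - x0, y1 - y0)


def is_enclosing(a: Rect, b: Rect, similar: bool, t1: float = 0.8) -> bool:
    """Overlap area / smaller rectangle area must exceed the first preset multiple."""
    smaller = min(a.area, b.area)
    if smaller == 0:
        return False
    return similar and overlap_area(a, b) / smaller > t1


def is_adjacent(a: Rect, b: Rect, similar: bool, t2: float = 1.5, t3: float = 1.5) -> bool:
    """(Sum of widths) / center distance > t2, and merged aspect ratio < t3."""
    (ax, ay), (bx, by) = a.center, b.center
    dist = math.hypot(ax - bx, ay - by)
    merged = union_rect(a, b)
    if dist == 0 or min(merged.w, merged.h) == 0:
        return False
    aspect = max(merged.w, merged.h) / min(merged.w, merged.h)
    return similar and (a.w + b.w) / dist > t2 and aspect < t3


# Hypothetical rectangles: an outer frame with an inner part, and two side-by-side halves.
outer, inner = Rect(0, 0, 10, 10), Rect(2, 2, 4, 4)
left, right, far = Rect(0, 0, 4, 10), Rect(5, 0, 5, 10), Rect(30, 0, 5, 10)
```

The aspect-ratio condition is what keeps two neighbouring full characters from being fused: their union is roughly twice as wide as it is tall, whereas the two halves of a left-right structured character merge into a nearly square rectangle.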
As can be seen from the above embodiment, single characters in manually edited images are usually arranged regularly in lines. Therefore, if a merged single-character region is correct, it should be comparable in size to the surrounding single-character regions and neatly aligned with them, and together they can form a longer text line. Conversely, if a single-character region is merged incorrectly, producing over-segmentation or over-merging, the probability that the incorrectly merged region forms a longer text line with the surrounding single-character regions is very small. The embodiment of the present invention therefore performs text-line arrangement analysis on all merged regions to obtain text lines, selects from them the text line with the largest number of connected components, i.e. the longest text line, and takes the merged regions in that longest text line as the correctly merged single-character regions, thereby solving the problem of inaccurate merging in the prior art.
In addition, one common case of cross-line over-merging deserves special emphasis: because the line spacing is very small, several adjacent lines of text are over-merged across lines into a single text region. In this case, although the over-merged regions can also form a fairly long line, the over-merging means that the number of merged regions the line contains is necessarily smaller than the number of merged regions in the correct line, so the strategy of the present invention still selects the correctly merged line. The present invention therefore also handles this kind of cross-line over-merging well.
It should be noted that a person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The method and device for merging single-character regions provided by the present invention have been described in detail above. Specific embodiments are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (16)

1. A method for merging single-character regions, characterized in that it comprises:
extracting the connected components in an image and merging the connected components, to obtain multiple merged regions produced by the merging process;
arranging the merged regions to obtain at least one text line;
counting the number of merged regions contained in each text line, retaining the largest text line, i.e. the one containing the most merged regions, and deleting the other text lines that overlap it, wherein the merged regions contained in the largest text line are the single-character regions.
2. The method according to claim 1, characterized in that the method further comprises:
if text lines remain besides the largest text line and the text lines overlapping the largest text line, continuing to retain the next largest text line from the remaining text lines and deleting the other text lines that overlap it, and so on, until no largest text line remains to be retained;
wherein the merged regions contained in the largest text line retained each time are the single-character regions.
3. The method according to claim 1, characterized in that arranging the merged regions to obtain at least one text line comprises:
obtaining a first merged-region set, which includes at least two merged regions that have a connected component in common, and extracting text lines based on the merged region in the first merged-region set that contains the most connected components;
obtaining a second merged-region set, which includes at least one merged region that has no connected component in common with other merged regions, and extracting text lines based on the merged regions in the second merged-region set.
4. The method according to claim 3, characterized in that obtaining the first merged-region set, which includes at least two merged regions that have a connected component in common, and extracting text lines based on the merged region in the first merged-region set that contains the most connected components, comprises:
when the text-line arrangement analysis method based on the Hough transform is used, setting the line relationship between the merged regions in the first merged-region set to be calculated, and searching for the first merged-region set in the text lines obtained by performing text-line arrangement analysis on the merged regions;
retaining the merged region that contains the most connected components in the first merged-region set that was found, and removing the other merged regions.
5. The method according to claim 3, characterized in that obtaining the first merged-region set, which includes at least two merged regions that have a connected component in common, and extracting text lines based on the merged region in the first merged-region set that contains the most connected components, comprises:
when the text-line arrangement analysis method based on region clustering is used, adding the number of connected components contained in each merged region to the weight factors of that merged region's weight;
setting the weight between the merged regions in the first merged-region set to 0.
6. The method according to claim 1, characterized in that extracting the connected components in the image and merging the connected components to obtain multiple merged regions produced by the merging process comprises:
comparing the extracted connected components pairwise, and merging any two connected components that satisfy the enclosing-structure relation or the adjacent-structure relation, to obtain merged regions;
taking the connected components and the merged regions produced by the merging process as merge objects, repeatedly comparing the merge objects pairwise, and merging any two merge objects that satisfy the enclosing-structure relation or the adjacent-structure relation, until no further merging is possible.
7. The method according to claim 6, characterized in that merging any two connected components that satisfy the enclosing-structure relation or the adjacent-structure relation comprises:
judging whether the ratio of the overlapping area between the fitted rectangles of two connected components to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two connected components are close; if so, the enclosing-structure relation is satisfied, otherwise it is not;
judging whether the ratio of the sum of the widths of the fitted rectangles of two connected components to the distance between their centers is greater than a second preset multiple, whether the colors and strokes of the two connected components are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than a third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
8. The method according to claim 6, characterized in that merging any two merge objects that satisfy the enclosing-structure relation or the adjacent-structure relation comprises:
judging whether the ratio of the overlapping area between the fitted rectangles of two merge objects to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two merge objects are close; if so, the enclosing-structure relation is satisfied, otherwise it is not;
judging whether the ratio of the sum of the widths of the fitted rectangles of two merge objects to the distance between their centers is greater than a second preset multiple, whether the colors and strokes of the two merge objects are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than a third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
9. A device for merging single-character regions, characterized in that it comprises:
a merging module, configured to extract the connected components in an image and merge the connected components, to obtain multiple merged regions produced by the merging process;
a text-line arrangement analysis module, configured to arrange the merged regions to obtain at least one text line;
a first selection module, configured to count the number of merged regions contained in each text line, retain the largest text line, i.e. the one containing the most merged regions, and delete the other text lines that overlap it, wherein the merged regions contained in the largest text line are the single-character regions.
10. The device according to claim 9, characterized in that the device further comprises:
a loop selection module, configured to, if text lines remain besides the largest text line and the text lines overlapping the largest text line, continue to retain the next largest text line from the remaining text lines and delete the other text lines that overlap it, and so on, until no largest text line remains to be retained;
wherein the merged regions contained in the largest text line retained each time are the single-character regions.
11. The device according to claim 9, characterized in that the text-line arrangement analysis module comprises:
a first line extraction submodule, configured to obtain a first merged-region set, which includes at least two merged regions that have a connected component in common, and to extract text lines based on the merged region in the first merged-region set that contains the most connected components;
a second line extraction submodule, configured to obtain a second merged-region set, which includes at least one merged region that has no connected component in common with other merged regions, and to extract text lines based on the merged regions in the second merged-region set.
12. The device according to claim 11, characterized in that the first line extraction submodule comprises:
a first mutual-exclusion condition setting submodule, configured to, when the text-line arrangement analysis method based on the Hough transform is used, set the line relationship between the merged regions in the first merged-region set to be calculated, and search for the first merged-region set in the text lines obtained by performing text-line arrangement analysis on the merged regions;
a line selection submodule, configured to retain the merged region that contains the most connected components in the first merged-region set that was found, and remove the other merged regions.
13. The device according to claim 11, characterized in that the first line extraction submodule comprises:
a weight factor setting subunit, configured to, when the text-line arrangement analysis method based on region clustering is used, add the number of connected components contained in each merged region to the weight factors of that merged region's weight;
a second mutual-exclusion condition setting submodule, configured to, when the text-line arrangement analysis method based on region clustering is used, set the weight between the merged regions that contain the same connected component to 0.
14. The device according to claim 9, characterized in that the merging module comprises:
a connected-component merging submodule, configured to compare the connected components in the manually edited image pairwise and merge any two connected components that satisfy the enclosing-structure relation or the adjacent-structure relation, to obtain merged regions;
a merge-object merging submodule, configured to take the connected components and the merged regions produced by each round of merging as merge objects, repeatedly compare the merge objects pairwise, and merge any two merge objects that satisfy the enclosing-structure relation or the adjacent-structure relation, until no further merging is possible.
15. The device according to claim 14, characterized in that the connected-component merging submodule comprises:
a first judging submodule, configured to judge whether the ratio of the overlapping area between the fitted rectangles of two connected components to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two connected components are close; if so, the enclosing-structure relation is satisfied, otherwise it is not;
a second judging submodule, configured to judge whether the ratio of the sum of the widths of the fitted rectangles of two connected components to the distance between their centers is greater than a second preset multiple, whether the colors and strokes of the two connected components are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than a third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
16. The device according to claim 14, characterized in that the merge-object merging submodule comprises:
a third judging submodule, configured to judge whether the ratio of the overlapping area between the fitted rectangles of two merge objects to the area of the smaller of the two fitted rectangles is greater than a first preset multiple, and whether the colors and stroke widths of the two merge objects are close; if so, the enclosing-structure relation is satisfied, otherwise it is not;
a fourth judging submodule, configured to judge whether the ratio of the sum of the widths of the fitted rectangles of two merge objects to the distance between their centers is greater than a second preset multiple, whether the colors and strokes of the two merge objects are close, and whether the length-to-width ratio of the fitted rectangle of the merged region is less than a third preset multiple; if so, the adjacent-structure relation is satisfied, otherwise it is not.
CN201210486972.7A 2012-11-26 2012-11-26 A kind of merging method in individual character region and device Active CN103839060B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710112654.7A CN107122778B (en) 2012-11-26 2012-11-26 Method and device for merging single character areas
CN201210486972.7A CN103839060B (en) 2012-11-26 2012-11-26 A kind of merging method in individual character region and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210486972.7A CN103839060B (en) 2012-11-26 2012-11-26 A kind of merging method in individual character region and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201710112654.7A Division CN107122778B (en) 2012-11-26 2012-11-26 Method and device for merging single character areas

Publications (2)

Publication Number Publication Date
CN103839060A CN103839060A (en) 2014-06-04
CN103839060B true CN103839060B (en) 2017-03-01

Family

ID=50802539

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201210486972.7A Active CN103839060B (en) 2012-11-26 2012-11-26 A kind of merging method in individual character region and device
CN201710112654.7A Active CN107122778B (en) 2012-11-26 2012-11-26 Method and device for merging single character areas

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710112654.7A Active CN107122778B (en) 2012-11-26 2012-11-26 Method and device for merging single character areas

Country Status (1)

Country Link
CN (2) CN103839060B (en)

Also Published As

Publication number Publication date
CN107122778B (en) 2020-06-23
CN107122778A (en) 2017-09-01
CN103839060A (en) 2014-06-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant