CN108564078A - Method for extracting the central axis of a Manchu word image - Google Patents
- Publication number: CN108564078A
- Application number: CN201810371803.6A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/225—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
A method for extracting the central axis of a Manchu word image, belonging to the field of character segmentation and intended to improve the accuracy of Manchu segmentation. The key technical points are: locate the central axis of the Manchu word image, then detect the width of that axis. The effect is as follows: central-axis detection directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of axis detection. By locating the central axis and detecting its width, the axis can be extracted accurately.
Description
Technical field
The invention belongs to the field of character segmentation and relates to a method for extracting the central axis of a Manchu word image.
Background art
Manchu is the spoken and written language used by ethnic groups such as the Manchu and Xibe peoples of China. It was promoted and used as an official script during the Qing Dynasty, which left a large body of valuable Manchu literature. Because the Manchu language is now on the verge of extinction, the urgent need to rescue and protect the Manchu linguistic and cultural heritage has been recognized by the state and by all sectors of society. Research on Manchu optical character recognition is therefore particularly important for protecting and passing on the cultural heritage of the Qing Dynasty. Manchu is a phonemic script with 38 letters: 6 vowels, 22 consonants, and 10 special letters used specifically to spell Chinese loanwords. Manchu is written from top to bottom within a column, and columns are arranged from left to right. Manchu recognition generally requires first cutting the text into basic units (such as letters) and then recognizing them, so recognition accuracy can be improved by improving segmentation accuracy.
Summary of the invention
To solve the problem of improving Manchu segmentation accuracy, the invention proposes the following technical solution:
A method for extracting the central axis of a Manchu word image, comprising the following steps:
S1. Locate the central axis of the Manchu word image;
S2. Detect the width of the central axis of the Manchu word image.
As a supplement to the technical solution, step S1 specifically comprises:
S1.1. Invert the Manchu word image so that pixels in the text portion take the value 1 and pixels in the background take the value 0;
S1.2. Use the morphological thinning function of the MATLAB Image Processing Toolbox to thin the Manchu word image morphologically;
S1.3. On the thinned Manchu word image, use the Hough transform to determine the column coordinate of the thinned central axis; this column coordinate is the position of the central axis of the Manchu word image. The angle at which the Hough transform searches for straight lines is restricted to θ = 90°, so that only vertical lines are searched. Line segments that share the same longitudinal position, are separated by less than the height of the Manchu word image, and are longer than 1 pixel are connected into a single straight line, which gives the center position of the central axis.
As a supplement to the technical solution, step S2 specifically comprises:
S2.1. Determine the search region of the maximum run-length ratio method;
S2.2. Apply the maximum run-length ratio method to the Manchu word image within the search region to determine the width of the central axis of the Manchu word image;
S2.3. Compute the left and right boundaries of the central axis from the center position and the width of the central axis of the Manchu word image.
As a supplement to the technical solution, step S2.1 is specifically:
The search region of the maximum run-length ratio method is restricted to the range defined by the following formula:
where sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, round denotes rounding to the nearest integer, and W is the width of the Manchu word image.
As a supplement to the technical solution, the maximum run-length ratio method of step S2.2 proceeds as follows: scan every row of the Manchu word image within the search region, and count the run lengths of consecutive black pixels and the number of times each length occurs; the run length with the largest number of occurrences is the width of the central axis of the Manchu word image.
As a supplement to the technical solution, the left and right boundaries of the central axis in step S2.3 are calculated by the following formula:
where bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the center position of the central axis of the Manchu word image, baseline_width is the width of the central axis of the Manchu word image, and round denotes rounding to the nearest integer.
Advantageous effect: central-axis detection in a Manchu word image directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of axis detection. By locating the central axis and detecting its width, the axis can be extracted accurately.
Description of the drawings
Fig. 1: flow chart of building the Manchu component set;
Fig. 2: flow chart of Manchu component segmentation;
Fig. 3: examples of central-axis detection errors made by conventional methods on Manchu word images;
Fig. 4: determining the Manchu axis width with the region-limited maximum run-length ratio method, where: (1) error example of the maximum run-length ratio method, (2) search range limited by the invention, (3) result of the method of the invention;
Fig. 5: central-axis detection results of the method of the invention;
Fig. 6: flow chart of Manchu component cutting;
Fig. 7: Manchu component cutting results, where: (1) weak segmentation, (2) weakly segmented region after fine cutting, (3) over-segmentation, (4) over-segmented regions after merging, (5) partial segmentation results.
Detailed description of the embodiments
From the standpoint of optical character recognition, Manchu has the following characteristics:
(1) Depending on its position within a word, the same Manchu letter generally has 4 different forms: independent, initial, medial, and final. There are 114 Manchu letter forms in total.
(2) Words in the same column of a Manchu document all lie near the same central axis. In printed Manchu, words in two adjacent columns essentially never overlap, which facilitates column extraction. Within a column of Manchu text there is a certain gap between words, which facilitates word extraction.
(3) A Manchu word consists of one or more Manchu letters connected along a vertical central axis, with no gap between the letters of a word. However, the junctions between letters lie on the central axis of the Manchu word image, so the pixel features at the central axis can be used to segment Manchu letters.
(4) Some Manchu letters show a "one shape, multiple letters" phenomenon. For example, a single character is at once the medial form of the letters a, e, and n; during recognition it can be distinguished using the spelling rules of the adjacent letters.
(5) Some Manchu letters share the same constituent parts. For example, the initial form of the letter o can be regarded as composed of two parts: the initial form of the letter e and the medial form of the letter o. Using the Manchu letter as the basic cutting unit is therefore prone to over-segmentation and weak segmentation (under-segmentation).
(6) Certain letter combinations are not separable. For example, the combination bo is extremely difficult to cut into the letters b and o.
Based on the above characteristics of Manchu, this embodiment proposes deconstructing Manchu words into components and using the Manchu component (hereinafter "component") as the basic unit of segmentation and recognition. This avoids the over-segmentation and weak segmentation caused by using the Manchu letter as the basic cutting unit. The Manchu component set draws on three sources: Manchu letters, parts of a letter or letter combination, and letter combinations. The purpose of building the component set is to reduce recognition errors caused by segmentation. As analyzed above, using the letter as the basic cutting unit readily produces over-segmentation and weak segmentation, and a classifier built to recognize letters will inevitably misrecognize, or fail to recognize, the over-segmented and weakly segmented parts. The component set proposed by the invention is instead designed around the output of the segmentation method: common over-segmentations (parts of a letter or letter combination) and weak segmentations (letter combinations) are no longer regarded as "errors" but as "correct" segmentations, so the classifier designed afterwards can recognize these components, which reduces recognition errors caused by segmentation errors. Manchu components can be understood by analogy with recognizing an English word. Take the English word "study": the whole word can be recognized directly; or the word can be cut into the letters s, t, u, d, y, each letter recognized separately, and the letters then combined into the word "study"; or, if cutting into letters is hard to achieve but cutting into components is relatively easy, the word can be cut into st, u, dy (where st, u, and dy are components), the components recognized, and the components then combined into the word. Because of the characteristics described above, however, cutting Manchu components is not as easy as in this English example.
As shown in Fig. 1, the Manchu component set is built as follows. With reference to the Manchu alphabet, the National Standard of the People's Republic of China "Information technology: Universal multiple-octet coded character set: Xibe and Manchu glyphs", and the Mongolian component sets in documents [1-2], an initial Manchu component set containing 99 initial components is proposed (hereinafter the "initial set"), and Flag = 0 is set for every component in it. The Manchu cutting method is applied to Manchu word images, and the segmentation results are counted and analyzed: if a component produced by segmentation does not belong to the initial set, it is added to the initial set and its Flag is set to 1; if it belongs to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components with Flag = 0, that is, components that never appeared in any segmentation result; any such component is deleted from the initial set. Finally the Manchu component set is collated and output. The Manchu component set contains 106 components in total; see Attached Table 1 for details. The documents [1-2] referred to above are:
[1] Hongxi Wei, Guanglai Gao. A keyword retrieval system for historical Mongolian document images [J]. International Journal on Document Analysis and Recognition, 2014, 17(1): 33-45.
[2] Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao. Multi-font printed Mongolian document recognition system [J]. International Journal on Document Analysis and Recognition, 2010, 13(2): 93-106.
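The Flag bookkeeping used to build the component set can be sketched in a few lines of Python. This is only an illustration under stated assumptions: components are represented by hypothetical string labels rather than glyph images, and the function name `build_component_set` does not come from the patent.

```python
def build_component_set(initial_components, segmented_components):
    """Keep every component seen in segmentation results; drop unseen initial ones."""
    # Every component of the initial set starts with Flag = 0.
    flags = {component: 0 for component in initial_components}
    for component in segmented_components:
        # Adds components missing from the initial set and marks known ones as seen.
        flags[component] = 1
    # Components whose Flag is still 0 never appeared in a result and are removed.
    return [component for component, flag in flags.items() if flag == 1]


# Hypothetical labels, borrowed from the "study" analogy in the text.
initial = ["st", "u", "dy", "x"]
observed = ["st", "u", "dy", "u", "stu"]
component_set = build_component_set(initial, observed)
```

Here `x` is dropped because it never appears in a segmentation result, while `stu` is added because segmentation produced it.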
As shown in Fig. 2, the steps of Manchu component segmentation are as follows:
S1. A photoelectric conversion device converts the paper Manchu document into a digital image document that a computer can store and process, and the digital image of the Manchu document is preprocessed (smoothing, binarization);
S2. Layout analysis (skew correction, column cutting, and word cutting);
S3. Extract the Manchu word images;
S4. Position normalization;
S5. Central-axis detection;
S6. Cut the Manchu components according to the relationship between the components and the position of the central axis.
Skew correction uses the Hough transform to determine the skew angle of the page and then rotates the image back to the upright text orientation. The skew-corrected Manchu document is cut into columns by vertical projection; horizontal projection is then used to cut and extract the Manchu words in each column image, and the Manchu word images are position-normalized. These steps complete the preprocessing of the Manchu word image, which has height H and width W. Note that position normalization crops away the superfluous white background border of the Manchu word image. In the flow shown in Fig. 2, the image is inverted for programming convenience, so what appears to be removed is a black border around the Manchu word. The white-on-black images in Fig. 2 are these inverted images. The original image is black text on a white background, but removing the border on all four sides is easier to program on a white-on-black image, so the figure directly shows the image after inversion and removal of the black border.
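Position normalization as described here amounts to cropping the image to the bounding box of its stroke pixels. The Python sketch below illustrates this on an already inverted binary image (stroke = 1, background = 0) stored as a list of rows; the function name `crop_to_strokes` and the list-of-lists representation are assumptions made for the example.

```python
def crop_to_strokes(image):
    """Crop a binary image (stroke = 1, background = 0) to its stroke bounding box."""
    # Rows and columns that contain at least one stroke pixel.
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        raise ValueError("image contains no stroke pixels")
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


padded = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
word = crop_to_strokes(padded)
H, W = len(word), len(word[0])  # height and width of the normalized word image
```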
In this embodiment, central-axis detection in the Manchu word image directly affects segmentation accuracy; the specific scheme is described in detail below.
For central-axis detection in the Manchu word image, i.e. step S5, the prior art generally uses the vertical projection method and the maximum cumulative vertical projection method. Both methods, however, suffer from offsets in the located axis position and from errors in the estimated axis width, as shown in Fig. 3. This embodiment provides a method for extracting the central axis of a Manchu word image, comprising the following steps:
S5.1. Locating the central axis of the Manchu word image:
The Manchu word image is first inverted so that pixels in the text portion take the value 1 and background pixels take the value 0. The morphological thinning function of the MATLAB Image Processing Toolbox is then applied. It uses 3 × 3 structuring-element templates; each template contains 9 pixels, and each pixel can only take the value 0 or 1, so there are 512 different template forms. The templates are divided into 8 directions to thin the Manchu word image morphologically. On the thinned Manchu word image, the Hough transform is used to determine the column coordinate of the thinned central axis, which is the position of the central axis of the Manchu word image. When extracting the central axis, the angle at which the Hough transform searches for straight lines is restricted to θ = 90°, so that only vertical lines are searched. Line segments that share the same longitudinal position, are separated by less than the word image height H, and are longer than 1 pixel are connected into a single straight line; this gives the center position of the central axis, denoted baseline. The central axis of a Manchu word image refers to the column coordinate of the word's central axis within the image, not to the center line of the image.
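When the search angle is fixed to the vertical direction, the Hough accumulator reduces to one bin per image column: every foreground pixel of the thinned image votes for its own column, and the column with the most votes gives the axis position. The Python sketch below illustrates only this reduced voting step. It assumes the image has already been inverted and thinned (text = 1, background = 0); the function name `locate_central_axis` and the 0-based column index are illustrative choices, and the linking of line segments described above is omitted.

```python
def locate_central_axis(thinned):
    """Return the 0-based column that collects the most foreground votes."""
    if not thinned or not thinned[0]:
        raise ValueError("image must be non-empty")
    width = len(thinned[0])
    votes = [0] * width  # one accumulator bin per candidate vertical line
    for row in thinned:
        for col, pixel in enumerate(row):
            if pixel == 1:
                votes[col] += 1
    # Ties resolve to the leftmost column (an arbitrary choice for this sketch).
    return max(range(width), key=votes.__getitem__)


thinned = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
baseline = locate_central_axis(thinned)
```

A production implementation would more likely call a library Hough transform with the angle range restricted; the sketch only shows why the vertical restriction makes the search cheap.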
S5.2. Detecting the width of the central axis of the Manchu word image
S5.2.1. The axis width is determined with the maximum run-length ratio method. First, each row of the Manchu word image is scanned, and the run lengths of consecutive black pixels and the number of times each length occurs are counted. After all rows have been scanned in turn, the run length with the largest number of occurrences is taken as the width of the central axis of the Manchu word image, denoted w0. The maximum run-length ratio method is effective for detecting the axis width of a Manchu word image, but errors such as the one shown in Fig. 4(1) still occur. The cause is that the method counts runs of consecutive black pixels over the entire Manchu word image, and the glyph deformations of different Manchu fonts severely disturb these global statistics. Statistics on Manchu writing show that the Manchu axis width generally does not exceed 1/2 of the word width W. The search region of the maximum run-length ratio method is therefore restricted to the range specified by formula (1); the result is called the region-limited maximum run-length ratio method.
In formula (1), sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, and round denotes rounding to the nearest integer. Limiting the search range weakens the influence that the free writing style of Manchu and its branch strokes have on the axis-width statistics. The maximum run-length ratio method is then applied to the Manchu word image within the limited search range to complete the axis-width detection; the result is shown in Fig. 4(3).
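The region-limited run-length statistic can be sketched as follows in Python. The sketch takes the search bounds `sl` and `sr` as inputs rather than computing them, because formula (1) is not reproduced in this text. It assumes stroke pixels are encoded as 1 (the "black pixels" of the description) and uses 0-based, inclusive column indices; the function name `axis_width` is illustrative.

```python
from collections import Counter


def axis_width(image, sl, sr):
    """Most frequent stroke run length within columns sl..sr (0-based, inclusive)."""
    counts = Counter()
    for row in image:
        run = 0
        for pixel in row[sl:sr + 1]:
            if pixel == 1:
                run += 1
            elif run:
                counts[run] += 1  # a run just ended
                run = 0
        if run:  # a run that reaches the right edge of the search region
            counts[run] += 1
    if not counts:
        raise ValueError("no stroke pixels in the search region")
    return counts.most_common(1)[0][0]


word = [
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
]
width_full = axis_width(word, 0, 5)     # statistics over the whole image
width_limited = axis_width(word, 1, 4)  # statistics over a limited region
```

Passing the full column range reproduces the unrestricted method, so the same function can be used to compare both variants on a given image.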
S5.2.2. From the width baseline_width of the central axis and its center position baseline, the left boundary bl and right boundary br of the central axis are calculated according to formula (2).
A total of 400 Manchu images in different fonts and font sizes were selected at random, and the central axis was extracted with both the region-limited maximum run-length ratio method of this embodiment and the vertical projection method; the results are shown in Table 1. Some examples in which the method of the invention extracts the central axis correctly are shown in Fig. 5. The experimental results show that morphological thinning combined with the Hough transform locates the central axis of a Manchu word image accurately, and that the region-limited maximum run-length ratio method determines the width of the central axis correctly.
Table 1. Central-axis detection results for Manchu word images
| | Method of the invention | Vertical projection method |
| Correct samples | 397 | 210 |
| Incorrect samples | 3 | 190 |
| Accuracy | 99.25% | 52.50% |
In this embodiment, the accuracy of Manchu character cutting is the bottleneck in improving Manchu recognition accuracy; the specific scheme is described in detail below.
Manchu component cutting, i.e. step S6, is shown in Fig. 6 and comprises:
S6.1. Coarse cutting of Manchu components;
S6.2. Weak-segmentation judgment and fine cutting of candidate segmentation regions;
S6.3. Over-segmentation judgment and merging of candidate segmentation regions.
These steps are explained below:
S6.1. Coarse cutting of Manchu components
Because Manchu components are connected along the central axis, the Manchu word is first divided, with the central axis as the center, into three parts: left, middle, and right. The left part spans columns 1 to bl-1 of the Manchu word image, and the right part spans columns br+1 to W. Horizontal projections are computed for the left part and the right part separately and denoted pl and pr. The cutting cost function of the i-th row is defined as:
Cost(i) = pl(i) + pr(i), i = 1, 2, ..., H (3)
Ideally, the cost function of a cutting row should be 0, that is, neither the left nor the right part of that row contains any stroke other than the central axis. In practice, however, scanning, skew correction, binarization, and other preprocessing introduce noise, and too strict a constraint on cutting rows leads to severe weak segmentation. Let T1 be the coarse-cutting threshold for Manchu components; its value is determined through a large number of experiments. Only rows that satisfy the condition
Cost(i) ≤ T1 (4)
are candidate cutting rows, and the sequence of all candidate cutting rows that satisfy condition (4) is denoted Can_seg. In the experiments used to determine T1, different multiples of baseline_width, all of them fractions no greater than 1, were tried as T1; the Manchu component cutting method was run with each, the cut images were compared, and the T1 that gave the best cutting results on Manchu word images was selected.
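The cost function (3) and the candidate test (4) can be sketched in Python as follows. The sketch assumes a binary image stored as a list of rows with stroke = 1, takes `bl` and `br` as 1-based column numbers as in the text, and leaves the threshold `t1` as a parameter because its value is not reproduced here; the thresholds in the example are hypothetical, and the function name `candidate_cut_rows` is illustrative.

```python
def candidate_cut_rows(image, bl, br, t1):
    """Return the 1-based rows whose cost pl(i) + pr(i) does not exceed t1.

    image: binary rows (stroke = 1); bl, br: 1-based axis boundary columns.
    """
    candidates = []
    for i, row in enumerate(image, start=1):
        pl = sum(row[:bl - 1])  # stroke pixels in columns 1 .. bl-1
        pr = sum(row[br:])      # stroke pixels in columns br+1 .. W
        if pl + pr <= t1:
            candidates.append(i)
    return candidates


# A 5-column toy word whose axis occupies column 3 (bl = br = 3).
word = [
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
strict = candidate_cut_rows(word, bl=3, br=3, t1=0)   # hypothetical threshold
relaxed = candidate_cut_rows(word, bl=3, br=3, t1=1)  # hypothetical threshold
```

Raising the threshold admits rows with a small amount of off-axis ink, which is the tolerance to noise that the text describes.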
The set of candidate cutting rows obtained from coarse cutting of Manchu components can exhibit the following three situations:
1) Row 1 of the image is a candidate cutting row. This is clearly an unreasonable candidate row and should be deleted from the set of candidate cutting rows;
2) A run of consecutive adjacent rows that starts at row 1 of the image, or a run of consecutive adjacent rows that ends at the last row (row H), is an unreasonable candidate subsegment, and such subsegments should be deleted from the candidate cutting set;
3) For every other subsegment of consecutive adjacent rows, only the single candidate cutting row in the middle is needed and the rest are not; the whole subsegment should therefore be replaced by its middle candidate row.
As described above, Can_seg often still contains redundant candidate cutting rows. The following strategy is therefore used to delete them:
(1) If Can_seg contains only 1 candidate cutting row and it is row 1, delete that row; otherwise go to step (2);
(2) Find each subsegment conti_subseg made up of consecutive candidate cutting rows. If the subsegment starts at row 1 or ends at row H, delete all rows of the subsegment; otherwise go to step (3);
(3) Within the subsegment conti_subseg, taken in ascending order, replace all rows of the subsegment with its median (when the number of candidate rows is even, take the average of the two middle values and round it to an integer);
(4) Output the new cutting row sequence Can_seg_new with the redundant candidate cutting rows removed.
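Rules (1) to (4) can be sketched in Python as a grouping of candidate rows into consecutive runs followed by a filter. The rounding direction for an even-length run is not stated unambiguously in the text, so the sketch rounds half up and marks this as an assumption; the function name `prune_candidates` is illustrative. Rule (1) needs no separate branch because a lone candidate at row 1 is a run that starts at row 1.

```python
def prune_candidates(can_seg, height):
    """Apply deletion rules (1)-(3) to 1-based candidate rows; return Can_seg_new."""
    rows = sorted(set(can_seg))
    # Group the candidate rows into runs of consecutive row numbers.
    runs, current = [], []
    for r in rows:
        if current and r == current[-1] + 1:
            current.append(r)
        else:
            if current:
                runs.append(current)
            current = [r]
    if current:
        runs.append(current)

    kept = []
    for run in runs:
        if run[0] == 1 or run[-1] == height:
            continue  # the run touches the top or bottom edge of the image
        mid = len(run) // 2
        if len(run) % 2:
            kept.append(run[mid])
        else:
            # Even count: mean of the two middle rows, rounded half up (assumption).
            kept.append((run[mid - 1] + run[mid] + 1) // 2)
    return kept


can_seg_new = prune_candidates([1, 2, 3, 10, 11, 12, 20, 21, 29, 30], height=30)
```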
S6.2. Weak-segmentation judgment and fine cutting of candidate segmentation regions
Manchu components obtained from coarse cutting may be weakly segmented. Statistics show that the height of a Manchu component generally does not exceed 5 times baseline_width, so the weak-segmentation decision threshold is set to T_less = 5. The height hl of each cutting region in Can_seg_new is calculated, and any cutting region with hl > (T_less × baseline_width) is judged to be a weakly segmented region. Each weakly segmented region is cut a second time using the coarse cutting method above together with the fine-cutting threshold T2, and the result is stored in the sequence Seg1. The fine-cutting threshold T2 relaxes the constraint on candidate cutting rows further, relative to coarse cutting; its value is determined through a large number of experiments. In these experiments, different multiples of baseline_width, all of them fractions no greater than 1, were tried as T2; the Manchu component cutting method was run with each, the cut images were compared, and the T2 that gave the best cutting results on Manchu word images was selected.
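The weak-segmentation test itself is a simple height comparison, sketched below in Python. The sketch assumes that cutting regions are delimited by consecutive cutting rows together with the top and bottom of the image, and that a region's height is the difference between its two bounding rows; neither convention is spelled out in the text, and the function name `weak_regions` is illustrative.

```python
def weak_regions(cut_rows, height, baseline_width, t_less=5):
    """Return (top, bottom, hl) for regions taller than t_less * baseline_width."""
    # Region boundaries: image top, every cutting row, image bottom (assumption).
    bounds = [0] + sorted(cut_rows) + [height]
    flagged = []
    for top, bottom in zip(bounds, bounds[1:]):
        hl = bottom - top
        if hl > t_less * baseline_width:
            flagged.append((top, bottom, hl))
    return flagged


# With baseline_width = 4 the limit is 20 rows, so only the 30-row region is flagged.
to_recut = weak_regions([10, 40, 55], height=60, baseline_width=4)
```

Each flagged region would then be passed back through the coarse cutting step with the looser threshold T2.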
S6.3. Over-segmentation judgment and merging of candidate segmentation regions
After coarse cutting and fine cutting, the sequence Seg1 may still contain over-segmented regions. Statistics show that the height of a Manchu component is generally greater than baseline_width, so the over-segmentation decision threshold is set to T_over = 1. The height ho of each cutting region in Seg1 is calculated, and any cutting region with ho < (T_over × baseline_width) is judged to be an over-segmented region that needs to be merged. Merging involves the following situations:
1) Counting from the top down, if the first cutting region is judged to be over-segmented, it can only be merged with the 2nd region;
2) Counting from the bottom up, if the second-to-last region is judged to be over-segmented, it can only be merged with the last region;
3) If the over-segmented region lies in the middle, both of its adjacent regions, above and below, must be considered. Calculate the height h_up of the region obtained by merging with the region above and the height h_lw of the region obtained by merging with the region below, and choose the merge that gives the smaller height;
4) If the heights after merging with the upper and lower regions are equal, so that 3) cannot decide the merge, calculate the number of connected components after merging with each of the two regions and choose the merge with fewer connected components;
5) Output the cutting rows after region merging.
The following rules are therefore used to merge over-segmented regions:
(1) If the 1st cutting region is over-segmented, merge it with the 2nd cutting region; otherwise go to step (2).
(2) If the second-to-last cutting region is over-segmented, merge it with the last cutting region; otherwise go to step (3).
(3) If the over-segmented region is neither the 1st nor the second-to-last, calculate the heights h_up and h_lw of its adjacent upper and lower cutting regions. If h_up < h_lw, merge it with the upper cutting region; if h_up > h_lw, merge it into the lower cutting region; otherwise go to step (4).
(4) If the heights of the 2 regions adjacent to the over-segmented region, above and below, are equal, calculate the numbers of connected components num_up and num_lw after merging with the upper or the lower region respectively. If num_up < num_lw, merge with the upper cutting region; if num_up > num_lw, merge with the lower cutting region.
(5) Output the cutting row sequence with the over-segmented regions merged.
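The decision made in rules (3) and (4) for an over-segmented region in the middle of the word can be sketched in Python as below. The sketch covers only the choice of neighbour, not the merge itself or the special cases of rules (1) and (2). The text does not say what happens when both the heights and the connected-component counts are equal, so the sketch returns `None` in that case; the function names are illustrative.

```python
def is_over_segmented(ho, baseline_width, t_over=1):
    """A region is over-segmented when ho < t_over * baseline_width."""
    return ho < t_over * baseline_width


def merge_direction(h_up, h_lw, num_up, num_lw):
    """Choose the neighbour to merge with: 'up', 'down', or None if undecided.

    h_up, h_lw: heights of the adjacent upper and lower cutting regions.
    num_up, num_lw: connected-component counts after merging up or down.
    """
    if h_up != h_lw:
        return "up" if h_up < h_lw else "down"
    if num_up != num_lw:
        return "up" if num_up < num_lw else "down"
    return None  # tie on both criteria: not specified in the text


choice_by_height = merge_direction(h_up=12, h_lw=20, num_up=2, num_lw=1)
choice_by_components = merge_direction(h_up=15, h_lw=15, num_up=3, num_lw=2)
```

Preferring the shorter neighbour keeps merged components compact, and the connected-component count serves only as a tie-breaker.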
The above scheme yields the Manchu component cutting results shown in Fig. 7: Fig. 7(1)-(2) show a weakly segmented region and its result after fine cutting; Fig. 7(3)-(4) show over-segmented regions and their result after merging.
The Manchu component cutting results obtained above are processed further in order to recognize the Manchu components. In addition to the cutting of the Manchu word image described above, this processing comprises the following steps:
(1) language of the Manchus component normalizes
Including the normalization of language of the Manchus component locations and size normalization.
The language of the Manchus component locations normalization be exactly by language of the Manchus image of component with stroke pixel it is most upper, most under, it is most left, most right
Pixel be boundary, cut off background parts, only remain with the part of stroke.The normalization of language of the Manchus component sizes refers to by above-mentioned warp
Cross the image normalization after place normalization be identical size (such as:The pixel of 64 pixels × 64).
(2) language of the Manchus component feature extracts
Extract the method for being usually used in minority language feature extraction respectively first, including:Contour feature, grid search-engine,
Directional element features, visual direction feature and affine not displacement feature.Then these features are merged, and use principal component analysis pair
Fusion feature carries out dimensionality reduction.
(3) Manchu component recognition
A support vector machine classifier with a Gaussian kernel function is used, and the "one-versus-rest" multi-classifier combination rule is applied to recognize a given Manchu component.
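The one-versus-rest combination rule with a Gaussian kernel can be sketched as follows. This is a minimal sketch under stated assumptions: the binary classifiers are taken as already trained, the dictionary layout of each model (support_vectors, coefs, bias) and the function names are hypothetical, and choosing the class with the largest decision value is the usual form of the rule rather than something the text spells out.

```python
import math

def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_value(x, model, gamma):
    """Decision value of one binary SVM: sum_i coef_i * K(sv_i, x) + bias."""
    total = sum(coef * gaussian_kernel(sv, x, gamma)
                for sv, coef in zip(model["support_vectors"], model["coefs"]))
    return total + model["bias"]

def predict_one_vs_rest(x, models, gamma):
    """Pick the label whose 'this class vs. the rest' SVM gives the largest value."""
    return max(models, key=lambda label: decision_value(x, models[label], gamma))
```

Here `models` maps each component label to one binary classifier, so a set of N component classes needs N classifiers.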
(4) Post-processing of Manchu component recognition
For the recognized Manchu components, the recombination from components into words is completed according to the recognition results of neighbouring components and the spelling rules of the Manchu alphabet, thereby achieving recognition of Manchu words.
Attached Table 1:
Claims (6)
1. A method for extracting the central axis of a Manchu word image, characterized by comprising the following steps:
S1. locating the central axis of the Manchu word;
S2. detecting the width of the central axis of the Manchu word.
2. The method for extracting the central axis of a Manchu word image as claimed in claim 1, characterized in that step S1 specifically comprises:
S1.1. inverting the Manchu word image so that pixels of the word part take the value 1 and pixels of the background part take the value 0;
S1.2. performing morphological thinning of the Manchu word image using the morphological thinning function of the MATLAB Image Processing Toolbox;
S1.3. applying the Hough transform to the morphologically thinned Manchu word image to determine the column coordinate corresponding to the thinned central axis, this column coordinate being taken as the position of the Manchu word central axis, wherein the angle of the straight lines searched by the Hough transform is limited to θ = 90 so that only straight lines in the vertical direction are searched, and line segments at the same longitudinal position, whose spacing is smaller than the height of the Manchu word image and whose own height is greater than 1 pixel, are connected into one straight line, thereby finding the center position of the central axis.
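When only vertical lines are searched, every foreground pixel of the thinned image votes for the vertical line through its own column, so the search reduces to a per-column vote count. The sketch below illustrates that idea only; it is a simplification that omits the thinning of S1.2 and the segment-linking and minimum-height conditions of S1.3, and its function names and tie-breaking (leftmost column wins) are assumptions.

```python
def invert_binary(image):
    """S1.1: swap 0 and 1 so that word pixels become 1 and background becomes 0."""
    return [[1 - v for v in row] for row in image]

def locate_axis_column(thinned):
    """Return the column index that collects the most foreground (1) pixels.

    Each foreground pixel casts one vote for its own column, which is what a
    line search restricted to vertical lines amounts to.
    """
    width = len(thinned[0])
    votes = [sum(row[c] for row in thinned) for c in range(width)]
    return max(range(width), key=lambda c: votes[c])  # leftmost column on ties
```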
3. The method for extracting the central axis of a Manchu word image as claimed in claim 1, characterized in that step S2 specifically comprises:
S2.1. determining the search region of the maximum run-length ratio method;
S2.2. applying the maximum run-length ratio method to the Manchu word image within the search region to determine the width of the central axis of the Manchu word image;
S2.3. calculating the left boundary and the right boundary of the central axis from the center position of the central axis of the Manchu word image and the width of the central axis.
4. The method for extracting the central axis of a Manchu word image as claimed in claim 3, characterized in that step S2.1 is specifically:
the search region of the maximum run-length ratio method is determined by the range defined by the following formula:
wherein sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, round denotes rounding to the nearest integer, and W is the width of the Manchu word image.
5. The method for extracting the central axis of a Manchu word image as claimed in claim 3, characterized in that the maximum run-length ratio method of step S2.2 comprises: scanning every row of the search region of the Manchu word image and counting the run lengths of consecutive black pixels and the number of times each length occurs; the run length with the largest number of occurrences is the width of the central axis of the Manchu word.
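The run-length count described in claim 5 can be sketched as follows. It is an illustration under stated assumptions: stroke ("black") pixels are represented by the value 1, the search boundaries sl and sr are taken as given inclusive column indices, the function name is hypothetical, and ties between equally frequent run lengths are resolved toward the shorter length, which the claim does not specify.

```python
from collections import Counter

def axis_width(image, sl, sr):
    """Most frequent horizontal run length of stroke pixels within columns sl..sr."""
    counts = Counter()
    for row in image:
        run = 0
        for v in row[sl:sr + 1]:
            if v == 1:
                run += 1          # extend the current run
            elif run:
                counts[run] += 1  # a run just ended
                run = 0
        if run:
            counts[run] += 1      # run reaching the right edge of the region
    if not counts:
        return 0                  # no stroke pixels in the search region
    best = max(counts.values())
    # Tie-break toward the shorter run length (assumption).
    return min(length for length, n in counts.items() if n == best)
```

Rows that cross only the central axis contribute runs equal to its width, while rows crossing side strokes contribute longer or extra runs, so the most frequent length is taken as the axis width.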
6. The method for extracting the central axis of a Manchu word image as claimed in claim 3, characterized in that the left boundary and the right boundary of the central axis in step S2.3 are calculated by the following formula:
wherein bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the center position of the central axis of the Manchu word image, baseline_width is the width of the central axis of the Manchu word image, and round denotes rounding to the nearest integer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810371803.6A CN108564078B (en) | 2018-04-24 | 2018-04-24 | Method for extracting axle wire of Manchu word image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108564078A (en) | 2018-09-21 |
CN108564078B (en) | 2020-11-13 |
Family
ID=63536492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810371803.6A Active CN108564078B (en) | 2018-04-24 | 2018-04-24 | Method for extracting axle wire of Manchu word image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108564078B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5105471A (en) * | 1990-02-14 | 1992-04-14 | Brother Kogyo Kabushiki Kaisha | Apparatus for converting character outline data into dot data, having means for correcting reproduction width of strokes of rotated or italicized characters |
CN101017533A (en) * | 2007-03-09 | 2007-08-15 | 清华大学 | Recognition method of printed mongolian character |
CN101025791A (en) * | 2007-04-06 | 2007-08-29 | 清华大学 | Printed Monggol language text segmentation method |
CN102982328A (en) * | 2011-08-03 | 2013-03-20 | 夏普株式会社 | Character recognition apparatus and character recognition method |
CN105279506A (en) * | 2015-09-29 | 2016-01-27 | 大连民族大学 | Manchu script central axis positioning method |
Non-Patent Citations (5)
Title |
---|
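Soumen Bag et al.: "An improved contour-based thinning method for character images", Pattern Recognition Letters * |
李志敏: "Research on Spam Identification and Processing Technology", 31 December 2015 * |
赵珀璋 et al.: "Central Radio and TV University Continuing Education Textbook: Chinese Information Processing Technology", 31 May 1990 * |
魏宏喜: "Research on Key Technologies in Printed Mongolian Character Recognition", China Excellent Doctoral and Master's Theses Full-text Database (Master's) * |
魏宏喜: "Selection of Mongolian Character Features in Printed Mongolian Character Recognition", Journal of Inner Mongolia University (Natural Science Edition) * |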
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115331232A (en) * | 2022-07-08 | 2022-11-11 | 黑龙江省科学院智能制造研究所 | Manchu historical document image column segmentation method |
CN115331232B (en) * | 2022-07-08 | 2023-08-18 | 黑龙江省科学院智能制造研究所 | Method for segmenting image columns of Manchu historical documents |
Also Published As
Publication number | Publication date |
---|---|
CN108564078B (en) | 2020-11-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |