CN108564078A

CN108564078A - Method for extracting the central axis of a Manchu word image

Info

Publication number: CN108564078A
Application number: CN201810371803.6A
Authority: CN
Inventors: 郑蕊蕊; 李敏; 贺建军; 许爽; 吴宝春; 卢海涛
Original assignee: Dalian Nationalities University
Current assignee: Dalian Minzu University
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2018-09-21
Anticipated expiration: 2038-04-24
Also published as: CN108564078B

Abstract

A method for extracting the central axis of a Manchu word image, belonging to the field of character segmentation and intended to improve the accuracy of Manchu segmentation. The technical essentials are: locate the central axis of the Manchu word image, then detect the width of that axis. The effect is as follows: central-axis detection on a Manchu word image directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of axis detection. By locating the central axis and detecting its width, the axis can be extracted accurately.

Description

Method for extracting the central axis of a Manchu word image

Technical field

The invention belongs to the field of character segmentation and relates to a method for extracting the central axis of a Manchu word image.

Background art

Manchu is the spoken and written language used by ethnic groups of China such as the Manchu and the Xibe. It was promoted and used as the official script during the Qing Dynasty, leaving a large body of valuable Manchu literature. Because the Manchu language is now on the verge of extinction, the rescue and protection of Manchu linguistic and cultural heritage is urgent and has gained the recognition and attention of the state and of all sectors of society. Research on Manchu optical character recognition is therefore particularly important for protecting and passing on the cultural heritage of the Qing Dynasty. Manchu is a phonemic script with 38 letters: 6 vowels, 22 consonants and 10 special letters dedicated to transcribing Chinese loanwords. Manchu is written with the letters of a word running from top to bottom and with columns running from left to right. Manchu recognition generally requires first segmenting the text into basic units (such as letters) and then recognizing them, so recognition accuracy can be improved by improving segmentation accuracy.

Summary of the invention

To solve the problem of improving Manchu segmentation accuracy, the invention proposes the following technical solution:

A method for extracting the central axis of a Manchu word image, comprising the following steps:

S1. Locate the central axis of the Manchu word image;

S2. Detect the width of the central axis of the Manchu word image.

As a supplement to the technical solution, step S1 specifically comprises:

S1.1. Invert the Manchu word image so that stroke pixels take the value 1 and background pixels take the value 0;

S1.2. Thin the Manchu word image morphologically using the morphological thinning function of the MATLAB Image Processing Toolbox;

S1.3. On the thinned Manchu word image, use the Hough transform to determine the column coordinate of the thinned central axis; this column coordinate is the position of the central axis of the Manchu word image. The angle searched by the Hough transform is restricted to θ = 90°, so that only vertical lines are searched; segments lying on the same vertical line whose spacing is smaller than the height of the Manchu word image and whose length exceeds 1 pixel are joined into a single line, which gives the center position of the central axis.

As a supplement to the technical solution, step S2 specifically comprises:

S2.1. Determine the search region of the maximum run-length ratio method;

S2.2. Apply the maximum run-length ratio method to the Manchu word image within the search region to determine the width of the central axis of the Manchu word image;

S2.3. Calculate the left and right boundaries of the central axis from the center position and the width of the central axis of the Manchu word image.

As a supplement to the technical solution, step S2.1 is specifically:

The search region of the maximum run-length ratio method is restricted to the range defined by the following formula:

where sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, round denotes rounding to the nearest integer, and W is the width of the Manchu word image.

As a supplement to the technical solution, the maximum run-length ratio method of step S2.2 comprises: scanning each row of the Manchu word image within the search region and counting the run lengths of consecutive black pixels and the number of times each length occurs; the run length with the largest number of occurrences is the width of the central axis of the Manchu word image.

As a supplement to the technical solution, the left and right boundaries of the central axis in step S2.3 are calculated by the following formula:

where bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the center position of the central axis of the Manchu word image, baseline_width is the width of the central axis of the Manchu word image, and round denotes rounding to the nearest integer.

Beneficial effects: central-axis detection on a Manchu word image directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of axis detection. By locating the central axis and detecting its width, the axis can be extracted accurately.

Brief description of the drawings

Fig. 1 is a flowchart of the construction of the Manchu component set;

Fig. 2 is a flowchart of the Manchu component segmentation process;

Fig. 3 shows examples of central-axis detection errors produced by conventional methods on Manchu word images;

Fig. 4 illustrates determining the Manchu axis width with the region-limited maximum run-length ratio method, where: (1) is an error example of the maximum run-length ratio method, (2) shows the search range limited by the invention, and (3) shows the result of the method of the invention;

Fig. 5 shows central-axis detection results of the method of the invention;

Fig. 6 is a flowchart of the Manchu component cutting step;

Fig. 7 shows Manchu component segmentation results, where: (1) shows under-segmentation, (2) shows the under-segmented region after fine segmentation, (3) shows over-segmentation, (4) shows the over-segmented region after merging, and (5) shows partial segmentation results.

Detailed description of the embodiments

From the standpoint of optical character recognition, Manchu has the following characteristics:

(1) Depending on its position in a word, the same Manchu letter generally takes 4 different forms: isolated, initial, medial and final. The letters in their different forms total 114.

(2) Words in the same column of a Manchu document all lie near the same central axis, and in printed Manchu the words of two adjacent columns essentially never overlap, which favors column extraction. Words within the same column are separated by a certain gap, which favors word extraction.

(3) A Manchu word consists of one or more Manchu letters joined along a vertical central axis, with no gap between the letters inside a word. The joins between letters, however, lie on the central axis of the word image, so the pixel features at the central axis can be used to segment Manchu letters.

(4) Some Manchu letters exhibit the "one shape, several letters" phenomenon. For example, the character [glyph] is at once the medial form of the letters a, e and n; these can be distinguished during recognition by the spelling rules of adjacent letters.

(5) Some Manchu letters share identical constituent parts. For example, the character [glyph] (initial form of the letter o) can be regarded as composed of the character [glyph] (initial form of the letter e) and the character [glyph] (medial form of the letter o). Using the Manchu letter as the basic segmentation unit is therefore prone to over-segmentation and under-segmentation.

(6) Some letter combinations are inseparable. For example, it is extremely difficult to cut [glyph] (bo) into [glyph] (letter b) and [glyph] (letter o).

Based on these characteristics, this embodiment proposes deconstructing Manchu words into components, taking the Manchu component (hereinafter "component") as the basic unit of segmentation and recognition. This resolves the over-segmentation and under-segmentation caused by using the Manchu letter as the basic segmentation unit. The Manchu component set draws on 3 sources: Manchu letters, parts of letters or letter combinations, and letter combinations. The purpose of building the component set is to reduce recognition errors introduced by segmentation. As analyzed above, if the letter is the basic segmentation unit, over-segmentation and under-segmentation occur easily, and the classifier subsequently used to recognize letters will inevitably misrecognize the over-segmented and under-segmented parts, or fail to recognize them at all. The component set proposed by the invention, by contrast, is designed around the results of the segmentation method: common over-segmentation (a part of a letter or letter combination) and under-segmentation (a letter combination) are no longer regarded as errors but as correct segmentations, so the subsequently designed classifier can recognize these components, which reduces recognition errors caused by segmentation errors.

The notion of a Manchu component can be understood by analogy with recognizing an English word. Taking the word "study" as an example, the whole word can be recognized directly; alternatively, the word can be segmented into the letters s, t, u, d and y, which are recognized individually and then combined into "study". If cutting into letters is hard to achieve but cutting into components is relatively easy, for example into st, u and dy (where st, u and dy are components), the components are recognized and then combined into the word. Because of the characteristics listed above, however, cutting Manchu into components is not as easy as in this English example.

As shown in Fig. 1, the Manchu component set is built as follows. With reference to the Manchu alphabet table, the National Standard of the People's Republic of China "Information technology - Universal multiple-octet coded character set - Xibe and Manchu typefaces", and the Mongolian component sets in documents [1-2], an initial Manchu component set containing 99 initial components (hereinafter the "initial set") is proposed, and Flag = 0 is set for every component. Manchu word images are segmented with the Manchu segmentation method, and the segmentation results are counted and analyzed: if a segmented component does not belong to the initial set, it is added to the initial set with Flag = 1; if it does belong to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components with Flag = 0, that is, components that never appeared in any segmentation result; any such component is deleted from the initial set. The Manchu component set is then collated and output. It contains 106 components in total; see Attached Table 1 for details. Documents [1-2] referred to above are:

[1] Hongxi Wei, Guanglai Gao. A keyword retrieval system for historical Mongolian document images [J]. International Journal on Document Analysis and Recognition, 2014, 17(1): 33-45.

[2] Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao. Multi-font printed Mongolian document recognition system [J]. International Journal on Document Analysis and Recognition, 2010, 13(2): 93-106.
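The flag-based update of the component set described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each component is represented by a hashable label (a hypothetical simplification, since real components are glyph images that must first be matched against the set), and the function name is invented for the example.

```python
def build_component_set(initial_components, segmented_components):
    """Return the component set after the flag-based update.

    Components are assumed to be hashable labels (hypothetical stand-ins
    for glyph images).
    """
    # Every component of the initial set starts with Flag = 0.
    flags = {component: 0 for component in initial_components}
    for component in segmented_components:
        # A component missing from the set is added, an existing one is
        # marked as seen; both cases end with Flag = 1.
        flags[component] = 1
    # Components whose flag is still 0 never occurred in a segmentation
    # result and are dropped.
    return [component for component, flag in flags.items() if flag == 1]


initial = ["a", "b", "c"]
observed = ["b", "d", "b", "a"]
print(build_component_set(initial, observed))
```

In this toy run, "c" is removed because it never occurs in the segmentation results, and "d" is added because it occurs without being in the initial set.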

As shown in Fig. 2, the steps of Manchu component segmentation are as follows:

S1. Convert the paper Manchu document, by means of a photoelectric conversion device, into a digital image document that a computer can store and process, and preprocess the digital image of the Manchu document (smoothing, binarization);

S2. Layout analysis (skew correction, column segmentation and word segmentation);

S3. Extract the Manchu word images;

S4. Position normalization;

S5. Central-axis detection;

S6. Segment the Manchu components according to the relationship between the Manchu components and the axis position.

Skew correction uses the Hough transform to determine the skew angle of the page and then rotates the image back to upright text. The skew-corrected Manchu document is segmented into columns by vertical projection; words are segmented by horizontal projection, the Manchu words in each column image are extracted, and each Manchu word image is then position-normalized. These steps complete the preprocessing of the Manchu word image, which has height H and width W. Note that position normalization of a Manchu word image means cropping away its superfluous white background border. The flow shown in Fig. 2 inverts the image for programming convenience, so what appears to be removed there is a black border. The white-on-black images in Fig. 2 are these inverted images. The original images are black text on a white background, but because removing the border is easier to program on a white-on-black image, the figure directly shows the inverted image with the black border removed.
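The projection-based column and word segmentation can be illustrated with a short sketch. It assumes a binary image stored as a list of rows with stroke pixels equal to 1, and it splits wherever the projection profile drops to zero; the function names and the zero-gap criterion are assumptions made for illustration, since the text does not specify how gaps are thresholded.

```python
def vertical_projection(image):
    """Number of stroke pixels in each column."""
    return [sum(column) for column in zip(*image)]


def horizontal_projection(image):
    """Number of stroke pixels in each row."""
    return [sum(row) for row in image]


def nonzero_runs(profile):
    """Inclusive (start, end) index pairs of runs where the profile is > 0."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1],
]
# Columns of text are separated by empty image columns.
columns = nonzero_runs(vertical_projection(page))
# Words inside the first text column are separated by empty rows.
left, right = columns[0]
strip = [row[left:right + 1] for row in page]
words = nonzero_runs(horizontal_projection(strip))
print(columns, words)
```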

In this embodiment, central-axis detection on the Manchu word image directly affects segmentation accuracy; its specific scheme is described in detail below.

For central-axis detection on Manchu word images, i.e. step S5, the prior art generally uses the vertical projection method and the maximum cumulative vertical projection method. Both methods, however, suffer from axis localization offsets and wrong axis-width estimates, as shown in Fig. 3. This embodiment provides a method for extracting the central axis of a Manchu word image, comprising the following steps:

S5.1. Locating the central axis of the Manchu word image:

First, the Manchu word image is inverted so that stroke pixels take the value 1 and background pixels take the value 0. The morphological thinning function of the MATLAB Image Processing Toolbox is then applied. It uses 3 × 3 structuring-element templates; each template contains 9 pixels and each pixel can only be 0 or 1, so there are 512 distinct templates, which are divided into 8 directions to thin the Manchu word image morphologically. On the thinned image, the Hough transform determines the column coordinate of the thinned central axis, which is the position of the central axis of the Manchu word image. During axis extraction the angle searched by the Hough transform is restricted to θ = 90°, i.e. only vertical lines are searched; segments lying on the same vertical line whose spacing is smaller than the word image height H and whose length exceeds 1 pixel are joined into a single line. This yields the center position of the central axis, denoted baseline. The central axis of a Manchu word image refers to the column coordinate, within the image, of the central axis of the Manchu word, not to the center line of the image itself.
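When the Hough transform is restricted to a single angle corresponding to vertical lines, its accumulator has one cell per image column, so voting reduces to counting skeleton pixels column by column. The sketch below illustrates only that reduced case. It assumes the image has already been thinned (thinning itself is not implemented here), uses 0-based column indices, and ignores the segment-joining parameters, which is harmless under the stated gap limit of H because all segments in one column are joined anyway. The function name is hypothetical.

```python
def locate_axis_column(skeleton):
    """Return the 0-based column holding the strongest vertical line.

    `skeleton` is a thinned binary image (list of rows, stroke pixels = 1).
    """
    width = len(skeleton[0])
    votes = [0] * width  # one accumulator cell per column
    for row in skeleton:
        for x, pixel in enumerate(row):
            if pixel:
                votes[x] += 1
    # The column with the most votes is taken as the axis position.
    return max(range(width), key=votes.__getitem__)


skeleton = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
baseline = locate_axis_column(skeleton)
print(baseline)
```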

S5.2. Detecting the width of the central axis of the Manchu word image

S5.2.1. The maximum run-length ratio method for axis width: first scan each row of the Manchu word image and count the run lengths of consecutive black pixels and the number of times each length occurs. After all rows have been scanned in turn, the run length with the largest number of occurrences is the width of the central axis of the Manchu word image, denoted w₀. The method is effective for detecting the axis width of Manchu word images but still produces errors such as the one shown in Fig. 4(1). The cause is that the method gathers run statistics over the entire Manchu word image, and the deformations of different Manchu typefaces severely distort these global statistics. Statistics on Manchu writing show that the axis width generally does not exceed 1/2 of the word width W. The search region of the method is therefore restricted to the range specified by formula (1); the result is called the region-limited maximum run-length ratio method.

In formula (1), sl is the left boundary of the limited search range, sr is the right boundary of the limited search range, baseline is the center position of the central axis, and round denotes rounding to the nearest integer. Limiting the search region weakens the influence of detached strokes and branch strokes of Manchu on the axis-width statistics. The maximum run-length ratio method is then applied to the Manchu word image within the limited search range to complete axis-width detection; the result is shown in Fig. 4(3).
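The effect of limiting the search region can be seen in a small sketch of the run-length statistic. It assumes stroke pixels are stored as 1 and takes the window bounds sl and sr as plain 0-based, inclusive parameters; how they are derived from baseline and W is deliberately left out, because formula (1) is not reproduced in this text. The function name is hypothetical.

```python
from collections import Counter


def axis_width(image, sl=None, sr=None):
    """Most frequent horizontal run length of stroke pixels (value 1).

    Only columns sl..sr (0-based, inclusive) are scanned; by default the
    whole row is used, which corresponds to the unrestricted method.
    """
    width = len(image[0])
    sl = 0 if sl is None else max(sl, 0)
    sr = width - 1 if sr is None else min(sr, width - 1)
    counts = Counter()
    for row in image:
        run = 0
        for pixel in row[sl:sr + 1]:
            if pixel:
                run += 1
            elif run:
                counts[run] += 1  # a run just ended
                run = 0
        if run:
            counts[run] += 1  # run reaching the window edge
    return counts.most_common(1)[0][0] if counts else 0


word = [
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
]
print(axis_width(word))        # whole image: thin side strokes dominate
print(axis_width(word, 3, 6))  # window around the axis
```

In this toy image the axis is 2 pixels wide, but one-pixel side strokes occur more often than axis runs, so the unrestricted statistic picks the wrong width while the windowed one recovers it.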

S5.2.2. From the axis width baseline_width and the axis center position baseline, calculate the left boundary bl and the right boundary br of the central axis according to formula (2).

A total of 400 Manchu images in different typefaces and font sizes were selected at random, and the central axis was extracted with the region-limited maximum run-length ratio method of this embodiment and with the vertical projection method; the results are shown in Table 1. Some examples of axes correctly extracted by the method of the invention are shown in Fig. 5. The experimental results show that morphological thinning combined with the Hough transform accurately locates the axis position of a Manchu word image, and that the region-limited maximum run-length ratio method correctly determines the width of the central axis of a Manchu word image.

Table 1. Statistics of central-axis detection results on Manchu word images

	Method of the invention	Vertical projection method
Correct samples	397	210
Incorrect samples	3	190
Accuracy	99.25%	52.50%

In this embodiment, the accuracy of Manchu character segmentation is the bottleneck for improving Manchu recognition accuracy; its specific scheme is described in detail below.

Manchu component segmentation, i.e. step S6, comprises, as shown in Fig. 6:

S6.1. Coarse segmentation of Manchu components;

S6.2. Under-segmentation decision and fine segmentation of candidate segmentation regions;

S6.3. Over-segmentation decision and merging of candidate segmentation regions.

These steps are explained below:

S6.1. Coarse segmentation of Manchu components

Since Manchu components are joined by the central axis, the Manchu word is first divided, about the central axis, into left, middle and right parts. The left part spans columns 1 to bl-1 of the Manchu word, and the right part spans columns br+1 to W. Horizontal projections of the left part and the right part are computed and denoted pl and pr respectively. The segmentation cost function of row i is defined as:

Cost(i) = pl(i) + pr(i), i = 1, 2, ..., H (3)

Ideally the cost of a segmentation row is 0, i.e. neither the left part nor the right part has any stroke in that row other than the central axis. In practice, however, noise introduced by preprocessing such as scanning, skew correction and binarization means that too strict a constraint on segmentation rows leads to severe under-segmentation. Let T1 be the coarse segmentation threshold for Manchu components; its value, determined through a large number of experiments, is [value missing in source]. Only rows that satisfy the condition:

Cost(i) ≤ T1 (4)

are candidate segmentation rows, and the sequence of all candidate segmentation rows satisfying condition (4) is denoted Can_seg. The experiments that determined T1 took different multiples of baseline_width as T1, all of them fractions no greater than 1, ran the Manchu component segmentation method, compared the segmented images, and selected the T1 that gave the better segmentation of Manchu word images; the final choice is the T1 value given above.
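The cost function (3) and condition (4) can be sketched as follows. The sketch keeps the 1-based row and column numbering of the text, assumes stroke pixels are stored as 1, and takes the threshold as a plain parameter `t1`, since the experimentally chosen value is not given here. The function name is hypothetical.

```python
def candidate_rows(image, bl, br, t1):
    """Return the 1-based rows i with Cost(i) = pl(i) + pr(i) <= t1.

    bl and br are the 1-based left and right boundary columns of the axis.
    """
    candidates = []
    for i, row in enumerate(image, start=1):
        pl = sum(row[:bl - 1])  # stroke pixels in columns 1 .. bl-1
        pr = sum(row[br:])      # stroke pixels in columns br+1 .. W
        if pl + pr <= t1:
            candidates.append(i)
    return candidates


word = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
# The axis occupies columns 3 and 4 (1-based), so bl = 3 and br = 4.
print(candidate_rows(word, bl=3, br=4, t1=0))
print(candidate_rows(word, bl=3, br=4, t1=1))
```

Raising the threshold from 0 to 1 admits row 4, which carries a single off-axis pixel; this is the kind of noise tolerance the threshold is meant to provide.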

The candidate segmentation row set obtained from coarse segmentation exhibits the following three situations:

1) Row 1 of the image is a candidate segmentation row. This is clearly an unreasonable candidate and should be deleted from the candidate set;

2) A run of consecutive adjacent rows that starts at row 1 of the image, or that ends at the last row (row H), is an unreasonable candidate sub-segment and should be deleted from the candidate set;

3) For sub-segments of consecutive adjacent rows other than those in 2), only the candidate row at the middle position is needed and the rest are not; the middle candidate row therefore replaces the whole sub-segment.

As the above shows, Can_seg often contains redundant candidate segmentation rows. They are deleted using the following strategy:

(1) If Can_seg contains only 1 candidate row and it is row 1, delete it; otherwise go to step (2);

(2) Find each sub-segment conti_subseg of consecutive candidate rows; if the sub-segment starts at row 1 or ends at row H, delete all rows of the sub-segment; otherwise go to step (3);

(3) Within a consecutive candidate sub-segment conti_subseg, sort the rows in ascending order and replace all rows of the sub-segment by the median (for an even number of candidate rows, take the average of the two middle values and round up);

(4) Output the new segmentation row sequence Can_seg_new with the redundant candidate rows removed.
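The pruning strategy above can be sketched in a few lines. The sketch treats an isolated candidate row as a sub-segment of length 1, which also covers rule (1), and it uses the fact that the median of a run of consecutive integers is the mean of its first and last element. Row numbers are 1-based as in the text; the function name is hypothetical.

```python
import math


def prune_candidates(can_seg, height):
    """Reduce candidate rows (1-based) to one row per consecutive run.

    Runs that start at row 1 or end at row `height` are discarded; every
    other run is replaced by its median, rounded up for even lengths.
    """
    runs, current = [], []
    for r in sorted(set(can_seg)):
        if current and r != current[-1] + 1:
            runs.append(current)
            current = []
        current.append(r)
    if current:
        runs.append(current)

    kept = []
    for run in runs:
        if run[0] == 1 or run[-1] == height:
            continue  # touches the top or bottom edge of the image
        # Median of consecutive integers = (first + last) / 2.
        kept.append(math.ceil((run[0] + run[-1]) / 2))
    return kept


print(prune_candidates([1, 2, 3, 10, 11, 12, 20, 21, 29, 30], height=30))
```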

S6.2. Under-segmentation decision and fine segmentation of candidate segmentation regions

Manchu components obtained by coarse segmentation may be under-segmented. Statistics show that the height of a Manchu component generally does not exceed 5 times baseline_width, so the under-segmentation decision threshold is set to T_less = 5. The height hl of every segmentation region in Can_seg_new is computed, and a region with hl > (T_less × baseline_width) is judged to be under-segmented. Each under-segmented region is segmented a second time using the coarse segmentation method above with the fine segmentation threshold T2, and the result is stored in the sequence Seg1. The fine segmentation threshold T2 relaxes the constraint on candidate rows relative to coarse segmentation; its value, determined through a large number of experiments, is [value missing in source]. The experiments that determined T2 took different multiples of baseline_width as T2, all of them fractions no greater than 1, ran the Manchu component segmentation method, compared the segmented images, and selected the T2 that gave the better segmentation of Manchu word images; the final choice is the T2 value given above.
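The height test for under-segmentation can be sketched as below. The conversion from cut rows to regions assigns each cut row to the region above it, which is an assumption of this sketch rather than something the text states; function names are hypothetical and rows are 1-based.

```python
def regions_from_cuts(cut_rows, height):
    """Split rows 1..height into inclusive (top, bottom) regions.

    Assumption: a cut row is the last row of the region above it.
    """
    regions, top = [], 1
    for cut in sorted(cut_rows):
        regions.append((top, cut))
        top = cut + 1
    regions.append((top, height))
    return regions


def under_segmented(regions, baseline_width, t_less=5):
    """Regions whose height exceeds t_less times the axis width."""
    limit = t_less * baseline_width
    return [(top, bottom) for top, bottom in regions
            if bottom - top + 1 > limit]


regions = regions_from_cuts([11, 21], height=60)
print(regions)
print(under_segmented(regions, baseline_width=4))
```

Regions returned by `under_segmented` are the ones that would be passed through the cost-function step again with the more permissive threshold T2.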

S6.3. Over-segmentation decision and merging of candidate segmentation regions

After coarse and fine segmentation, the sequence Seg1 may still contain over-segmented regions. Statistics show that the height of a Manchu component is generally greater than baseline_width, so the over-segmentation decision threshold is set to T_over = 1. The height ho of every segmentation region in Seg1 is computed, and a region with ho < (T_over × baseline_width) is judged to be over-segmented and must be merged. Merging involves the following cases:

1) Counting from the top, if the first region is judged over-segmented, it can only be merged with the 2nd region;

2) Counting from the bottom, if the second-to-last region is judged over-segmented, it can only be merged with the last region;

3) If the over-segmented region lies in the middle, its two adjacent regions above and below must be considered. The height h_up of the region obtained by merging with the region above and the height h_lw of the region obtained by merging with the region below are computed, and the merge with the smaller resulting height is chosen;

4) If the heights after merging with the upper and lower regions are equal, i.e. the merge cannot be decided by 3), the numbers of connected components after merging with the upper and with the lower region are computed, and the merge with fewer connected components is chosen;

5) Output the segmentation rows after region merging.

Accordingly, over-segmented regions are merged by the following rules:

(1) If the 1st segmentation region is over-segmented, merge it with the 2nd segmentation region; otherwise go to step (2);

(2) If the second-to-last segmentation region is over-segmented, merge it with the last segmentation region; otherwise go to step (3);

(3) If the over-segmented region is neither the 1st nor the second-to-last, compute the heights h_up and h_lw of its adjacent upper and lower segmentation regions. If h_up < h_lw, merge it with the upper segmentation region; if h_up > h_lw, merge it with the lower segmentation region; otherwise go to step (4);

(4) If the heights of the upper and lower adjacent regions of the over-segmented region are equal, compute the numbers of connected components num_up and num_lw after merging with the upper and with the lower region respectively. If num_up < num_lw, merge with the upper segmentation region; if num_up > num_lw, merge with the lower segmentation region;

(5) Output the segmentation row sequence with the over-segmented regions merged.
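A compact sketch of these merging rules is given below, under several stated assumptions. Regions are inclusive (top, bottom) row pairs; a region at either end of the sequence is treated as having a single neighbour to merge with, which is one reading of rules (1) and (2); `count_components` is a hypothetical callback that returns the number of connected components in a row range of the word image; and when the tie in rule (4) cannot be resolved, the sketch falls back to the upper neighbour, a case the text leaves open.

```python
def merge_over_segmented(regions, baseline_width, t_over=1,
                         count_components=None):
    """Merge regions shorter than t_over * baseline_width into a neighbour."""
    regions = list(regions)

    def height(region):
        return region[1] - region[0] + 1

    i = 0
    while len(regions) > 1 and i < len(regions):
        if height(regions[i]) >= t_over * baseline_width:
            i += 1
            continue
        if i == 0:
            j = 1                      # only a lower neighbour exists
        elif i == len(regions) - 1:
            j = i - 1                  # only an upper neighbour exists
        else:
            h_up, h_lw = height(regions[i - 1]), height(regions[i + 1])
            if h_up != h_lw:
                j = i - 1 if h_up < h_lw else i + 1
            else:
                j = i - 1              # assumed default for unresolved ties
                if count_components is not None:
                    num_up = count_components(regions[i - 1][0], regions[i][1])
                    num_lw = count_components(regions[i][0], regions[i + 1][1])
                    if num_up > num_lw:
                        j = i + 1
        lo, hi = min(i, j), max(i, j)
        regions[lo:hi + 1] = [(regions[lo][0], regions[hi][1])]
        i = lo                         # re-check the merged region
    return regions


seg1 = [(1, 2), (3, 20), (21, 22), (23, 30), (31, 50)]
print(merge_over_segmented(seg1, baseline_width=4))
```

In the example, the 2-row region at the top joins its only neighbour, and the 2-row region in the middle joins the shorter of its two neighbours, following the comparison of h_up and h_lw.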

The above scheme yields the Manchu component segmentation result shown in Fig. 7: Fig. 7(1)-(2) show the result of fine segmentation of an under-segmented region; Fig. 7(3)-(4) show the result of merging an over-segmented region.

The Manchu component segmentation result obtained above is processed further in order to recognize the Manchu components. Besides the segmentation of Manchu word images described above, the recognition method further comprises the following steps:

(1) Manchu component normalization

This comprises position normalization and size normalization of Manchu components.

Position normalization takes the topmost, bottommost, leftmost and rightmost stroke pixels of the Manchu component image as the boundary and crops away the background, keeping only the part that contains strokes. Size normalization scales the position-normalized images to an identical size (for example, 64 × 64 pixels).
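Both normalization steps can be sketched on a binary image stored as a list of rows with stroke pixels equal to 1. Cropping to the bounding box follows the description directly; the resizing step uses nearest-neighbour sampling, which is an assumption of this sketch because the text does not name an interpolation method. Function names are hypothetical.

```python
def crop_to_strokes(image):
    """Crop a binary image to the bounding box of its stroke pixels."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x, col in enumerate(zip(*image)) if any(col)]
    if not rows:
        return []  # no stroke pixels at all
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def resize_nearest(image, size=64):
    """Scale to size x size by nearest-neighbour sampling (assumed method)."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


component = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
cropped = crop_to_strokes(component)
normalized = resize_nearest(cropped, size=4)  # 64 in the text; 4 keeps the demo small
print(cropped)
print(normalized)
```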

(2) Manchu component feature extraction

First, features commonly used for feature extraction in minority scripts are extracted separately: contour features, grid features, directional element features, visual direction features and affine invariant moment features. These features are then fused, and the fused feature is reduced in dimension by principal component analysis.

(3) Manchu component recognition

A support vector machine classifier with a Gaussian kernel is used, with the "one-versus-rest" multi-class combination rule, to recognize a given Manchu component.

(4) Manchu component recognition post-processing

For the recognized Manchu components, the recombination from components to words is completed according to the recognition results of neighboring components and the spelling rules of Manchu letters, thereby recognizing Manchu words.

Attached Table 1:

Claims

1. A method for extracting the central axis of a Manchu word image, characterized by comprising the following steps:

S1. locating the central axis of the Manchu word;

S2. detecting the width of the central axis of the Manchu word.

2. the method for extraction language of the Manchus word image central axes as described in claim 1, which is characterized in that the step S1 is specific Including：

S1.2. the morphologic thinning function of MATLAB image processing toolboxes is used to realize that the morphology of language of the Manchus word image is thin Change；

S1.3. to the language of the Manchus word image after morphologic thinning, using Hough transformation to determine corresponding to the central axes of refinement Row coordinate, the position of the row coordinate as language of the Manchus word central axes, wherein limit Hough transformation search straight line angle as θ= 90, the straight line of vertical direction is only searched, and connects identical lengthwise position, is smaller than language of the Manchus word image height and from height Degree is straight line more than the straight line of 1 pixel, finds out the center of central axes.

3. the method for extraction language of the Manchus word image central axes as described in claim 1, which is characterized in that the step S2 is specific Including：

S2.1. the region of search of maximum run rule of three is determined；

S2.2. maximum run rule of three in region of search is imposed to language of the Manchus word image and determines language of the Manchus word image central axes Width；

S2.3. left margin and the right of central axes are calculated by the width of the center of language of the Manchus word image central axes and central axes Boundary.

4. the method for extraction language of the Manchus word image central axes as claimed in claim 3, which is characterized in that the step S2.1 tools Body is：

Wherein, sl is the left margin of the search range limited, and sr is the right margin of the search range limited, and baseline is axis The center of line, round indicate that, to nearest integer rounding, W is the width of language of the Manchus word image.

5. the method for extraction language of the Manchus word image central axes as claimed in claim 3, which is characterized in that the step S2.2's The step of maximum run rule of three：Every a line of language of the Manchus word image region of search is scanned, and counts the trip of continuous black picture element The number that Cheng Changdu and the length occur, then it is exactly the width of language of the Manchus word central axes to have the run length of maximum occurrence number Degree.

6. the method for extraction language of the Manchus word image central axes as claimed in claim 3, which is characterized in that described in being calculated by following formula The left margin and right margin of the central axes of step S2.3；

Wherein：Bl is the left margin of central axes, and br is the right margin of central axes, and baseline is language of the Manchus word image central axes Center, baseline_width are the width of language of the Manchus word image central axes, and round is indicated to nearest integer rounding.